Open grahamgower opened 2 years ago
Maybe we could construct a Markov model to describe this, from which we could easily simulate read depths?
I agree, simGL
right now assumes that each site is totally independent from the rest of sites. But of course, in a real set up, coverage should be correlated across nearby sites. I think this is an ok assumption for distant sites since it makes things very easy. However, depending on the SNP density, I agree that we should incorporate some sort of correlation across sites.
I think we could simulate N reads of length L (normally distributed) that satisfy a mean coverage X along the genome of length G such that X = G/NL. Then, we can randomly place those reads along the genome defining a starting position per read. Finally, we can check the number of reads overlapping per site. I expect that this simulation is going to be slower to produce than the Poisson random sample that is implemented right now in sim_allelereadcount()
.
But again, this raises other questions such as error rates might be also correlated somehow and different platforms might have different error profiles... and I guess the gold question here is how accurate do we want to go to simulate GL?
I opened this issue to discuss the idea, rather than to suggest that we really should do it. Maby it's not worth the effort.
What if we fix an upper bound on the read depth, say m
, then construct an m * m
transition matrix by scanning along some bam file (or just simulating reads from a read length distribution like what you suggest). Then to simulate read depths along a sequence, we do a random walk on read depths following the transition matrix. Conceptually fairly easy, and I think it would be faster than simulating reads (but could be wrong).
Ok! I see. So, I understand that what you are suggesting is that we have a precomputation of an actual read simulation (as I described) and from that, we extract the transition probabilities. Then, from those, we perform random walks per individual so that we get the coverage per site with the "dependence" property. I think it's a great idea. The issue I see here is that if the coverage per haplotype is different, then, I guess you'll have to do a simulation per every different mean_coverage.
You're right, each haplotype will need their own chain of read depths (either simulated from reads, or from a transition matrix). But most of the time I think we'll use this to simulate many replicates (i.e. lots of training instances for a neural network), so we ought to be able to cache lots of stuff.
Here's a proof of concept. Sampling the random walk is actually a lot slower than simulating the reads (at least, with the simple model in read_depths_from_reads()
), but I'm sure we could improve things.
import matplotlib.pyplot as plt
import numpy as np
def read_depths_from_reads(
*, num_reads: int, sequence_length: int, read_length: int, seed: int
) -> np.ndarray:
rng = np.random.default_rng(seed)
# 5' positions of reads that overlap the interval [0, sequence_length).
_5p = rng.integers(low=-read_length, high=sequence_length, size=num_reads)
# Left and right positions on the interval.
left = np.maximum(0, _5p)
right = np.minimum(sequence_length, _5p + read_length + 1)
depth = np.zeros(sequence_length, dtype=int)
for a, b in zip(left, right):
depth[a:b] += 1
return depth
def depth_transition_matrix(depth: np.ndarray, max_depth: int = None) -> np.ndarray:
if max_depth is None:
max_depth = np.max(depth)
depth = np.minimum(depth, max_depth)
# Count transitions.
M = np.zeros((max_depth + 1, max_depth + 1))
for j, k in zip(depth[:-1], depth[1:]):
M[j, k] += 1
# To avoid absorbing states, add an epsilon to each count.
M += 1e-6
# Make each row sum to 1.
T = M / np.expand_dims(M.sum(axis=-1), -1)
return T
def read_depths_from_transition_matrix(
T: np.ndarray, sequence_length: int, seed: int
) -> np.ndarray:
n, m = T.shape
assert n == m
rng = np.random.default_rng(seed)
p0 = np.sum(T, axis=0) / np.sum(T)
dp = rng.choice(n, p=p0)
depth = np.zeros(sequence_length, dtype=int)
depth[0] = dp
for i in range(1, sequence_length):
dp = rng.choice(n, p=T[dp])
depth[i] = dp
return depth
depths = read_depths_from_reads(
num_reads=10_000, sequence_length=100_000, read_length=150, seed=1234
)
print(depths.mean(), depths.var(), depths.min(), depths.max())
T = depth_transition_matrix(depths)
#print(T)
limit = 1000 # just plot first 1000 bp
x = np.arange(limit)
plt.step(x, depths[:limit], linestyle="--", label="from reads")
for i in range(3):
depths = read_depths_from_transition_matrix(T, sequence_length=100_000, seed=i)
plt.step(x, depths[:limit], label=f"markov chain {i}")
plt.legend()
plt.show()
Thanks for the comment above! I think we can implement your solution. At the moment, I would just implement the "read simulation" and not the "Markov random walk" since as you say, the former is much faster and the latter is only usefull if there are many individuals with similar coverage.
Right now, the way I'm planning to implement the function is such that sim_allelereadcounts()
will call a function depth_per_haplotype_per_site()
. This function will have two options to simulate read counts per sample per site:
The issue I see with this option is that the function depth_per_haplotype_per_site()
has more variables that weren't considered before:
pos
: SNP position coordinatestart
: start of the genomic region simulatedend
: start of the genomic region simulatedl_reads
: length of each read simulated anddepth_type
: option "indep" or "reads" Since sim_allelereadcounts()
calls depth_per_haplotype_per_site()
all those options will need to be parsed by sim_allelereadcounts()
which means that the number of arguments that the user will have to define will be quite large.
Another option I was considering is to have this function as an independent function that the user can call, and then, use it's output as input for sim_allelereadcounts()
.
What do you think?
Another option I was considering is to have this function as an independent function that the user can call, and then, use it's output as input for
sim_allelereadcounts()
.
I think I prefer this, as it more cleanly separates simulating depths from allele counts.
Before you go to far with this though, maybe it would be a good idea to check a couple of things?
Ok! I've decided to check these questions comparing to real data values the two simulation strategies suggested:
First, I downloaded a cram
file from a Brahui individual from the HGDP dataset as an example and the reference genome it was mapped to:
$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/data/Brahui/HGDP00001/alignment/HGDP00001.alt_bwamem_GRCh38DH.20181023.Brahui.cram
$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
Here are some details about the sequencing described in the HGDP paper:
Libraries were sequenced on Illumina HiSeq X machines, on single lanes for the PCR libraries and multiplexed across 12 lanes for the PCR-free libraries, producing paired-end reads of length 2×151 base-pairs (bp), a mean insert size of 447 bp and a mean coverage of 35.0X
So, I expect this individual to be around 35X with reads of length 151 bp.
To check that, I take the columns corresponding to the leftmost position of a few reads and their bp quality values
$ samtools view HGDP00001.alt_bwamem_GRCh38DH.20181023.Brahui.cram -T GRCh38_full_analysis_set_plus_decoy_hla.fa chr1:1000000-2000000 | awk '{print $4"\t"$11}' > readquality.txt
This is how the file created looks like
$ head readquality.txt
999857 070<<BBB7B0<B77BB7B<<7<<7<<<BB<<0707B<B7<B700777B77<<B7B77B<7<B<7B<BBB<BBBBBB777<0BB0B<<BB<0<B<7BBB<7B<<B<BBB<77<B7BB7BBB<<B<B<B7<BBBB<B7<B<B777<B7<BB<
999866 <B<<B7BB<BBB7B<77B'BBB70<<BBBBBBBBBB<7BB7<B<B0B<7BB<BB7<BB<<B<7B<B7B<B<0B07BBB7<BB7B00<B7770<B00BBBB<B<B<7<BBB7<B<<7077B0B7<B7770<'B70B0<B00B<0<7B07<00
999867 7<<777<BB<<7BB<B<BBBB<777BBBBBBBB777<077777B7B7BB<77B77B<BBBBBBBB7BB<7B<BB7B<BBB77B<B<BBB<7<<BBBBBBB<7BBBBB0<<7<BBBB<B<BBBBB<<<BBB<777BB7B<B7B7BB<B<7<<
999868 <B<7BB<<BB0<B77B7<<B777BB<B<B7BB<<BBBB7'BBBBBB0B<7BB7<0B7<<B7BBB770<<BB<BB<B0<BB7<0BBB7770<B7<B7BB<B0BB7B77B0BB0<770<B'B70B7777<B770'B0077<B00'B70<000'
With a python
script, for the first 10000 reads, I create a pandas
dataframe in which there is a line per bp of each read with its position and its bp quality, transformed from the phred ASCII code to a numeric value
qual = '!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
phred = {}
for i, q in enumerate(qual):
phred[q] = i
readnum = []
positio = []
quality = []
end = None
with open("readquality.txt") as file:
for i, line in enumerate(file):
start, qual = line.strip().split()
start = int(start)
if end != None:
if end < start:
for p in range(end+1, start):
readnum.append(-1)
positio.append(p)
quality.append(0)
for q in qual:
readnum.append(i)
positio.append(start)
quality.append(phred[q])
start += 1
end = start
if i > 10000:
break
df = pd.DataFrame({"readnum" : readnum,
"positio" : positio,
"quality" : quality})
Finally, in R
we can plot some of the basic statistics
df %>%
count(readnum) %>%
mutate(depthx10 = trunc(n/10)) %>%
count(depthx10) %>%
mutate(min_len = (depthx10*10)+1, max_len = (depthx10+1)*10, perc = n*100/sum(n)) %>%
select(min_len, max_len, n, perc)
> min_len max_len n perc
1 1 10 5 0.05254309
2 11 20 4 0.04203447
3 21 30 8 0.08406894
4 31 40 129 1.35561160
5 41 50 18 0.18915511
6 51 60 7 0.07356032
7 61 70 10 0.10508617
8 71 80 9 0.09457755
9 81 90 10 0.10508617
10 91 100 6 0.06305170
11 101 110 6 0.06305170
12 111 120 3 0.03152585
13 121 130 3 0.03152585
14 131 140 3 0.03152585
15 141 150 2 0.02101723
16 151 160 9293 97.65657839
As expected, the majority of reads have lengths between 151 - 160. I checked manually, and they are actually all 151 lengths.
Also, as expected, the mean depth per bp is 37.7; very close to the 35X described in the method section.
df %>%
count(positio) %>%
pull(n) %>%
mean()
> [1] 37.70846
Thus, the region that I selected is a good example to check some statistics.
The profile of depth along the region selected looks like this
df %>%
count(positio) %>%
ggplot() +
geom_point(aes(x = positio, y = n)) +
geom_hline(aes(yintercept = mean(n)), color = "red") +
geom_hline(aes(yintercept = median(n)), color = "blue")
The distribution of coverage per position is the following
df %>%
count(positio) %>%
ggplot() +
geom_histogram(aes(x = n), binwidth = 1) +
geom_point(data = data.frame(x = 0:100, y = dpois(x = 0:100, lambda = 37.70846)*37498), aes(x = x, y = y), color = "red")
We can see that the fit is quite good, pointing towards Poisson distribution to be a good candidate to describe and simulate depths per base pair.
As a measure of the correlation between contiguous positions, I calculate the difference between the depth of a position and the next position. This is referred as the autocorrelation measure from now on.
df %>%
count(positio) %>%
arrange(positio) %>%
mutate(n1 = lead(n)) %>%
mutate(diff = n-n1) %>%
filter(diff < 25, diff > -25) %>%
ggplot() +
geom_point(aes(x = positio, y = diff))
We can see that they are highly correlated with differences that range between +-2 mostly
df %>%
count(positio) %>%
arrange(positio) %>%
mutate(n1 = lead(n)) %>%
mutate(diff = n-n1) %>%
filter(diff < 10, diff > -10) %>%
ggplot() +
geom_histogram(aes(x = diff), binwidth = 1)
and the mean and variance for this measure of autocorrelation is
df %>%
count(positio) %>%
arrange(positio) %>%
mutate(n1 = lead(n)) %>%
mutate(diff = n-n1) %>%
filter(!is.na(diff)) %>%
summarize(mean = mean(diff), var = var(diff))
> mean var
1 -8e-04 1.456838
If I simulate 37498 positions with a mean coverage of 37.71 from a poison distribution and compute this measure of autocorrelation, we get this picture
data.frame(pos = 1:37498,
depth = rpois(n = 37498, lambda = 37.70846)) %>%
mutate(depth1 = lead(depth)) %>%
mutate(diff = depth-depth1) %>%
ggplot() +
geom_point(aes(x = pos, y = diff))
data.frame(pos = 1:37498,
depth = rpois(n = 37498, lambda = 37.70846)) %>%
mutate(depth1 = lead(depth)) %>%
mutate(diff = depth-depth1) %>%
ggplot() +
geom_histogram(aes(x = diff), binwidth = 1)
and the values for the mean and variance are
data.frame(pos = 1:37498,
depth = rpois(n = 37498, lambda = 37.70846)) %>%
mutate(depth1 = lead(depth)) %>%
mutate(diff = depth-depth1) %>%
filter(!is.na(diff)) %>%
#filter(diff < 25, diff > -25) %>%
summarize(mean = mean(diff), var = var(diff))
> mean var
1 -0.0003733632 76.16626
We can already see, as expected, that the autocorrelation does not match between this simulation strategy and the real data. In fact, because the depths are Poisson distributed and independent, the variance (76/2 = 38, since the values calculate take positive and negative values) is the same as the mean (lambda) value. Thus, this simulation strategy is not good to simualte depths for all base pairs. Probably, the length of the reads create the autocorrelation observed in the real data that explains the smaller variance.
In python
, I simualte around 9500 reads of length 151 that map in a region of length 37498 and derive the same statistics to compare with our observed results.
seq_len = 37498
n_reads = 9527-12
read_length = 151
df_sim = np.array([int(x) for x in np.random.uniform(low=0.0, high=seq_len, size=n_reads)])
pos = []
for s in df_sim:
for i in range(s, s+read_length):
pos.append(i)
df_sim = pd.DataFrame({"pos" : np.array(pos)})
and back to R
to plot the autocorrelation statistic
df_sim %>%
count(pos) %>%
arrange(pos) %>%
mutate(n1 = lead(n)) %>%
mutate(diff = n-n1) %>%
ggplot() +
geom_point(aes(x = pos, y = diff))
df_sim %>%
count(pos) %>%
arrange(pos) %>%
mutate(n1 = lead(n)) %>%
mutate(diff = n-n1) %>%
ggplot() +
geom_histogram(aes(x = diff))
And the statistics are
df_sim %>%
count(pos) %>%
arrange(pos) %>%
mutate(n1 = lead(n)) %>%
mutate(diff = n-n1) %>%
filter(!is.na(diff)) %>%
summarize(mean = mean(diff), var = var(diff))
> mean var
1 0 0.5069734
Although the variance is smaller than the real data one, the fit is much better compared to the other simulation strategy. The difference in variance between the real data and the linked simulated data could potentially be explained by on other factors breaking the autocorrelation among neighbouring regions (GC content, repetitive content...) or just pure noise. As we can see in Figure1, there are certain regions that suddenly change depth which might be the responsible for the bigger variance in the original data. In any case, simulating the fluctuations of depth by other causes than just randomness as discussed would be much more complex to simulate and I believe it is far from the scope of our package. I believe that we are safe with the linked simulation strategy.
Sometimes, we might not want to simulate the depth for all base pairs of a given region, and only a subset (polymorphic sites). For example, in humans, the proportion of polymorphisms is 1e-4. So, what is the autocorrelation of the depths between consecutive base pairs for only polymorphic sites?
We can check the autocorrelation of depths in different subsets of the original dataset of positions. If we subsample 1% (1e-2) positions in that region, we get that
df %>%
count(positio) %>%
sample_frac(size = 0.01) %>%
arrange(positio) %>%
mutate(n1 = lead(n)) %>%
mutate(diff = n-n1) %>%
filter(diff < 50, diff > -50) %>%
summarize(mean = mean(diff), var = var(diff))
> mean var
1 -0.0802139 68.1276
We can see that the variance value of our measure for autocorrelation is much more similar to the simple independent simulation strategy. Thus, this indicates that for spaced positions in the genome simulated, it is ok to use the naive Poisson distribution. However, if we want to simulate all positions in a given region, we might want to use the linked strategy. There might be a direct relationship between the level of autocorrelation and the length of reads and the spacing between positions and from that, we could decide which strategy to use. However, it is much easier to just recommend the linked strategy of simulations in case of doubt (at some point, I'll test the running time for both strategies).
Cool! This is very useful! Let's discuss in person this afternoon.
I calculated depths with samtools depth -a
for the .cram file mentioned above and made autocorrelation plots. My guess is that long stretches of zero depth are an important factor in the autocorrelation.
import numpy as np
def autocovariance(x: np.ndarray):
"""Return the autocovariance (ACVF) of 1-d array x."""
# Centre the array.
x = x - np.mean(x)
# Do ACV in transform space.
n = 2 ** int(np.ceil(np.log2(len(x))))
x_trans = np.fft.fft(x, n=2 * n)
acv_trans = x_trans * np.conjugate(x_trans)
return np.fft.ifft(acv_trans)[: len(x)].real
def nonzero_segments(depths: np.ndarray, gapsize=10000, maxsegsize=5*1000*1000):
"""
Break the input into segments of nonzero values.
Segment breakpoints occur at runs of zeros longer than `gapsize`.
For ACVF computational tractibility (max memory needed), segments are also
broken into pieces of length `maxsegsize`.
"""
nz = np.nonzero(depths)[0]
dp_list = []
start = 0
i = 0
for j in nz:
if j - i > gapsize:
if start != i:
seg = depths[start: i]
segs = np.array_split(seg, 1 + len(seg) // maxsegsize)
dp_list.extend(segs)
start = j
i = j
return dp_list
if __name__ == "__main__":
import sys
import matplotlib.pyplot as plt
if len(sys.argv) != 3:
print(f"usage: {sys.argv[0]} depths.tsv acf.pdf")
exit(1)
file_in = sys.argv[1]
plot_out = sys.argv[2]
# Depths from:
# samtools depth -a -r chr22 -o chr22_depths.tsv file.bam
depths = np.loadtxt(file_in, usecols=2, dtype=int)
# Don't bother with the ACF past this distance.
dist_max = 7500
dp_list = nonzero_segments(depths)
acf = np.zeros(dist_max)
for dp in dp_list:
if len(dp) < dist_max:
continue
acf += autocovariance(dp)[:dist_max]
# Normalise to get autocorrelation.
acf /= acf[0]
fig, ax = plt.subplots(figsize=plt.figaspect(9 / 16))
fig.set_tight_layout(True)
ax.plot(acf)
ax.set_ylim([-0.01, 1.01])
ax.set_ylabel("Autocorrelation")
ax.set_xlabel("Distance (bp)")
ax.set_title("Autocorrelation of sequencing depth")
# Read length and mean insert size.
ax.vlines([151, 447], *ax.get_ylim(), colors="black", linestyles=":")
fig.savefig(plot_out)
That's great! I can see you marked with vertical lines 151 and 447 bp distance. I guess another interesting distance point would be (151*2)+447 = 749 bp since it's the total length of a pair end read. But I would say that the sharpest decrease in autocorrelation can be seen in the first 151 bp (we can see a change after the first horizontal dotted line, especially in chromosome 22). It's curious that the autocorrelation is quite different for each chromosome: in chromosome 22 the autocorrelation drops much faster than chromosome 1.
In any case, we can try to implement both solutions: independent (Poisson) depths and linked depths simulating reads, the former for spaced polymorphic (and error) positions and the latter to simulate the entire desired region.
There must be some strong correlation in read depth along the sequence, which drops off after
n
bases, wheren
is some function of the mean read length. E.g. two sites that are adjacent along the genome will have very similar read depths.