There must be some strong correlation in read depth along the sequence, which drops off after n bases, where n is some function of the mean read length. E.g. two sites that are adjacent along the genome will have very similar read depths.

Maybe we could construct a Markov model to describe this, from which we could easily simulate read depths?

I agree, simGL right now assumes that each site is totally independent from the rest of sites. But of course, in a real set up, coverage should be correlated across nearby sites. I think this is an ok assumption for distant sites since it makes things very easy. However, depending on the SNP density, I agree that we should incorporate some sort of correlation across sites.

I think we could simulate N reads of length L (normally distributed) that satisfy a mean coverage X along the genome of length G such that X = G/NL. Then, we can randomly place those reads along the genome defining a starting position per read. Finally, we can check the number of reads overlapping per site. I expect that this simulation is going to be slower to produce than the Poisson random sample that is implemented right now in sim_allelereadcount().

But again, this raises other questions such as error rates might be also correlated somehow and different platforms might have different error profiles... and I guess the gold question here is how accurate do we want to go to simulate GL?

I opened this issue to discuss the idea, rather than to suggest that we really should do it. Maby it's not worth the effort.

What if we fix an upper bound on the read depth, say m, then construct an m * m transition matrix by scanning along some bam file (or just simulating reads from a read length distribution like what you suggest). Then to simulate read depths along a sequence, we do a random walk on read depths following the transition matrix. Conceptually fairly easy, and I think it would be faster than simulating reads (but could be wrong).

Ok! I see. So, I understand that what you are suggesting is that we have a precomputation of an actual read simulation (as I described) and from that, we extract the transition probabilities. Then, from those, we perform random walks per individual so that we get the coverage per site with the "dependence" property. I think it's a great idea. The issue I see here is that if the coverage per haplotype is different, then, I guess you'll have to do a simulation per every different mean_coverage.

You're right, each haplotype will need their own chain of read depths (either simulated from reads, or from a transition matrix). But most of the time I think we'll use this to simulate many replicates (i.e. lots of training instances for a neural network), so we ought to be able to cache lots of stuff.

Here's a proof of concept. Sampling the random walk is actually a lot slower than simulating the reads (at least, with the simple model in read_depths_from_reads()), but I'm sure we could improve things.

import matplotlib.pyplot as plt
import numpy as np

def read_depths_from_reads(
    *, num_reads: int, sequence_length: int, read_length: int, seed: int
) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # 5' positions of reads that overlap the interval [0, sequence_length).
    _5p = rng.integers(low=-read_length, high=sequence_length, size=num_reads)
    # Left and right positions on the interval.
    left = np.maximum(0, _5p)
    right = np.minimum(sequence_length, _5p + read_length + 1)
    depth = np.zeros(sequence_length, dtype=int)
    for a, b in zip(left, right):
        depth[a:b] += 1
    return depth

def depth_transition_matrix(depth: np.ndarray, max_depth: int = None) -> np.ndarray:
    if max_depth is None:
        max_depth = np.max(depth)
    depth = np.minimum(depth, max_depth)
    # Count transitions.
    M = np.zeros((max_depth + 1, max_depth + 1))
    for j, k in zip(depth[:-1], depth[1:]):
        M[j, k] += 1
    # To avoid absorbing states, add an epsilon to each count.
    M += 1e-6
    # Make each row sum to 1.
    T = M / np.expand_dims(M.sum(axis=-1), -1)
    return T

def read_depths_from_transition_matrix(
    T: np.ndarray, sequence_length: int, seed: int
) -> np.ndarray:
    n, m = T.shape
    assert n == m
    rng = np.random.default_rng(seed)
    p0 = np.sum(T, axis=0) / np.sum(T)
    dp = rng.choice(n, p=p0)
    depth = np.zeros(sequence_length, dtype=int)
    depth[0] = dp
    for i in range(1, sequence_length):
        dp = rng.choice(n, p=T[dp])
        depth[i] = dp
    return depth

depths = read_depths_from_reads(
    num_reads=10_000, sequence_length=100_000, read_length=150, seed=1234
)
print(depths.mean(), depths.var(), depths.min(), depths.max())

T = depth_transition_matrix(depths)
#print(T)

limit = 1000  # just plot first 1000 bp
x = np.arange(limit)
plt.step(x, depths[:limit], linestyle="--", label="from reads")
for i in range(3):
    depths = read_depths_from_transition_matrix(T, sequence_length=100_000, seed=i)
    plt.step(x, depths[:limit], label=f"markov chain {i}")

plt.legend()
plt.show()

depth_markov_chain

Thanks for the comment above! I think we can implement your solution. At the moment, I would just implement the "read simulation" and not the "Markov random walk" since as you say, the former is much faster and the latter is only usefull if there are many individuals with similar coverage.

Right now, the way I'm planning to implement the function is such that sim_allelereadcounts() will call a function depth_per_haplotype_per_site(). This function will have two options to simulate read counts per sample per site:

"indep" optionas: read counts are drawn from a Poisson distribution (as it was until now)
"reads" option: this is your solution, in which read counts per site are simulated by first simuating reads of a certain length and scatter them in a genomic region region (defined by start and end).

The issue I see with this option is that the function depth_per_haplotype_per_site() has more variables that weren't considered before:

pos : SNP position coordinate
start : start of the genomic region simulated
end : start of the genomic region simulated
l_reads : length of each read simulated and
depth_type : option "indep" or "reads"

Since sim_allelereadcounts() calls depth_per_haplotype_per_site() all those options will need to be parsed by sim_allelereadcounts() which means that the number of arguments that the user will have to define will be quite large.

Another option I was considering is to have this function as an independent function that the user can call, and then, use it's output as input for sim_allelereadcounts().

What do you think?

Another option I was considering is to have this function as an independent function that the user can call, and then, use it's output as input for sim_allelereadcounts().

I think I prefer this, as it more cleanly separates simulating depths from allele counts.

Before you go to far with this though, maybe it would be a good idea to check a couple of things?

What is the autocorrelation of read depths in empirical data? Probably we care about modern data (fixed read lengths) and ancient data (with log-normal-ish read length distribution)? It would be nice to understand under which circumstances it is useful to model non-independence of read depths. I.e. for what average read length and what density of segregating sites do we care?
Does the spatial distribution of depths in empirical data resemble what we've simulated? I suspect that in ancient DNA, data is wildly biased by sequence composition (e.g. much higher depths in GC rich regions). In particular, we probably underestimate the empirical variance in read depth by using the simple read depth simulations above.

Ok! I've decided to check these questions comparing to real data values the two simulation strategies suggested:

Independent: Drawing the depth per position from a Poisson distribution with a certain lambda
Linked: Simulating reads by sampling their starting location from a uniform distribution (start and end position fixed) and calculating the depth per position by checking the number of reads spanning each position

1. Check the statistics in a modern sample

First, I downloaded a cram file from a Brahui individual from the HGDP dataset as an example and the reference genome it was mapped to:

$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/data/Brahui/HGDP00001/alignment/HGDP00001.alt_bwamem_GRCh38DH.20181023.Brahui.cram
$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

Here are some details about the sequencing described in the HGDP paper:

Libraries were sequenced on Illumina HiSeq X machines, on single lanes for the PCR libraries and multiplexed across 12 lanes for the PCR-free libraries, producing paired-end reads of length 2×151 base-pairs (bp), a mean insert size of 447 bp and a mean coverage of 35.0X

So, I expect this individual to be around 35X with reads of length 151 bp.

To check that, I take the columns corresponding to the leftmost position of a few reads and their bp quality values

$ samtools view HGDP00001.alt_bwamem_GRCh38DH.20181023.Brahui.cram -T GRCh38_full_analysis_set_plus_decoy_hla.fa chr1:1000000-2000000 | awk '{print $4"\t"$11}' > readquality.txt

This is how the file created looks like

$ head readquality.txt
999857  070<<BBB7B0<B77BB7B<<7<<7<<<BB<<0707B<B7<B700777B77<<B7B77B<7<B<7B<BBB<BBBBBB777<0BB0B<<BB<0<B<7BBB<7B<<B<BBB<77<B7BB7BBB<<B<B<B7<BBBB<B7<B<B777<B7<BB<
999866  <B<<B7BB<BBB7B<77B'BBB70<<BBBBBBBBBB<7BB7<B<B0B<7BB<BB7<BB<<B<7B<B7B<B<0B07BBB7<BB7B00<B7770<B00BBBB<B<B<7<BBB7<B<<7077B0B7<B7770<'B70B0<B00B<0<7B07<00
999867  7<<777<BB<<7BB<B<BBBB<777BBBBBBBB777<077777B7B7BB<77B77B<BBBBBBBB7BB<7B<BB7B<BBB77B<B<BBB<7<<BBBBBBB<7BBBBB0<<7<BBBB<B<BBBBB<<<BBB<777BB7B<B7B7BB<B<7<<
999868  <B<7BB<<BB0<B77B7<<B777BB<B<B7BB<<BBBB7'BBBBBB0B<7BB7<0B7<<B7BBB770<<BB<BB<B0<BB7<0BBB7770<B7<B7BB<B0BB7B77B0BB0<770<B'B70B7777<B770'B0077<B00'B70<000'

With a python script, for the first 10000 reads, I create a pandas dataframe in which there is a line per bp of each read with its position and its bp quality, transformed from the phred ASCII code to a numeric value

qual = '!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
phred = {}

for i, q in enumerate(qual):
    phred[q] = i

readnum = []
positio = []
quality = []
end = None
with open("readquality.txt") as file:
    for i, line in enumerate(file):
        start, qual = line.strip().split()
        start = int(start)

        if end != None:
            if end < start:
                for p in range(end+1, start):
                    readnum.append(-1)
                    positio.append(p)
                    quality.append(0)

        for q in qual:
            readnum.append(i)
            positio.append(start)
            quality.append(phred[q])
            start += 1
            end = start
        if i > 10000:
            break

df = pd.DataFrame({"readnum" : readnum,
                   "positio" : positio,
                   "quality" : quality})

Finally, in R we can plot some of the basic statistics

Read length

df %>% 
    count(readnum) %>%
    mutate(depthx10 = trunc(n/10)) %>%
    count(depthx10) %>%
    mutate(min_len = (depthx10*10)+1, max_len = (depthx10+1)*10, perc = n*100/sum(n)) %>%
    select(min_len, max_len, n, perc)

>  min_len max_len    n        perc
1        1      10    5  0.05254309
2       11      20    4  0.04203447
3       21      30    8  0.08406894
4       31      40  129  1.35561160
5       41      50   18  0.18915511
6       51      60    7  0.07356032
7       61      70   10  0.10508617
8       71      80    9  0.09457755
9       81      90   10  0.10508617
10      91     100    6  0.06305170
11     101     110    6  0.06305170
12     111     120    3  0.03152585
13     121     130    3  0.03152585
14     131     140    3  0.03152585
15     141     150    2  0.02101723
16     151     160 9293 97.65657839

As expected, the majority of reads have lengths between 151 - 160. I checked manually, and they are actually all 151 lengths.

Mean depth

Also, as expected, the mean depth per bp is 37.7; very close to the 35X described in the method section.

df %>% 
    count(positio) %>%
    pull(n) %>%
    mean()

> [1] 37.70846

Thus, the region that I selected is a good example to check some statistics.

Depth per base pair and distribution of depths per base pair

The profile of depth along the region selected looks like this

df %>% 
    count(positio) %>%
    ggplot() +
    geom_point(aes(x = positio, y = n)) +
    geom_hline(aes(yintercept = mean(n)), color = "red") +
    geom_hline(aes(yintercept = median(n)), color = "blue")

Captura de pantalla 2022-06-15 a les 15 57 50

Figure1. Depth (y-axis) per bp on the region selected (x-axis). The mean and the median are shown with horizontal red and blue lines.

The distribution of coverage per position is the following

df %>% 
    count(positio) %>%
    ggplot() +
    geom_histogram(aes(x = n), binwidth = 1) +
    geom_point(data = data.frame(x = 0:100, y = dpois(x = 0:100, lambda = 37.70846)*37498), aes(x = x, y = y), color = "red")

Captura de pantalla 2022-06-15 a les 16 01 26

Figure2. Histogram of the distribution of depths in the positions of the region selected. In red dots, it is shown the expected distribution of the depth for 37498 positions drawn from a Poisson distribution with a lambda value of 37.70846.

We can see that the fit is quite good, pointing towards Poisson distribution to be a good candidate to describe and simulate depths per base pair.

Depth correlation between contiguous positions

As a measure of the correlation between contiguous positions, I calculate the difference between the depth of a position and the next position. This is referred as the autocorrelation measure from now on.

df %>% 
    count(positio) %>%
    arrange(positio) %>%
    mutate(n1 = lead(n)) %>%
    mutate(diff = n-n1) %>%
    filter(diff < 25, diff > -25) %>%
    ggplot() +
    geom_point(aes(x = positio, y = diff))

Captura de pantalla 2022-06-15 a les 16 08 41

Figure3. Difference of depths between positions in n and n+1 for every position in the region.

We can see that they are highly correlated with differences that range between +-2 mostly

df %>% 
    count(positio) %>%
    arrange(positio) %>%
    mutate(n1 = lead(n)) %>%
    mutate(diff = n-n1) %>%
    filter(diff < 10, diff > -10) %>%
    ggplot() +
    geom_histogram(aes(x = diff), binwidth = 1)

Captura de pantalla 2022-06-15 a les 16 10 45

Figure4. Histogram of the difference of depths between positions in n and n+1.

and the mean and variance for this measure of autocorrelation is

df %>% 
    count(positio) %>%
    arrange(positio) %>%
    mutate(n1 = lead(n)) %>%
    mutate(diff = n-n1) %>%
    filter(!is.na(diff)) %>%
    summarize(mean = mean(diff), var = var(diff))

>   mean      var
1 -8e-04 1.456838

2. Check the statistics in an independent simulation

If I simulate 37498 positions with a mean coverage of 37.71 from a poison distribution and compute this measure of autocorrelation, we get this picture

data.frame(pos = 1:37498,
           depth = rpois(n = 37498, lambda = 37.70846)) %>%
    mutate(depth1 = lead(depth)) %>%
    mutate(diff = depth-depth1) %>%
    ggplot() +
    geom_point(aes(x = pos, y = diff))

Captura de pantalla 2022-06-15 a les 16 16 12

Figure5. Difference of depths between positions n and n+1 for every position in the independent simulated region.

data.frame(pos = 1:37498,
           depth = rpois(n = 37498, lambda = 37.70846)) %>%
    mutate(depth1 = lead(depth)) %>%
    mutate(diff = depth-depth1) %>%
    ggplot() +
    geom_histogram(aes(x = diff), binwidth = 1)

Captura de pantalla 2022-06-15 a les 16 27 09

Figure6. Histogram of the difference of depths between positions in n and n+1 of the independent simulation.

and the values for the mean and variance are

data.frame(pos = 1:37498,
           depth = rpois(n = 37498, lambda = 37.70846)) %>%
    mutate(depth1 = lead(depth)) %>%
    mutate(diff = depth-depth1) %>% 
    filter(!is.na(diff)) %>%
    #filter(diff < 25, diff > -25) %>%
    summarize(mean = mean(diff), var = var(diff))

>          mean      var
1 -0.0003733632 76.16626

We can already see, as expected, that the autocorrelation does not match between this simulation strategy and the real data. In fact, because the depths are Poisson distributed and independent, the variance (76/2 = 38, since the values calculate take positive and negative values) is the same as the mean (lambda) value. Thus, this simulation strategy is not good to simualte depths for all base pairs. Probably, the length of the reads create the autocorrelation observed in the real data that explains the smaller variance.

3. Check the statistics in an linked simulation

In python, I simualte around 9500 reads of length 151 that map in a region of length 37498 and derive the same statistics to compare with our observed results.

seq_len = 37498
n_reads = 9527-12
read_length = 151
df_sim = np.array([int(x) for x in np.random.uniform(low=0.0, high=seq_len, size=n_reads)])
pos = []
for s in df_sim:
    for i in range(s, s+read_length):
        pos.append(i)

df_sim = pd.DataFrame({"pos" : np.array(pos)})

and back to R to plot the autocorrelation statistic


df_sim %>% 
    count(pos) %>%
    arrange(pos) %>%
    mutate(n1 = lead(n)) %>%
    mutate(diff = n-n1) %>%
    ggplot() +
    geom_point(aes(x = pos, y = diff))

Captura de pantalla 2022-06-15 a les 16 23 06

Figure7. Difference of depths between positions n and n+1 for every position in the Linked simulated region.

df_sim %>% 
    count(pos) %>%
    arrange(pos) %>%
    mutate(n1 = lead(n)) %>%
    mutate(diff = n-n1) %>%
    ggplot() +
    geom_histogram(aes(x = diff))

Captura de pantalla 2022-06-15 a les 16 25 02

Figure8. Histogram of the difference of depths between positions in n and n+1 of the Linked simulated region.

And the statistics are

df_sim %>% 
    count(pos) %>%
    arrange(pos) %>%
    mutate(n1 = lead(n)) %>%
    mutate(diff = n-n1) %>%
    filter(!is.na(diff)) %>%
    summarize(mean = mean(diff), var = var(diff))

>     mean       var
1    0 0.5069734

Although the variance is smaller than the real data one, the fit is much better compared to the other simulation strategy. The difference in variance between the real data and the linked simulated data could potentially be explained by on other factors breaking the autocorrelation among neighbouring regions (GC content, repetitive content...) or just pure noise. As we can see in Figure1, there are certain regions that suddenly change depth which might be the responsible for the bigger variance in the original data. In any case, simulating the fluctuations of depth by other causes than just randomness as discussed would be much more complex to simulate and I believe it is far from the scope of our package. I believe that we are safe with the linked simulation strategy.

4. Check the statistics in a subsample of the whole observed region

Sometimes, we might not want to simulate the depth for all base pairs of a given region, and only a subset (polymorphic sites). For example, in humans, the proportion of polymorphisms is 1e-4. So, what is the autocorrelation of the depths between consecutive base pairs for only polymorphic sites?

We can check the autocorrelation of depths in different subsets of the original dataset of positions. If we subsample 1% (1e-2) positions in that region, we get that

df %>%
    count(positio) %>%
    sample_frac(size = 0.01) %>%
    arrange(positio) %>%
    mutate(n1 = lead(n)) %>%
    mutate(diff = n-n1) %>%
    filter(diff < 50, diff > -50) %>%
    summarize(mean = mean(diff), var = var(diff))

>       mean     var
1 -0.0802139 68.1276

We can see that the variance value of our measure for autocorrelation is much more similar to the simple independent simulation strategy. Thus, this indicates that for spaced positions in the genome simulated, it is ok to use the naive Poisson distribution. However, if we want to simulate all positions in a given region, we might want to use the linked strategy. There might be a direct relationship between the level of autocorrelation and the length of reads and the spacing between positions and from that, we could decide which strategy to use. However, it is much easier to just recommend the linked strategy of simulations in case of doubt (at some point, I'll test the running time for both strategies).

Cool! This is very useful! Let's discuss in person this afternoon.

I calculated depths with samtools depth -a for the .cram file mentioned above and made autocorrelation plots. My guess is that long stretches of zero depth are an important factor in the autocorrelation.

import numpy as np

def autocovariance(x: np.ndarray):
    """Return the autocovariance (ACVF) of 1-d array x."""
    # Centre the array.
    x = x - np.mean(x)
    # Do ACV in transform space.
    n = 2 ** int(np.ceil(np.log2(len(x))))
    x_trans = np.fft.fft(x, n=2 * n)
    acv_trans = x_trans * np.conjugate(x_trans)
    return np.fft.ifft(acv_trans)[: len(x)].real

def nonzero_segments(depths: np.ndarray, gapsize=10000, maxsegsize=5*1000*1000):
    """
    Break the input into segments of nonzero values.

    Segment breakpoints occur at runs of zeros longer than `gapsize`.
    For ACVF computational tractibility (max memory needed), segments are also
    broken into pieces of length `maxsegsize`.
    """
    nz = np.nonzero(depths)[0]
    dp_list = []
    start = 0
    i = 0
    for j in nz:
        if j - i > gapsize:
            if start != i:
                seg = depths[start: i]
                segs = np.array_split(seg, 1 + len(seg) // maxsegsize)
                dp_list.extend(segs)
            start = j
        i = j
    return dp_list

if __name__ == "__main__":
    import sys
    import matplotlib.pyplot as plt

    if len(sys.argv) != 3:
        print(f"usage: {sys.argv[0]} depths.tsv acf.pdf")
        exit(1)

    file_in = sys.argv[1]
    plot_out = sys.argv[2]

    # Depths from:
    #   samtools depth -a -r chr22 -o chr22_depths.tsv file.bam
    depths = np.loadtxt(file_in, usecols=2, dtype=int)

    # Don't bother with the ACF past this distance.
    dist_max = 7500

    dp_list = nonzero_segments(depths)
    acf = np.zeros(dist_max)
    for dp in dp_list:
        if len(dp) < dist_max:
            continue
        acf += autocovariance(dp)[:dist_max]
    # Normalise to get autocorrelation.
    acf /= acf[0]

    fig, ax = plt.subplots(figsize=plt.figaspect(9 / 16))
    fig.set_tight_layout(True)
    ax.plot(acf)
    ax.set_ylim([-0.01, 1.01])
    ax.set_ylabel("Autocorrelation")
    ax.set_xlabel("Distance (bp)")
    ax.set_title("Autocorrelation of sequencing depth")

    # Read length and mean insert size.
    ax.vlines([151, 447], *ax.get_ylim(), colors="black", linestyles=":")

    fig.savefig(plot_out)

chr22

chr22_depths

chr1

chr1_depths

That's great! I can see you marked with vertical lines 151 and 447 bp distance. I guess another interesting distance point would be (151*2)+447 = 749 bp since it's the total length of a pair end read. But I would say that the sharpest decrease in autocorrelation can be seen in the first 151 bp (we can see a change after the first horizontal dotted line, especially in chromosome 22). It's curious that the autocorrelation is quite different for each chromosome: in chromosome 22 the autocorrelation drops much faster than chromosome 1.

In any case, we can try to implement both solutions: independent (Poisson) depths and linked depths simulating reads, the former for spaced polymorphic (and error) positions and the latter to simulate the entire desired region.

RacimoLab / simGL

Read depth at distinct sites is non-independent #22