RacimoLab / simGL

Simulate genotype likelihoods from tree sequence data
ISC License

Draw error rate from a distribution #21

Open grahamgower opened 2 years ago

grahamgower commented 2 years ago

E.g. from an empirical distribution observed for a given sequencing platform. These can then be returned to the user in the form of Phred-scaled base quality scores, which some genotype likelihood models use explicitly.
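For reference, the Phred convention (a standard definition, independent of simGL) relates a base quality Q to an error probability p by Q = -10 log10(p). A minimal Python sketch of the conversion:

import numpy as np

def phred_to_error_prob(q):
    # Error probability implied by a Phred-scaled quality score: p = 10^(-Q/10).
    return 10.0 ** (-np.asarray(q) / 10.0)

def error_prob_to_phred(p):
    # Phred-scaled quality score for an error probability: Q = -10*log10(p).
    return -10.0 * np.log10(np.asarray(p))

# E.g. Q=30 corresponds to a 1-in-1000 error probability.
assert np.isclose(phred_to_error_prob(30), 1e-3)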

MoiColl commented 2 years ago

Do we have any information on how those errors are distributed?

grahamgower commented 2 years ago

It depends on the sequencing platform. I did something similar a few years ago: https://github.com/grahamgower/PP5mC#quality-score-profiles

grahamgower commented 2 years ago

For ancient DNA, the base quality scores may have been "recalibrated" using, e.g., mapDamage, to account for possible deamination at the terminal ends of reads. I'm not sure how we could incorporate this nicely without simulating the reads in conjunction with a putative ancestral genome sequence (so as to identify where possible C->T or G->A substitutions could be). But this is something that empirical datasets do, so it's worth considering how we could close the gap between a simulated training dataset and the empirical dataset(s) the trained model will be applied to.
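If we wanted to approximate the position dependence without full read simulation, one option might be an error probability that decays from the read ends, in the spirit of mapDamage-style damage models. A hypothetical sketch (the function, parameters, and exponential decay form are all assumptions, not simGL code):

import numpy as np

def deamination_prob(read_length, p_end=0.3, decay=0.3):
    # Hypothetical probability of a C->T artefact at each read position,
    # decaying exponentially from the 5' end (G->A would mirror this
    # from the 3' end on the opposite strand).
    pos = np.arange(read_length)
    return p_end * np.exp(-decay * pos)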

MoiColl commented 2 years ago

I was considering doing something similar to what I'm doing with mean_depth, which the user can choose to input as: 1) a single mean value for all haplotypic samples, which is then used together with std_depth to randomly sample a mean coverage value per sample from a normal distribution; or 2) a list with the mean coverage per sample. Then, the coverage per site is drawn from a Poisson distribution per individual.
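As a numpy sketch of that scheme (hypothetical function and names, mirroring the description above rather than the actual simGL code):

import numpy as np

rng = np.random.default_rng()

def sample_coverage(n_samples, n_sites, mean_depth, std_depth):
    # Option 1: per-sample mean coverages drawn from a normal distribution.
    means = np.clip(rng.normal(mean_depth, std_depth, size=n_samples), 0, None)
    # (Option 2 would instead take `means` directly as a user-supplied list.)
    # Per-site coverage drawn from a Poisson distribution per individual.
    return rng.poisson(means[:, None], size=(n_samples, n_sites))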

But I don't think it makes sense to sample error probabilities from a normal distribution. Furthermore, the errors for each nucleotide have their own biases (C>T deamination).

grahamgower commented 2 years ago

How about using a discrete distribution for the errors? This is what we get in practice with Phred scores in a fastq file, so it should match up pretty well with empirical data.
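Something like the following sketch, where the support and weights are invented for illustration:

import numpy as np

rng = np.random.default_rng()

# A discrete Phred-score distribution (values and probabilities made up here;
# in practice they would come from empirical data).
quals = np.array([2, 11, 25, 37])
probs = np.array([0.05, 0.10, 0.15, 0.70])

# Draw a Phred-scaled quality score for each simulated base.
base_quals = rng.choice(quals, size=1_000_000, p=probs)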

MoiColl commented 2 years ago

Ok! I did some analysis to check how the quality scores for each base pair of reads are distributed in real data, and how they are correlated.

1. Empirical read base-pair quality distributions

I downloaded fastq files from several samples

| sample | age (years BP) | coverage (x) | link |
| --- | --- | --- | --- |
| Brahui | 0 | ~35 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR757/ERR757817/ERR757817_1.fastq.gz |
| Ust'-Ishim | 45,000 | 40.70 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR566/ERR566093/ERR566093_1.fastq.gz |
| Young Yana | 862 | 2.03 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR335/001/ERR3351001/ERR3351001.fastq.gz |
| Sunghir I | 32,873 | 1.16 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/004/ERR2117984/ERR2117984.fastq.gz |

with the command

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR757/ERR757817/ERR757817_1.fastq.gz

Then, with the fastqc program, I obtained the distribution of quality scores for the reads in each fastq file with the following command

fastqc --nogroup ERR757817_1.fastq.gz

And obtained the desired information from the file

unzip ERR757817_1_fastqc.zip
ls ERR757817_1_fastqc/fastqc_data.txt

To read such a file, I created a Python function

import pandas as pd

def read_fastqc_data(file):
    # Parse the ">>Per sequence quality scores" module of a FastQC
    # fastqc_data.txt file into a quality/counts DataFrame.
    quals  = []
    counts = []

    parsing = False
    with open(file) as fh:
        for line in fh:
            if ">>END_MODULE" in line and parsing:
                break
            if parsing and not line.startswith("#"):
                # Data lines are "<quality> <count>"; counts may be
                # written as floats, e.g. "1.0".
                qual, count = line.strip().split()
                quals.append(int(qual))
                counts.append(int(float(count)))
            if ">>Per sequence quality scores" in line:
                parsing = True

    return pd.DataFrame({"quality": quals, "counts": counts})

I then obtained all the distributions for each fastq with

qualsdist_HGDP = read_fastqc_data("ERR757817_1_fastqc/fastqc_data.txt")
qualsdist_USTI = read_fastqc_data("ERR566093_1_fastqc/fastqc_data.txt")
qualsdist_YANA = read_fastqc_data("ERR3351001_fastqc/fastqc_data.txt")
qualsdist_SUNG = read_fastqc_data("ERR2117984_fastqc/fastqc_data.txt")

And plotted those distributions in R


library(tidyverse)
library(ggplot2)
library(cowplot)

plot_qualdist <- function(qualsdist){
  qualsdist %>%
      mutate(counts = counts/sum(counts)) %>%
      ggplot() +
      geom_bar(stat = "identity", aes(x = quality, y = counts))
}

plot_grid(plot_qualdist(qualsdist_HGDP) + xlim(c(0, 65)) + ylim(0, 0.9) + ggtitle("HGDP (35x)"),
          plot_qualdist(qualsdist_USTI) + xlim(c(0, 65)) + ylim(0, 0.9) + ggtitle("Ust'-Ishim (40.7x)"),
          plot_qualdist(qualsdist_YANA) + xlim(c(0, 65)) + ylim(0, 0.9) + ggtitle("Young Yana (2.03x)"),
          plot_qualdist(qualsdist_SUNG) + xlim(c(0, 65)) + ylim(0, 0.9) + ggtitle("Sunghir I (1.16x)"),
          ncol = 2)

(figure qualitydist: quality score distributions for the four samples)

We can see from this figure that the shapes of the distributions look quite alike regardless of the coverage and age of the sample, but with different means and variances. This suggests to me that if we could find a parametric distribution that behaved like this, we could set its mean and variance and draw quality values from it. Otherwise, I can also imagine an option to input an empirical distribution (the fastqc text file output) and use that to sample quality values per base pair. If the user does not specify anything regarding the quality values, then by default we could use the quality distribution from one of the HGDP fastq files. A final option for the user would be to specify a single quality value, which would be constant for all simulated nucleotides.
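A sketch of how those options could fit together, reusing read_fastqc_data() from above (hypothetical interface, not the simGL API; the default file name is made up):

import numpy as np

rng = np.random.default_rng()

def draw_quals(n, quality=None, qualsdist=None):
    # quality:   a single Phred score used for every simulated base.
    # qualsdist: a DataFrame from read_fastqc_data(), sampled empirically.
    # If neither is given, fall back to a bundled default distribution.
    if quality is not None:
        return np.full(n, quality, dtype=int)
    if qualsdist is None:
        qualsdist = read_fastqc_data("HGDP_default_fastqc_data.txt")  # hypothetical default
    p = qualsdist["counts"].to_numpy() / qualsdist["counts"].sum()
    return rng.choice(qualsdist["quality"].to_numpy(), size=n, p=p)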

2. Autocorrelation

I took the file produced from the cram file as explained in #22, and first checked the histogram of the distribution

df %>% 
    ggplot() +
    geom_histogram(aes(x = quality), binwidth = 1)

(figure dist: histogram of quality scores from the subset of reads)

It looks different from the previous distributions, probably because I took a subset of a few reads.

Then, I computed the difference in score between consecutive positions and plotted the difference as a measure of autocorrelation of the quality score between adjacent bases.

Here is the plot of this value per site in the genome

df %>% 
    group_by(readnum) %>%
    mutate(diff = quality-lead(quality)) %>%
    filter(!is.na(diff)) %>%
    ggplot() +
    geom_point(aes(x = position, y = diff))

(figure autocorrelationbyposition: difference in consecutive quality scores, by position)

The distribution of this value

df %>% 
    group_by(readnum) %>%
    mutate(diff = quality-lead(quality)) %>%
    filter(!is.na(diff)) %>%
    ggplot() +
    geom_histogram(aes(x = diff), bins = 20)

(figure autocorrelationdistribution: distribution of differences between consecutive quality scores)

And some summary statistics

df %>% 
    group_by(readnum) %>%
    mutate(diff = quality-lead(quality)) %>%
    filter(!is.na(diff)) %>%
    ungroup() %>%
    summarize(mean = mean(diff), var = var(diff))

>      mean   var
>     <dbl> <dbl>
> 1 0.00104  63.6

In order to compare to what would happen if there was no autocorrelation, I permuted the quality column from the dataset and recalculated the statistic.

df$qualityperm <- sample(x = df$quality, replace = FALSE)

df %>% 
    mutate(diff = quality-qualityperm) %>%
    filter(!is.na(diff)) %>%
    summarize(mean = mean(diff), var = var(diff))

>   mean     var
> 1    0 95.2269

The variance is smaller in the real dataset, pointing towards autocorrelation: if adjacent scores are positively correlated with correlation rho, then Var(q_i - q_{i+1}) = 2*sigma^2*(1 - rho), which is smaller than the 2*sigma^2 expected for independent scores. Furthermore, I did 10,000 permutations, checked the variance value for each permutation, and compared it to the real value. None of the permutations had a value equal to or more extreme than the real one. Thus, it is very likely that the quality scores of a read are autocorrelated.

However, since this would be more complicated to simulate than the coverage autocorrelation in #22, and might not be so critical for simulating genotype likelihoods, for now I won't simulate such autocorrelation.
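For the record, the permutation test described above could be sketched in Python as follows (simplified to a single read's quality scores in a 1-D array, rather than grouping by readnum):

import numpy as np

rng = np.random.default_rng()

def diff_var(quality):
    # Variance of the differences between consecutive quality scores.
    return np.var(np.diff(quality))

def permutation_test(quality, n_perm=10_000):
    observed = diff_var(quality)
    # Under the null of no autocorrelation, permuting the scores should not
    # systematically change the variance of consecutive differences.
    perms = np.array([diff_var(rng.permutation(quality)) for _ in range(n_perm)])
    # One-sided: positive autocorrelation makes the observed variance smaller.
    return observed, (np.sum(perms <= observed) + 1) / (n_perm + 1)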

3. Differences between BAM and fastq quality scores

I've taken the reads in the original cram file and created a fastq file out of it with

samtools view -T GRCh38_full_analysis_set_plus_decoy_hla.fa HGDP00001.alt_bwamem_GRCh38DH.20181023.Brahui.cram | awk '{print "@"$1"\n"$10"\n+\n"$11}' > HGDP00001.cram2fastq.fastq

I didn't let the whole file be processed, since it is very big. Because I cancelled the job midway, I had to remove a few lines from the end of the file. Then I ran fastqc on the new fastq file and read the data with the previous function

qualsdist_HGDP_CRAM = read_fastqc_data("/Users/au552345/GenomeDK/fastqsbams/HGDP00001.cram2fastq_fastqc/fastqc_data.txt")

and plotted it together with the original fastq distribution


qualsdist_HGDP %>%
    mutate(counts = counts/sum(counts)) %>%
    ggplot() +
    geom_bar(stat = "identity", aes(x = quality, y = counts), alpha = 0.5, fill = "blue") +
    geom_bar(data = qualsdist_HGDP_CRAM, stat = "identity", aes(x = quality, y = counts/sum(counts)), alpha = 0.5, fill = "red")

(figure comparisonqualityfastqbam: fastq (blue) vs cram-derived (red) quality score distributions)

We can see that the cram quality distribution has changed quite a lot: quality scores are lower in general and the distribution looks bimodal. In the case of simGL, I think the objective is to simulate reads that are already mapped, so it might be more accurate to use empirical distributions from BAM/CRAM files rather than fastq files. Nonetheless, I don't think it will matter too much beyond introducing variation in the quality measurement.

grahamgower commented 2 years ago

With your last plot, which colour is the mapped data and which is the unmapped? The mapped data should be a proper subset of the unmapped data, right?

MoiColl commented 2 years ago

Red is the mapped data. I thought the same as you, but then could it be that there is some recalibration of the quality scores after mapping? Or that the quality values have been reformatted to a different Phred scale?

grahamgower commented 2 years ago

I don't know, but there's something funny going on there.