Open grahamgower opened 2 years ago
Do we have any information on how those errors are distributed?
It depends on the sequencing platform. I did something similar a few years ago: https://github.com/grahamgower/PP5mC#quality-score-profiles
For ancient DNA, the base quality scores may have been "recalibrated" using, e.g. mapDamage, to account for possible deamination at the terminal ends of reads. I'm not sure how we could incorporate this nicely without simulating the reads in conjunction with a putative ancestral genome sequence (so as to identify where possible C->T or G->A substitutions could be). But this is something that empirical datasets do, so its worth considering how we could close the gap between a simulated training dataset and the empirical dataset(s) the trained model will be applied to.
I was considering doing something similar to what I'm doing with mean_depth
which the user can choose to input:
1) just a mean value for all haplotypic samples. This value is then used together with std_depth
to sample randomly mean coverage values per sample from a normal distribution.
2) a list with the mean coverage per sample. Then, the coverage per site are drawn from a poison distribution per individual.
But I don't think it makes sense to sample error probabilities from a normal distribution. Furthermore, the errors for each nucleotide have their own biases (C>T deamination).
How about using a discrete distribution for the error distribution? This is what we get in practice with PHRED scores in a fastq file, so it should match up pretty well with empirical data.
Ok! I did some analysis to check how de quality scores for each base pair of reads are distributed in real data and how are they correlated.
I download different fastq files from different samples
sample | age | coverage | link |
---|---|---|---|
Branuhi | 0 | ~35 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR757/ERR757817/ERR757817_1.fastq.gz |
Ust'-Ishim | 45,000 | 40.70 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR566/ERR566093/ERR566093_1.fastq.gz |
Young Yana | 862 | 2.03 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR335/001/ERR3351001/ERR3351001.fastq.gz |
Sunghir I | 32,873 | 1.16 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/004/ERR2117984/ERR2117984.fastq.gz |
with the command
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR757/ERR757817/ERR757817_1.fastq.gz
Then, with fastqc
program, I obtained the distribution of quality scores per base of all reads in the fastq files with the following command
fastqc --nogroup ERR757817_1.fastq.gz
And obtained the desired information from the file
unzip ERR757817_1_fastqc.zip
ls ERR757817_1_fastqc/fastqc_data.txt
To read such a file I created a python
function
def read_fastqc_data(file):
quals = []
counts = []
printing = False
with open(file, "r") as file:
for line in file:
if ">>END_MODULE" in line and printing:
break
if printing:
if "#" not in line:
qual, count = line.strip().split()
quals.append(int(qual))
counts.append(int(float(count)))
if ">>Per sequence quality scores" in line:
printing = True
qualsdist = pd.DataFrame({"quality" : quals,
"counts" : counts})
return qualsdist
I then obtained all the distributions for each fastq with
qualsdist_HGDP = read_fastqc_data("ERR757817_1_fastqc/fastqc_data.txt")
qualsdist_USTI = read_fastqc_data("ERR566093_1_fastqc/fastqc_data.txt")
qualsdist_YANA = read_fastqc_data("ERR3351001_fastqc/fastqc_data.txt")
qualsdist_SUNG = read_fastqc_data("ERR2117984_fastqc/fastqc_data.txt")
And plot those distributions in R
library(tidyverse)
library(ggplot2)
library(cowplot)
plot_qualdist <- fucntion(qualsdist){
qualsdist %>%
mutate(counts = counts/sum(counts)) %>%
ggplot() +
geom_bar(stat = "identity", aes(x = quality, y = counts))
}
plot_grid(plot_qualdist(qualsdist_HGDP) + xlim(c(0, 65)) + ylim(0, 0.9) + ggtitle("HGDP (35x)"),
plot_qualdist(qualsdist_USTI) + xlim(c(0, 65)) + ylim(0, 0.9) + ggtitle("Ust'Ishim (40,7x)"),
plot_qualdist(qualsdist_YANA) + xlim(c(0, 65)) + ylim(0, 0.9) + ggtitle("Yana Young (2,03x)"),
plot_qualdist(qualsdist_SUNG) + xlim(c(0, 65)) + ylim(0, 0.9) + ggtitle("Sungir I (1,16x)"))
ncol = 2)
We can see from this figure that the shape of the distributions look quite alike regardless of the coverage and age of the sample, but with different mean and variance values. This suggests to me that if we could find a parametric distribution that behaved like this, we could change the mean and variance values and draw quality values from such distribution. Otherwise, I also imagine the option to input an empirical distribution (fastqc text file output) and use that to sample quality values per base pair. If the user does not specify anything regarding the quality values, then by default we could use one of the HGDP fastq files quality distributions. A final option that the user might also have is to just introduce a quality value which will be constant for all nucleotides simulated.
I took the file produced from the cram file as explained in #22, and first checked the histogram of the distribution
df %>%
ggplot() +
geom_histogram(aes(x = quality), binwidth = 1)
It looks different from the previous distributions, probably because I took a subset of a few reads.
Then, I computed the difference in score between consecutive positions and plotted the difference as a measure of autocorrelation of the quality score between adjacent bases.
Here is the plot of this value per site in the genome
df %>%
group_by(readnum) %>%
mutate(diff = quality-lead(quality)) %>%
filter(!is.na(diff)) %>%
ggplot() +
geom_point(aes(x = positio, y = diff))
The distribution of the value
df %>%
group_by(readnum) %>%
mutate(diff = quality-lead(quality)) %>%
filter(!is.na(diff)) %>%
ggplot() +
geom_histogram(aes(x = diff), bins = 20)
And some summary statistics
df %>%
group_by(readnum) %>%
mutate(diff = quality-lead(quality)) %>%
filter(!is.na(diff)) %>%
ungroup() %>%
summarize(mean = mean(diff), var = var(diff))
> mean var
<dbl> <dbl>
1 0.00104 63.6
In order to compare to what would happen if there was no autocorrelation, I permuted the quality column from the dataset and recalculated the statistic.
df$qualityperm <- sample(x = df$quality, replace = FALSE)
df %>%
mutate(diff = quality-qualityperm) %>%
filter(!is.na(diff)) %>%
summarize(mean = mean(diff), var = var(diff))
> mean var
1 0 95.2269
The variance is smaller in the real dataset, pointing towards autocorrelation. Furthermore, I did 10,000 permutations and checked the variance value for each permutation and then compared it to the real value. None of the permutations had a value equal to or more extreme than the real one. Thus, it is very likely that the quality scores of a read are autocorrelated.
However, since this would be more complicated to simulate than autocorrelation for coverage #22, and might not be so critical for simulating Genotype Likelihoods, for now, I won't simulate such autocorrelation.
I've taken the reads in the original cram file and created a fastq file out of it with
samtools view -T GRCh38_full_analysis_set_plus_decoy_hla.fa HGDP00001.alt_bwamem_GRCh38DH.20181023.Brahui.cram | awk '{print "@"$1"\n"$10"\n+\n"$11}' > HGDP00001.cram2fastq.fastq
I didn't let the whole thing to be processed since it is a file very big. Thus, because I cancelled the job midway, I had to remove a few lines from the end of the file. Then, I read the data with the previous function for the new
qualsdist_HGDP_CRAM = read_fastqc_data("/Users/au552345/GenomeDK/fastqsbams/HGDP00001.cram2fastq_fastqc/fastqc_data.txt")
and plot
qualsdist_HGDP %>%
mutate(counts = counts/sum(counts)) %>%
ggplot() +
geom_bar(stat = "identity", aes(x = quality, y = counts), alpha = 0.5, fill = "blue") +
geom_bar(data = qualsdist_HGDP_CRAM, stat = "identity", aes(x = quality, y = counts/sum(counts)), alpha = 0.5, fill = "red")
We can see that the cram quality distributions have changed quite a lot: quality scores are lower in general and the distribution looks more like bimodal. In the case of simGL
I think the objective is to simulate reads that are already mapped. Thus, It might be more accurate to use empirical distributions from BAM/CRAM files than fastq files. Nonetheless, I don't think it will matter too much other than just introducing variation on the quality measurement.
With your last plot, which colour is the mapped data and which is the unmapped? The mapped data should be a proper subset of the unmapped data, right?
Red is the mapped. I thought the same as you, but then, could it be that there is some recalibration of the quality scores after mapping? or that the quality value scales have been reformated to a different Phred scale?
I don't know, but there's something funny going on there.
E.g. from an empirical distribution observed for a given sequencing platform. These can then be returned to the user in the form of Phred-scaled base quality scores, which some genotype likelihood models use explicitly.