Open HenrikSpiegel opened 2 years ago
From the ART documentation (the article's supplementary material) it is stated that:
the overall error rate of single-end 35bp reads from Illumina Genome Analyzer I (GA-I) is 2.26%
However, we have so far found no documentation of the overall error rate for the mbarc profile or the other ART profiles.
We couldn't use the GSA to get the error rate, as the mappings in those BAM files are "perfect".
However, using the following commands we were able to approximate the error rate for WGSIM:

```shell
# 3% error
minimap2 -t 30 -a input_genomes/combined.fa sim/0_2GB/sample_0/reads/anonymous_reads.fq.gz > mapped/aln_s0.sam
minimap2 -t 30 -a input_genomes/combined.fa sim/0_2GB/sample_1/reads/anonymous_reads.fq.gz > mapped/aln_s1.sam

# 5% error
minimap2 -t 30 -a input_genomes/combined.fa sim/0_3GB/sample_0/reads/anonymous_reads.fq.gz > mapped/aln_s0_5.sam
minimap2 -t 30 -a input_genomes/combined.fa sim/0_3GB/sample_1/reads/anonymous_reads.fq.gz > mapped/aln_s1_5.sam

samtools stats mapped/aln_sX_X.sam | grep ^SN
```
From the `SN` lines of the samtools stats output we found:
| Simulator | Error profile | Sample size (GB) | Mapped error rate (samtools) |
|---|---|---|---|
| WGSIM | 3% | 0.2 | 2.86% |
| WGSIM | 5% | 0.3 | 4.65% |
| ART | mbarc | 0.1 | 3.06% |
| ART | mbarc | 0.2 | 3.06% |
| ART | mbarc | 0.3 | 3.06% |
| ART | mbarc | 0.5 | 3.05% |
It appears from the WGSIM investigations that we can estimate (if slightly underestimate) the error rate by mapping the reads back to the input genomes.
NOTE: The reason we underestimate with samtools is that the error rate is computed over mapped reads only. A proportion of reads fail to map, and those reads likely contain errors, which are then excluded from the estimate.
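Since `samtools stats` reports the error rate in its `SN` section, the value can also be pulled out programmatically. A minimal sketch (the `SN<TAB>error rate:` line format below is an assumption about the samtools output, worth double-checking against your version):

```python
def parse_error_rate(stats_text: str) -> float:
    """Parse the 'error rate' value from the SN section of `samtools stats` output."""
    for line in stats_text.splitlines():
        if line.startswith("SN\terror rate:"):
            # SN lines are tab-separated: SN <tab> key: <tab> value [<tab> comment]
            return float(line.split("\t")[2])
    raise ValueError("no 'error rate' SN line found")

# Example on a fragment of samtools stats output:
example = "SN\terror rate:\t2.860000e-02\t# mismatches / bases mapped (cigar)"
print(parse_error_rate(example))  # 0.0286
```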
We want to quantify the kmer-specific information loss in the sequencing process. Here we are mainly working with two concepts that differ from mapping.
Kmer counts are absolute in that there are no fuzzy kmers: either two kmers are identical or they are not. Quantification is therefore far more sensitive to the error rate than mapping, which can allow, say, 97% identity to the reference.
Furthermore, a single mutation affects ('destroys') every kmer in which it is included, as visualized below:
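This overlap effect can also be sketched directly (hypothetical helper, 0-based positions, not code from the repo): a single error at position p invalidates every kmer whose k-window covers p, i.e. up to k kmers.

```python
def kmers_hit_by_error(p: int, L: int, k: int) -> list:
    """0-based start positions of the kmers (length k, read length L) covering position p."""
    first = max(0, p - k + 1)
    last = min(p, L - k)  # a kmer can only start where a full window fits
    return list(range(first, last + 1))

# An error in the middle of a read destroys k consecutive kmers:
print(len(kmers_hit_by_error(p=50, L=150, k=21)))  # 21
# At the very first base only a single kmer is affected:
print(kmers_hit_by_error(p=0, L=150, k=21))  # [0]
```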
Thus, for a true kmer to survive from genome to count, all of its bases must be correctly sequenced. Thus:
Given the error rate r, the probability of the event B that base b is correctly sequenced is:
P(B) = 1 - r
If we then assume that correctly sequencing base b_i and base b_(i+1) are independent events, we can apply the rule of product, P(A∩B) = P(A) × P(B).
Hence the probability of a kmer being correctly sequenced simplifies to:
P_correct = (1-r)^k
Where k is the length of the kmer.
This entails that for r = 3% and k = 21 we expect ~52% of the kmers in the reads to be true kmers. In other words, the raw counts will underestimate the true count by about a factor of 2 from the error loss alone (this assumes that the erroneous kmers "disappear"). Thus, in order to convert counts to pseudo-coverage we should apply the following correction:
count_corrected = count / (1-r)^k
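As a numeric illustration of this correction (a sketch, not code from the repo):

```python
def p_correct(r: float, k: int) -> float:
    """Probability that all k bases of a kmer are sequenced correctly."""
    return (1 - r) ** k

def correct_count(count: float, r: float, k: int) -> float:
    """Scale an observed kmer count up to an estimate of the true count."""
    return count / p_correct(r, k)

print(round(p_correct(0.03, 21), 3))           # 0.527 -> ~52% of 21-mers survive at r=3%
print(round(correct_count(100, 0.03, 21), 1))  # 189.6
```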
A kmer is defined as a k-length sequence, implemented as the substring of a read starting at position i and running to position i+k-1. It is therefore only defined for i ≤ L-k+1, where L is the read length and k the kmer length.
Thus the last k-1 positions have no defined kmer, as the end of the mer would exceed the read. Illustrated below:
Thus, using the illustration, say we only get 1 read for the region. Then:
count = 1 for i ≤ 9
count = 0 for i > 9
where i is the start position of the kmer within the read.
So for the example above, with a read of L=11 and kmer length k=3, we 'lose' 2 kmers, or rather the kmers starting at i=10,11 are not defined.
A read which should have given rise to 11×1 counts only gives 9. We lose 2/11 ≈ 18% of the count information in the read due to edge effects.
This can be generalized to:
Loss_edge = (L - (L-k+1)) / L = (k-1)/L
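The edge-loss formula can be checked numerically (a sketch; the first value matches the 2/11 example above):

```python
def edge_loss(L: int, k: int) -> float:
    """Fraction of per-position counts lost because the last k-1 starts have no full kmer."""
    return (k - 1) / L

print(round(edge_loss(11, 3), 3))    # 0.182 -> the 2/11 example
print(round(edge_loss(150, 21), 3))  # 0.133 -> for 150bp reads and 21-mers
```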
see [notebooks/SimulationSupportForCorrections.ipynb]
Simulate reads with a given error rate and determine the recovery of the correct kmers.
Assumptions:
- Same base across the full length of the underlying sequence.
- Sequencing errors are independent.
- Sequencing errors only give rise to non-counted kmers.
- The read is not at the end of the sequence.
Since the underlying sequence is all "AAA..." and we only have 1 sequence, the true coverage is 1 at all positions, which corresponds to 150 21-mers of "AAA..." after controlling for the edge effect and taking errors into account.
```python
import numpy as np

# Simulation parameters (assumed values; defined elsewhere in the notebook)
L, k = 150, 21      # read length, kmer length
err_rate = 3        # per-base error rate in percent
experiments = 1000

def kmerize(seq, k):
    """All kmers of length k in seq, as an array of strings."""
    return np.array(["".join(seq[i:i + k]) for i in range(len(seq) - k + 1)])

# Run simulation
seq_true = np.repeat("A", L)
true_kmer = "A" * k
mutations = np.zeros(experiments)
count = np.zeros(experiments)
for i in range(experiments):
    # Each base mutates with probability err_rate/100
    is_mutated = np.random.randint(1, 101, L) <= err_rate
    mutations[i] = is_mutated.sum()
    seq_obs = seq_true.copy()
    seq_obs[is_mutated] = "B"
    kmers_obs = kmerize(seq_obs, k)
    count[i] = (kmers_obs == true_kmer).sum()
```
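As a sanity check (a self-contained sketch with assumed parameters, not the notebook's code), the mean recovered count can be compared against the analytic prediction (L-k+1)·(1-r)^k:

```python
import numpy as np

rng = np.random.default_rng(0)
L, k, r, experiments = 150, 21, 0.03, 2000

counts = []
for _ in range(experiments):
    ok = rng.random(L) >= r  # True where the base was sequenced correctly
    # A kmer survives iff all k of its bases are correct
    counts.append(sum(ok[i:i + k].all() for i in range(L - k + 1)))

expected = (L - k + 1) * (1 - r) ** k  # ~68.6 surviving 21-mers per read
print(np.mean(counts), expected)
```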
A better version of the kmer-correction experiment has been added at the bottom of the notebook and as analysis/06.
Here is a preliminary plot - it should be rerun with a higher number of experiments and more error rates.
Build a small genome "synthesizer" which generates genomes of varying GC content to test the influence of GC content on the error profile.
We found that GC content had no significant impact on error rate.
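A minimal version of such a synthesizer (a hypothetical sketch, not the repo's implementation) simply draws bases i.i.d. with the desired GC probability:

```python
import numpy as np

def synthesize_genome(length: int, gc: float, seed: int = 0) -> str:
    """Random genome whose expected GC content is `gc`."""
    rng = np.random.default_rng(seed)
    probs = [gc / 2, gc / 2, (1 - gc) / 2, (1 - gc) / 2]  # G, C, A, T
    return "".join(rng.choice(list("GCAT"), size=length, p=probs))

genome = synthesize_genome(10_000, gc=0.65)
print(sum(b in "GC" for b in genome) / len(genome))  # close to 0.65
```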
Conclusions from: #13
We found that correcting the kmer count as count / correction, with correction = (1-err)^k, seems to work out fairly well.
We still need to handle the information loss from the end of reads, i.e. the "kmers" starting at positions i = N-k+2 to N are lost (N = read length). Thus if we have read length 120 and k = 20, the kmers starting at the last 19 positions are lost, since we "run out of read".
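Combining the two corrections, a hedged sketch of a full count-to-coverage conversion (the function name and framing are ours, not from the repo) could be:

```python
def corrected_coverage(raw_count: float, r: float, k: int, L: int) -> float:
    """Correct a raw kmer count for sequencing errors and read-edge loss.

    Assumes independent per-base errors (rate r) and that each read of
    length L contributes only L-k+1 kmer starts instead of L.
    """
    error_survival = (1 - r) ** k    # P(kmer sequenced without error)
    edge_survival = (L - k + 1) / L  # fraction of positions with a defined kmer
    return raw_count / (error_survival * edge_survival)

print(corrected_coverage(100, 0.03, 21, 150))
```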
However, for wgsim we also observed two "groups" of BGCs, one overcounted and one undercounted, with the overcounting being unexpected:
WGSIM 3% error
However, for the ART mbarc error profile we did not see the same trend, and in this case mapping performed much better than it did for wgsim:
The better mapping performance is likely due to longer reads.
Tasks: