HenrikSpiegel commented 2 years ago

Conclusions from: #13

We found that correcting the kmer count by correction = (1-err)^k (count / correction) seems to work out fairly well.

We still need to handle the information loss from the end of reads. Ie. the "kmers" from position i=N-k-1:N is lost (N = read lenght). Thus if we have readlength 120 and k = 20 kmers starting at the last 19 positions are lost since we "run out of read".

However, we also observed that for wgsim we observed two "groups" of bgc 1 overcounted 1 undercounted with the overcount being unexpected:
WGSIM 3% error

However for ART-mbarc error profile we did not see the same trend, and in this case mapping performed much better than wgsim:

The better mapping performance is likely due to longer reads.

Tasks:

[ ] Determine and test an read length information loss and test if it works.
- [ ] Figure out how to test this - perhaps we could also just do the statistical argumentation for this?
[ ] Determine the overall error rate for ART-mbarc profile.
- [ ] Finding more information on the profile in general would be really beneficial.
- [ ] Can we estimate the error rate adhoc from mapping quality?
[ ] OPTIONAL: Look into the difference between two "groups" found in WGSIM

HenrikSpiegel commented 2 years ago

From ART documentation: ART article supplementary it was stated that:

the overall error rate of single-end 35bp reads from Illumina Genome Analyzer I (GA-I) is 2.26%

However, no documentation on the mbarc profile or other ART profiles overall error rate has been found so far.

HenrikSpiegel commented 2 years ago

Read-error estimation.

We couldn't use the GSA to get error rate as it is "perfect" maps that exist in those bam files.

However using the following we were able to approximate the error rate for WGSIM:

# 3% error
minimap2 -t 30 -a input_genomes/combined.fa sim/0_2GB/sample_0/reads/anonymous_reads.fq.gz > mapped/aln_s0.sam
minimap2 -t 30 -a input_genomes/combined.fa sim/0_2GB/sample_1/reads/anonymous_reads.fq.gz > mapped/aln_s1.sam

# 5% error
minimap2 -t 30 -a input_genomes/combined.fa sim/0_3GB/sample_0/reads/anonymous_reads.fq.gz > mapped/aln_s0_5.sam
minimap2 -t 30 -a input_genomes/combined.fa sim/0_3GB/sample_1/reads/anonymous_reads.fq.gz > mapped/aln_s1_5.sam 

samtools stats mapped/aln_sX_X.sam | grep ^SN

From the line samtools stats output we found:

Error type	Profile	Sample size (GB)	Mapped Error rate (samtools)
WGSIM	3%	0.2	2.86%
	5%	0.3	4.65%
ART	mbarc	0.1	3.06%
		0.2	3.06%
		0.3	3.06%
		0.5	3.05%

It appears from the WGSIM investigations that we can estimate (if slightly underestimate) the error rate by mapping back the reads to the input genomes.

NOTE The reason we are underestimating using samtools is that the error rate is given based on mapped reads. Thus, there is a proportion of unmapped reads which likely contains errors which are then not included in the error estimate since the read they are sitting on isn't mapped.

HenrikSpiegel commented 2 years ago

Information loss between Reads and Kmer counts

Information loss from errors

We want to quantify the information loss which is kmer specific in the read process. Here we are mainly working with two concepts which are different from mapping.

Kmer counts are absolute in that we do not have fuzzy kmers - either they are identical or they are not. Therefore, the quantification is dependent on the error rate when compared to mapping which can allow say 97% identity with the reference.

Furthermore, a single mutation affects ('destroys') all kmers which with it is included as can be visualized as below:

Thus for a true kmer to survive from genome to count all its bases must be correctly sequenced. Thus:
Given the error rate r then the probability of the event B that_ base b is correctly sequence is given by:

P(B) = 1-r

If we then assume that the probability of correctly sequencing base bi and b(i+1) is dependent events we can apply the Rule of Product P(A∩B)=P(A)×P(B)

Why the probability of a kmer being correctly sequenced can be simplified to:

P_correct = (1-r)^k

Where k is the length of the kmer.

Which entails that for a r=3% and k=21 we expect to get ~52% of the kmers in the reads to be true kmers. Or in other words the raw counts will underestimate the true count by about 1/2 from the error loss alone. (This assumes that the erroneous kmers "disappears"). Thus, we in order to convert counts to pseudo coverage we should employ following correction:

count_corrected = count / (1-r)^k

Information loss from edge effect:

A kmer is defined as a k-length sequence, and is implemented as the substring of a read starting at position i going to position i+k. And is therefore only defined for i = L-k+1, where L is length of read and k is kmer length.

Thus the last k-1 positions have not a defined kmer as the end of the mer exceeds the read. Illustrated below:

Thus if we use the illustration and say that we only get 1 read for the region. Then:

count = 1 for i < 9
count = 0 for i > 9
where i is the position given by the start of the read.

So for the example above for a read of L=11 and kmer length k=3 we 'loose' 2 kmers or rather the kmers starting at i=10,11 is not defined.

A read which should have given rise to 11*1 counts only gives 9. We loose 2/11=18% of the count information in the read from loss due to edge effects.

This can be generalized to:

Loss_edge = (L - (L-k+1)/L = (k-1)/L

HenrikSpiegel commented 2 years ago

Testing the corrections:

see [notebooks/SimulationSupportForCorrections.ipynb]

Simple simulation to support correction formulas

Idea

Simulate reads with a given error rate and determine the recovery of the correct kmers.

Simplifications:

Same base in full length of underlying seqeunce.
Sequencing errors are independent.
Sequencing errors only gives rise to non-counted kmers.
We assume that the read is not at the end

Since the underlying sequence is all "AAA...." and we only have 1 sequence then true coverage is 1 at all positions which equals 150 21-mers of ("AAA.") after controlling for edge effect and taking into account errors.

Main code snippet:

# Run simulation

seq_true = np.repeat("A", L)
true_kmer = "".join("A" for x in range(k))

mutations = np.zeros(experiments)
count = np.zeros(experiments)

for i in range(experiments):
    is_mutated = np.random.randint(1,100, L) <= err_rate
    mutations[i] = sum(is_mutated)
    seq_obs = seq_true.copy()
    seq_obs[is_mutated] = "B"
    kmers_obs = kmerize(seq_obs, k)
    count[i] = sum(kmers_obs == true_kmer)

Results:

newplot (17) newplot (18) newplot (19)

HenrikSpiegel commented 2 years ago

A better version of the experiment regarding kmer-correction has been add in the bottom of the notebook and as analysis/06

Here is a preliminary plot - should be rerun with higher number of experiments and more error rates. newplot (20)

HenrikSpiegel commented 2 years ago

Build a small genome "synthesizer" which generates genomes of varying GC content to test the influence of GC content on the error profile.

HenrikSpiegel commented 2 years ago

17

We found that GC content had no significant impact on error rate.

HenrikSpiegel / Screener

Summary of experiments regarding simulation and kmers -> Understand simulation better. #14

Read-error estimation.

Information loss between Reads and Kmer counts

Information loss from errors

Information loss from edge effect:

Testing the corrections:

Simple simulation to support correction formulas

Idea

Simplifications:

Main code snippet:

Results:

17