ChaissonLab / danbing-tk

Toolkit for VNTR genotyping and repeat-pan genome graph construction
BSD 3-Clause "New" or "Revised" License
21 stars 3 forks source link

Using danbing with 250 bp PE WGS data #13

Closed hchetia closed 3 years ago

hchetia commented 3 years ago

Hi, How can I reliably decide upon -kf for 250 bp paired end reads? The tools help menu suggests 4,1 which are optimized for 150bp paired end reads. Thanks.

joyeuxnoel8 commented 3 years ago

Hi @hchetia ,

The -kf N1 N2 flag invokes a lightweight filter to remove non-VNTR reads if less than N2/N1 kmers are matched. It was intended for speedup by discarding as many non-VNTR reads as possible while keeping all VNTR reads.

In my benchmark on 150 bp PE reads, -kf 4 1 removes about 85\~90% of non-VNTR reads. Theoretically, it keeps all VNTR-reads with up to 3 SNPs and has a 23% chance dropping out VNTR reads with 4 SNPs. Assuming the chance of kmers in non-VNTR reads matching VNTR kmers is uniformly distributed along the sequence, the 85\~90% filtering efficiency corresponds to a random matching rate about 2.6-4.0%.

When extending to 250 PE reads, we might want to scale up the parameters to something like -kf 6 1 -cth 75 to achieve similar filtering efficiency under the same random matching rate. But we also need to consider other parameters that impose more constraints. The threading step in danbing only allows for up to two SNPs or short indels (each <3bp) at this moment and might not work well when too many kmers are unaligned. The threading parameter -gc N3 works better when (read_len - k + 1 - N3) // k <= 2, meaning at most 4 SNPs/indels with 2 edited/aligned and 2 unaligned. This means -gc should be at least 168 for read_len:250, k:21.

Hope this helps, -Tony

hchetia commented 3 years ago

Thanks for the explanation @joyeuxnoel8