UPHL-BioNGS / Cecret

Reference-based consensus creation
MIT License
49 stars 26 forks source link

bedtools multicov overlap threshold is too low #30

Closed jgarbe256 closed 3 years ago

jgarbe256 commented 3 years ago

Cecret runs bedtools to calculate the read depth of each amplicon using this command: bedtools multicov -bams !{bam} -bed amplicon.bed This counts the number of alignments in the bam file that overlap each amplicon region defined in the bed file. Bedtools' default definition of "overlap" is 1bp, so when bedtools counts up the number of alignments overlapping amplicon 5, it reports the number of alignments for amplicons 4, 5, and 6, since amplicons 4 and 6 overlap 5. If you require an overlap of 50% (-f .5), then the overlapping alignments from amplicons 4 and 6 won't be included in the count. This will bring the number of "failed amplicons" reported by bedtools more in line with the number reported by samtools.

Here's an example of bedtools multicov with the default 1bp overlap: MN908947.3 30 410 nCoV-2019_1 7450 MN908947.3 320 726 nCoV-2019_2 13592 MN908947.3 642 1028 nCoV-2019_3 13859 MN908947.3 943 1337 nCoV-2019_4 10224 MN908947.3 1242 1651 nCoV-2019_5 7396 MN908947.3 1573 1964 nCoV-2019_6 8599 MN908947.3 1875 2269 nCoV-2019_7 6648 MN908947.3 2181 2592 nCoV-2019_8 4817 MN908947.3 2505 2904 nCoV-2019_9 6788 MN908947.3 2826 3210 nCoV-2019_10 10485

And here's the same thing but requiring an overlap of 50% (-f .5) MN908947.3 30 410 nCoV-2019_1 383 MN908947.3 320 726 nCoV-2019_2 6930 MN908947.3 642 1028 nCoV-2019_3 5911 MN908947.3 943 1337 nCoV-2019_4 906 MN908947.3 1242 1651 nCoV-2019_5 3347 MN908947.3 1573 1964 nCoV-2019_6 3047 MN908947.3 1875 2269 nCoV-2019_7 2151 MN908947.3 2181 2592 nCoV-2019_8 1358 MN908947.3 2505 2904 nCoV-2019_9 1283 MN908947.3 2826 3210 nCoV-2019_10 4117

When an overlap of 50% is required bedtools correctly reports the 383 alignments for amplicon 1 and 6,930 for amplicon 2. Running with the default 1bp overlap the number of amplicon 1 alignments is reported as 7,450, which is (approximately) the number of alignments to amplicons 1 and 2 (383 + 6,930 = 7313).

erinyoung commented 3 years ago

This is a great point!

Originally I just grabbed some quick and simple methods for detecting amplicon failure, and I think your suggestion will be very helpful.

erinyoung commented 3 years ago

Thank you for your suggestion, but I ended up not using -f 0.5. The default value is -f 0.1.

This process is only intended to be used as a rough estimate of coverage because I don't put any effort in ensuring that the start and stop of the reads are within the amplicon.

In the next iteration of Cecret, I'm having the end user supply the amplicon file and I've added a free form params.bedtools_multicov_options parameter that can be edited in a config file or on the command line (Simply use --bedtools_multicov_options '-f 0.5')

Here is what I noticed. Using the bedfile of the amplicons created from the artic V3 primers, this is the depth observed for the first ten amplicons with using strictly default values:

$ bedtools multicov -bams test.primertrim.sorted.bam -bed Cecret/configs/nCoV-2019.insert.bed | head
MN908947.3  54  385 1   1   +   8490
MN908947.3  342 704 2   2   +   13818
MN908947.3  664 1004    3   1   +   13726
MN908947.3  965 1312    4   2   +   11357
MN908947.3  1264    1623    5   1   +   8070
MN908947.3  1595    1942    6   2   +   12467
MN908947.3  1897    2242    7   1   +   12620
MN908947.3  2205    2568    8   2   +   6930
MN908947.3  2529    2880    9   1   +   4234
MN908947.3  2850    3183    10  2   +   5437

This may include unintended reads, which I believe is your concern. I then added the iteration of -f 0.1 to see how it changes the numbers, and I observed a filtering of reads that didn't really cover the amplicon region.

$ bedtools multicov -bams test.primertrim.sorted.bam -bed Cecret/configs/nCoV-2019.insert.bed -f 0.1 | head
MN908947.3  54  385 1   1   +   7780
MN908947.3  342 704 2   2   +   12829
MN908947.3  664 1004    3   1   +   12485
MN908947.3  965 1312    4   2   +   10749
MN908947.3  1264    1623    5   1   +   4129
MN908947.3  1595    1942    6   2   +   11363
MN908947.3  1897    2242    7   1   +   11679
MN908947.3  2205    2568    8   2   +   6097
MN908947.3  2529    2880    9   1   +   2751
MN908947.3  2850    3183    10  2   +   2973

As I increased this value to -f 0.4, I noticed an excess of reads that were getting filtered out.

$ bedtools multicov -bams test.primertrim.sorted.bam -bed Cecret/configs/nCoV-2019.insert.bed -f 0.4 | head
MN908947.3  54  385 1   1   +   1472
MN908947.3  342 704 2   2   +   1909
MN908947.3  664 1004    3   1   +   1592
MN908947.3  965 1312    4   2   +   1972
MN908947.3  1264    1623    5   1   +   345
MN908947.3  1595    1942    6   2   +   1931
MN908947.3  1897    2242    7   1   +   1896
MN908947.3  2205    2568    8   2   +   637
MN908947.3  2529    2880    9   1   +   432
MN908947.3  2850    3183    10  2   +   663

And when I reached -f 0.5, no read covered an amplicon enough to consider it covered. It is behaving like this because -f is the percentage of the region in the bedfile covered by the read. With each amplicon being ~300+ bp, there was no way for my ~150 bp MiSeq reads to cover 50% of the amplicon, espcially when the primers are trimmed off. You must be using a different kit.

$ bedtools multicov -bams test.primertrim.sorted.bam -bed  Cecret/configs/nCoV-2019.insert.bed -f 0.5 | head
MN908947.3  54  385 1   1   +   0
MN908947.3  342 704 2   2   +   0
MN908947.3  664 1004    3   1   +   0
MN908947.3  965 1312    4   2   +   0
MN908947.3  1264    1623    5   1   +   0
MN908947.3  1595    1942    6   2   +   0
MN908947.3  1897    2242    7   1   +   0
MN908947.3  2205    2568    8   2   +   0
MN908947.3  2529    2880    9   1   +   0
MN908947.3  2850    3183    10  2   +   0
jgarbe256 commented 3 years ago

Yes, I've been working with 250 or 300bp reads, I see why -f 0.5 wouldn't be appropriate for 150bp reads. I think 0.1 might still be a bit too lenient for 150bp reads. The artic amplicons overlap by 30-60bp, even after primer trimming, so most reads will meet a 10% overlap threshold for the neighboring amplicon region. A 150bp miseq trimmed down to 100bp will have a 25% overlap with a 400bp amplicon region, so it seems like a threshold between 0.1 and 0.25 would be ideal (0.15 or 0.2). Adding functionality to allow the user to override the default setting certainly addresses the issue though.

erinyoung commented 3 years ago

Here at UPHL, we also do covidseq runs that have reads that are ~50 bp after being trimmed. Due to your reasoning, the default of -f 0.1 is likely good enough.

There are many kits that are used to create amplicon based Illumina reads. Plus, there are also amplicon schemes, even for SARS-CoV-2, that use much larger amplicon sizes. I don't think it's possible to create a default that is 100% performative for every situation without putting a lot of effort into this. I could 1) look at the amplicon sizes, 2) look at the read size after primer trimming and go from there, but if I'm going to put that much effort into it, I should probably ensure that the start and end of each read are within their amplicon as well (which isn't completely unreasonable, because artic already has something like this in their pipeline). Then I should package that up into a little tool that would work independently of this workflow, write a paper, and get it out to the community. Unfortunately, I don't have time to put that into practice, and the rough estimate given by bedtools multicov is good enough for my purposes.

One of the great things about github is that other people can contribute to this, and I am happy to review efforts into this area.

You're more than welcome to change the -f value for your own purposes. You can do this in a config file by adjusting the params.bedtools_multicov_options = '-f 0.5' or you can change this on the command line like

nextflow run Cecret.nf -c configs/singularity.config --bedtools_multicov_options '-f 0.5'

Or you can just set params.bedtools_multicov = false and just look at the results from samtools ampliconstats. Although, this comes with the caveat that this has its own issues since ivar trim messes with the cigar string, so some reads count as negative coverage. This is briefly mentioned in https://github.com/UPHL-BioNGS/Cecret/issues/27