fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License

GroupReadsByUmi filtered reads number does not cover all filtered reads? #972

Open milnikol opened 6 months ago

milnikol commented 6 months ago

Hi!

I have a question about filtered reads within GroupReadsByUmi.

My input BAM file has 9 945 567 reads. When I ran GroupReadsByUmi, the log accounted for only 9 941 432 of them:

[2024/03/08 10:09:49 | GroupReadsByUmi | Info] Accepted 9,657,892 reads for grouping.
[2024/03/08 10:09:49 | GroupReadsByUmi | Info] Filtered out 84,016 reads due to mapping issues.
[2024/03/08 10:09:49 | GroupReadsByUmi | Info] Filtered out 199,524 reads that contained one or more Ns in their UMIs.

When I add the number of reads accepted for grouping (9 657 892), the number of reads filtered out due to mapping issues (84 016), and the number of reads that contained Ns in their UMIs (199 524), I get 9 941 432.

Does this mean that another 4 135 reads (9 945 567 - 9 941 432) were also filtered out? What are the other reasons for filtering reads?
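The arithmetic above can be double-checked with a short Python sketch. The counts are copied from the log lines and the input total quoted in this post; the variable names are only illustrative.

```python
# Counts copied from the GroupReadsByUmi log output and the stated input size.
total_input = 9_945_567          # reads in the input BAM, as reported above
accepted = 9_657_892             # "Accepted ... reads for grouping"
filtered_mapping = 84_016        # "Filtered out ... due to mapping issues"
filtered_ns_in_umi = 199_524     # "Filtered out ... Ns in their UMIs"

# Reads that the three log lines account for.
accounted_for = accepted + filtered_mapping + filtered_ns_in_umi

# Reads that do not appear in any of the three log lines.
unaccounted = total_input - accounted_for

print(f"accounted for: {accounted_for:,}")
print(f"unaccounted:   {unaccounted:,}")
```

The sum and the difference match the figures quoted in the question, so the gap comes from reads that are not reported in these three log lines.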

Thank you very much!

nh13 commented 6 months ago

Do you have any secondary or supplementary alignments? Those are filtered out, along with reads that match several other criteria. See the list under "During grouping, reads and templates are filtered out as follows" here: https://fulcrumgenomics.github.io/fgbio/tools/latest/GroupReadsByUmi.html
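One way to check this is to count the records whose SAM FLAG has the secondary (0x100) or supplementary (0x800) bit set. The sketch below is a minimal illustration, not part of fgbio: the helper name `count_secondary_supplementary` is hypothetical, and it assumes the alignments are available as SAM text lines (for example, the text output of `samtools view`). Whether these records explain all 4 135 missing reads is not known from this thread.

```python
SECONDARY = 0x100       # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM FLAG bit: supplementary alignment


def count_secondary_supplementary(sam_lines):
    """Count secondary and supplementary records in an iterable of SAM text lines.

    Hypothetical helper for illustration; header lines (starting with '@')
    and blank lines are skipped.
    """
    counts = {"secondary": 0, "supplementary": 0, "either": 0}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        # The FLAG is the second tab-separated column of a SAM record.
        flag = int(line.split("\t")[1])
        is_secondary = bool(flag & SECONDARY)
        is_supplementary = bool(flag & SUPPLEMENTARY)
        counts["secondary"] += is_secondary
        counts["supplementary"] += is_supplementary
        counts["either"] += is_secondary or is_supplementary
    return counts


# Tiny made-up example: one primary, one secondary, one supplementary record.
example = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60",
    "r2\t355\tchr1\t200\t0",
    "r3\t2147\tchr2\t300\t60",
]
print(count_secondary_supplementary(example))
```

If the "either" count for the real input is close to the number of unaccounted reads, secondary and supplementary alignments are the likely explanation; otherwise the remaining criteria in the linked list are worth checking.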