Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License
62 stars 8 forks source link

UMIcollapse check issue #27

Closed MardahlM closed 5 months ago

MardahlM commented 5 months ago

Dear Daniel Liu.

I am revisiting my code for miRNA analysis prior to publication.

I am double checking that the deduplication occurs as it should.

I grep'ed two UMI tags and wrote out the lines for each tag prior to UMICollapse and after.

When I grep the tag ACATGCCCGGGA, I have three PCR triplicates correctly collapsed to one each.
The low quality 'Svu-Novel-14-Bantam-04-pre' seems to be discarded properly.

(hidden) ➜ 0L1 samtools view 0L1-aligned-sorted-UMIcollapsed.bam | grep 'ACATGCCCGGGA'

A00126:243:HLFL7DSX3:3:1106:9625:22326_ACATGCCCGGGA 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2455:24777:18411_ACATGCCCGGGA 0 Svu-Novel-6-Bantam-06-pre 7 255 22M 0 0 TGAGATCATCACCATAAGCACA FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1526:7500:13166_ACATGCCCGGGA 0 Svu-Novel-15-Bantam-05-pre 7 255 21M * 0 0 AGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 (hidden) ➜ 0L1 samtools view 0L1-aligned-sorted.bam | grep 'ACATGCCCGGGA'

A00126:243:HLFL7DSX3:3:1106:9778:21934_ACATGCCCGGGA 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1106:9769:21950_ACATGCCCGGGA 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1106:9625:22326_ACATGCCCGGGA 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1455:25418:3803_ACATGCCCGGGA 0 Svu-Novel-6-Bantam-06-pre 7 255 22M 0 0 TGAGATCATCACCATAAGCACA FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1468:22923:20682_ACATGCCCGGGA 0 Svu-Novel-6-Bantam-06-pre 7 255 22M 0 0 TGAGATCATCACCATAAGCACA FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2455:24777:18411_ACATGCCCGGGA 0 Svu-Novel-6-Bantam-06-pre 7 255 22M 0 0 TGAGATCATCACCATAAGCACA FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1377:21712:2206_ACATGCCCGGGA 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGGGATCATGGATATGACCAT ,F,F,FFFFFF,FFF:FF:F, XA:i:1 MD:Z:2A18 NM:i:1 A00126:243:HLFL7DSX3:3:1511:11153:10661_ACATGCCCGGGA 0 Svu-Novel-15-Bantam-05-pre 7 255 21M 0 0 AGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:1526:7988:12352_ACATGCCCGGGA 0 Svu-Novel-15-Bantam-05-pre 7 255 21M 0 0 AGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:1526:7500:13166_ACATGCCCGGGA 0 Svu-Novel-15-Bantam-05-pre 7 255 21M 0 0 AGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0

When I grep another tag CACCGTGGAACC I see the deduplication by UMICollapse as being excessive. I get two resulting reads, but I think there should be more.

(hidden) ➜ 0L1 samtools view 0L1-aligned-sorted-UMIcollapsed.bam | grep 'CACCGTGGAACC'

A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACC 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFF: XA:i:0 MD:Z:23 NM:i:0

(hidden) ➜ 0L1 samtools view 0L1-aligned-sorted.bam | grep 'CACCGTGGAACC'

A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACC 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1462:16866:17331_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFF: XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:1107:23430:2895_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M * 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0

Can you help me explain why these are not deduplicated properly?

Thank you in advance.

Cheers, Maibritt

MardahlM commented 5 months ago

When I subset on the last UMI tag given above UMIcollapse performs correctly:

Arguments [bam, -i, ~/0L1-subset_CACCGTGGAACC.bam, -o, ~/0L1-aligned-sorted-UMIcollapsed.bam] Done reading input file into memory! Number of input reads 9 Number of removed unmapped reads 0 Number of unremoved reads 9 Number of unique alignment positions 5 Average number of UMIs per alignment position 1.0 Max number of UMIs over all alignment positions 1 Number of reads after deduplicating 5 UMI collapsing finished in 0.087 seconds!

BEFORE UMIcollapse (hidden) ➜ 0L1 samtools view 0L1-subset_CACCGTGGAACC.bam A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACC 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1462:16866:17331_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFF: XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:1107:23430:2895_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M * 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0

AFTER UMIcollapse (hidden) ➜ 0L1 samtools view 0L1-aligned-sorted-UMIcollapsed.bam A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACC 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFF: XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M * 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0

Any idea what is going on? or how I could improve?

Cheers, Maibritt

Daniel-Liu-c0deb0t commented 5 months ago

Thanks for reporting this!

Assuming there's no issue in UMICollapse, the only explanation would be that you have some other UMI that is one substitution away from CACCGTGGAACC for your reads aligned to Svu-Bantam-o1-pre, Svu-Bantam-o2-pre, etc. That UMI would also have to appear at a higher frequency than CACCGTGGAACC, causing CACCGTGGAACC to be clustered together with those UMIs, thus causing reads with CACCGTGGAACC to be discarded.

To verify if this is the case, can you try examining the data to see if there are UMIs one substitution away from CACCGTGGAACC? If that is not easy to achieve, perhaps try running UMICollapse again with -k 0 to disallow any substitutions and see if you get the expected result.

MardahlM commented 5 months ago

Dear Daniel,

Thank you for your swift reply. Always nice to get a quick resolution - however, alas I am back with more questions...

I continued and tested -k0, but this didn't seem to help the issue. I made a test sample with 9 entries duplicated these (then 18 entries) and then changed the last nucleotide in every copied entry before testing again. From this I get 8 final entries, which I guess is ok, but I am not fully aware of why and especially why -k0 doesn't work. If I write -k 0 at the very end of my command it doesn't seem to work at all.

Here is my command with -k0 :

Arguments [bam, -i, /Volumes/drylab/Data-Set-Go/Svulgaris2/BAM_UMI_QUANTIFY_ANALYSIS_svulgaris/results_samtools/0L1/0L1-subset_CACCGTGGAACC_sorted.bam, -o, /Volumes/drylab/Data-Set-Go/Svulgaris2/BAM_UMI_QUANTIFY_ANALYSIS_svulgaris/results_samtools/0L1/0L1-aligned-sorted-UMIcollapsed_test.bam, -k0] Done reading input file into memory! Number of input reads 18 Number of removed unmapped reads 0 Number of unremoved reads 18 Number of unique alignment positions 5 Average number of UMIs per alignment position 2.2 Max number of UMIs over all alignment positions 3 Number of reads after deduplicating 8 UMI collapsing finished in 0.085 seconds!

These are my before and after files:

(hidden) ➜ 0L1 samtools view 0L1-subset_CACCGTGGAACC_sorted.bam A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACC 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACT 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACT 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACCT 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1462:16866:17331_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:1462:16866:17331_CACCGTGGAACT 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACT 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:1107:23430:2895_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:1107:23430:2895_CACCGTGGAACT 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACT 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 (hidden) ➜ 0L1 samtools view 0L1-aligned-sorted-UMIcollapsed_test.bam A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACT 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACT 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFFXA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACT 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0

Can you tell me whether the -k is wrongly put in the command?

If I put in a space between -k and 0 I get no deduplication and this is written to the terminal:

Arguments [bam, -i, /Volumes/drylab/Data-Set-Go/Svulgaris2/BAM_UMI_QUANTIFY_ANALYSIS_svulgaris/results_samtools/0L1/0L1-subset_CACCGTGGAACC_sorted.bam, -o, /Volumes/drylab/Data-Set-Go/Svulgaris2/BAM_UMI_QUANTIFY_ANALYSIS_svulgaris/results_samtools/0L1/0L1-aligned-sorted-UMIcollapsed_test.bam, -k, 0] Done reading input file into memory! Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 at umicollapse.data.NgramBKTree$Node.initNode(NgramBKTree.java:137) at umicollapse.data.NgramBKTree.insertBKTree(NgramBKTree.java:103) at umicollapse.data.NgramBKTree.insert(NgramBKTree.java:58) at umicollapse.data.NgramBKTree.init(NgramBKTree.java:28) at umicollapse.algo.Directional.apply(Directional.java:32) at umicollapse.main.DeduplicateSAM.lambda$deduplicateAndMerge$1(DeduplicateSAM.java:160) at java.base/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850) at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762) at umicollapse.main.DeduplicateSAM.deduplicateAndMerge(DeduplicateSAM.java:147) at umicollapse.main.Main.main(Main.java:212) [main_samview] fail to read the header from "/Volumes/drylab/Data-Set-Go/Svulgaris2/BAM_UMI_QUANTIFY_ANALYSIS_svulgaris/results_samtools/0L1/0L1-aligned-sorted-UMIcollapsed_test.bam".

I cross my fingers you can tell me what is wrong

Cheers, Maibritt

Daniel-Liu-c0deb0t commented 5 months ago

Thanks for testing it out! It seems like this is a problem with UMICollapse, as -k 0 should work (I recall it working in the past). I'll debug this on my end and get back to you.

Daniel-Liu-c0deb0t commented 5 months ago

I can reproduce your issues successfully. Will debug now.

Daniel-Liu-c0deb0t commented 5 months ago

Ok so the issue is that one of your UMIs is length 13 while the others are length 12. I've updated UMICollapse to be more reasonable when dealing with UMIs of different lengths: the first alignment in the file dictates the UMI length, and any UMIs longer are truncated to be the same length.

I have confirmed that running on your manually duplicated alignments with -k 0 works correctly now. -k 1 (the default) also works correctly. Please give the updated UMICollapse a try on your data!

Going back to the original question, you have to pick whether to use -k 0 (more stringent and smaller clusters, will only cluster UMIs by identity) or -k 1 (larger clusters, allow 1 substitution) based on what's best for your data.

MardahlM commented 5 months ago

OK, my mistake that one UMI was one nt longer - we only ordered 12nt UMIs - could there be a non-negligible error margin on that? Anyway, I just made an exact copy and manually edited the UMI. The uneven UMI lengths was not my intention - and exactly why I don't rely on humans to sort big data. For my human eyes I just redid the copying to make sure there is a duplicate of each sam entry with exactly one nt substitution at the end of the UMI. I just want to make sure I know how UMIcollapse works. I am looking for miRNA copies and therefore I probably need to use -k 0, however, there is very small likelihood that there is a base miscalling in a umi. When I re-compile your latest UMIcollapes -k 0 and -k 1 works. However, I am not fully satisfied with my understanding of UMIcollapse. I am hoping you can help me one more time.

Here is the test data without the longer UMI: (hidden) ➜ 0L1 samtools view 0L1-subset_CACCGTGGAACC.bam A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACC 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1462:16866:17331_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFF: XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:1107:23430:2895_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACT 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACT 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1557:32380:8531_CACCGTGGAACT 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT ,FFFFFFFFFFFFF:F:FFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1462:16866:17331_CACCGTGGAACT 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACT 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFF: XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:1107:23430:2895_CACCGTGGAACT 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACT 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0

Here is the -k 0 data

Sample 0L1: Samtools Taking mapped reads into a bam file Sample 0L1: Samtools Sorting the bam file in order of genomic coordinates – left to right Sample 0L1: Samtools Indexing the bam file for quick access downstream Generate idxstats Sample 0L1: UMIcollapse on bam files Arguments [bam, -i, /Volumes/drylab/Data-Set-Go/hidden/results_samtools/0L1/0L1-subset_CACCGTGGAACC_sorted.bam, -o, /Volumes/drylab/Data-Set-Go/Svulgaris2/hidden/results_samtools/0L1/0L1-aligned-sorted-UMIcollapsed_test.bam, -k, 0] Done reading input file into memory! Number of input reads 18 Number of removed unmapped reads 0 Number of unremoved reads 18 Number of unique alignment positions 5 Average number of UMIs per alignment position 2.0 Max number of UMIs over all alignment positions 2 Number of reads after deduplicating 10 UMI collapsing finished in 0.14 seconds! A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACC 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACT 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFF: XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACT 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGA FFFFFFFFFFFFFFFFFFFFFF: XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACT 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0

Here is the -k 1 data:

Arguments [bam, -i, /Volumes/drylab/Data-Set-Go/hidden/results_samtools/0L1/0L1-subset_CACCGTGGAACC_sorted.bam, -o, /Volumes/drylab/Data-Set-Go/hidden/results_samtools/0L1/0L1-aligned-sorted-UMIcollapsed_test.bam, -k, 1] Done reading input file into memory! Number of input reads 18 Number of removed unmapped reads 0 Number of unremoved reads 18 Number of unique alignment positions 5 Average number of UMIs per alignment position 2.0 Max number of UMIs over all alignment positions 2 Number of reads after deduplicating 9 UMI collapsing finished in 0.146 seconds! A00126:243:HLFL7DSX3:3:1153:19723:34710_CACCGTGGAACT 0 Svu-Mir-279-o42-pre 7 255 22M 0 0 CTGGGTACGCATCTTAGTCGCG FFFFFFFFFFFFF:FFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o2-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACC 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGAFFFFFFFFFFFFFFFFFFFFFF: XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2122:2853:8093_CACCGTGGAACT 0 Svu-Mir-87-o1-pre 41 255 23M 0 0 TTGAGCAACGAATATGTGCGGGAFFFFFFFFFFFFFFFFFFFFFF: XA:i:0 MD:Z:23 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACC 0 Svu-Novel-14-Bantam-04-pre 7 255 21M 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0 A00126:243:HLFL7DSX3:3:2226:25852:27132_CACCGTGGAACT 0 Svu-Novel-14-Bantam-04-pre 7 255 21M * 0 0 TGAGATCATGGATATGACCAT FFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:21 NM:i:0

In my naive opinion -k 1 seems better, however, I see reads that could be collapsed, but aren't. It seems that only two reads could be collapsed. What about these:

A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0

Why aren't these collapsed?

Bantam is miRNA reported as one of the most abundant miRNAs according to the literature, and if UMIcollapses work, we could highlight what the stoichiometry really is like - not at all what the paper is about, but a curious thought!

Cheers and thanks for the easter script re-work. Much appreciated.

/Maibritt

MardahlM commented 5 months ago

Now that I revisit this, I don't understand why -k 0 is not returning more than 10 reads.

Daniel-Liu-c0deb0t commented 5 months ago

-k 0 seems to work fine? Each reads that map to the same reference and have the same UMI are collapsed.

For -k 1, UMICollapse uses the directional algorithm, which essentially clusters UMI v to UMI u if freq(v) <= freq(u) / 2 and u and v differ by only a single substitution. freq is a function that indicates how often a single UMI appears exactly. This avoids some overclustering by assuming that erroneous UMIs are rare. The exception to this rule is when freq(u) = freq(v) = 1, which is clustered (reason is because it helps cluster together many rare erroneous UMIs).

MardahlM commented 5 months ago

So for -k 0, essentially this means that for this example freq(u) = freq(v) = 1 and the reads are therefore not further clustered?

A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACC 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0 A00126:243:HLFL7DSX3:3:2556:31259:35869_CACCGTGGAACT 0 Svu-Bantam-o1-pre 46 255 22M 0 0 TGAGATCACGCGTATATTCGCT FFFFFFFFFFFFFFFFFFFFFF XA:i:0 MD:Z:22 NM:i:0

Daniel-Liu-c0deb0t commented 5 months ago

For -k 0, those two are not clustered together because the UMIs differ by one bp.

For -k 1, both have frequencies of 2 (not 1), so they are not clustered together. If they have frequencies of 1, then they would be clustered together.

MardahlM commented 5 months ago

Thanks a lot for your help Daniel.

FYI, as I am switching between computers (processors) I thought I'd let you know that snappy version needs to be 1.1.84 or later for Mac with an M1 processor running aarch64.

WARNING 2024-04-03 19:07:33 SnappyLoader Snappy native library failed to load. org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Mac and os.arch=aarch64