CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
491 stars 190 forks source link

Dedup error: 'a' cannot be empty unless no samples are taken #625

Closed NordinZandhuis closed 9 months ago

NordinZandhuis commented 9 months ago

Hello!

Many thanks for making this package.

I have run into an error while trying to run the dedup function.

My bam files are indexed using Samtools and are structured like this:

SRR5335818:2378778:2378778:1:RANDOM:BARCODE:DDD_ATTGGCA:ATCG:ATA 272 1 3083242 0 18M2S 0 0 TTTTTTTTTTGTTTAACTGT ?EE0?GE00B1D3BGGF1GF NH:i:10 HI:i:4 AS:i:17 nM:i:0 SRR5335818:1873368:1873368:1:RANDOM:BARCODE:DDD_ACTAACA:ATCG:CTT 256 1 3133576 0 39M 0 0 TTTATTTATTTATTTATTATACGTAAGTACACTGTAGCT GGGGGGGGHHHHHHHHHHHHHFHHHHHHHHHHHHHHHHH NH:i:7 HI:i:2 AS:i:38 nM:i:0 SRR5335818:1886451:1886451:1:RANDOM:BARCODE:DDD_ACTAACA:ATCG:CTT 256 1 3133576 0 39M 0 0 TTTATTTATTTATTTATTATACGTAAGTACACTGTAGCT GGGGGGGGHHHHHHHHHHHHHFHHHHHHHHHHHHHHHHH NH:i:7 HI:i:2 AS:i:38 nM:i:0 SRR5335818:1949915:1949915:1:RANDOM:BARCODE:DDD_TGGCATC:ATCG:ATC 0 1 3133576 0 44M 0 0 TTTATTTATTTATTTATTATACGTAAGTACACTGTAGCTGTCTT GGGGGGGGHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHH NH:i:5 HI:i:1 AS:i:43 nM:i:0 SRR5335818:1647744:1647744:1:RANDOM:BARCODE:DDD_AGTTCGT:ATCG:ACT 256 1 3172271 1 26M 0 0 TGAACGATACAGAGAAGATTAGCATG 1BFEFACGFEHHHGFHHGHHG1FHG1 NH:i:3 HI:i:3 AS:i:25 nM:i:0 SRR5335818:1674043:1674043:1:RANDOM:BARCODE:DDD_AGTTCGT:ATCG:ACT 256 1 3172271 1 26M 0 0 TGAACGATACAGAGAAGATTAGCATG 11BGGGGGGGHHHHHHHHHHHGHHHH NH:i:3 HI:i:3 AS:i:25 nM:i:0 SRR5335818:3058822:3058822:1:RANDOM:BARCODE:DDD_TTGTTTC:ATCG:CCA 0 1 3238747 3 39M 0 0 TATTTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA FGFGFDFGHHHHGGGGGGGGGGGGGGGGGGGGGGGGGGG NH:i:2 HI:i:1 AS:i:38 nM:i:0 SRR5335818:201076:201076:1:RANDOM:BARCODE:DDD_TACCCTT:ATCG:CCT 256 1 3269562 3 18M2S 0 0 TTTACAGTGGTTCTAAAAAA EGGGGGGGHHHGHHHHHGHC NH:i:2 HI:i:1 AS:i:17 nM:i:0 SRR5335818:267434:267434:1:RANDOM:BARCODE:DDD_TACCCTT:ATCG:CCT 256 1 3269562 3 18M2S 0 0 TTTACAGTGGTTCTAAAAAA GGGGGGGGHHGGHHHHHHHG NH:i:2 HI:i:1 AS:i:17 nM:i:0 SRR5335818:296023:296023:1:RANDOM:BARCODE:DDD_TACCCTT:ATCG:CCT 256 1 3269562 3 18M2S 0 0 TTTACAGTGGTTCTAAAAAA GGGGGGGGHHGGHHHHHHHG NH:i:2 HI:i:1 AS:i:17 nM:i:0

The UMIs are called RANDOM and are the first 7 nucleotides after: RANDOM:BARCODE:DDD_

I am trying to run dedup on these BAM files using the following code:

for f in .bam do prefix="${f%.bam}" srr_part=$(echo "$prefix" | grep -o -e 'SRR[^.]')

echo "Busy with ${f}"

umitools dedup --stdin=${f} --extract-umi-method=tag --umi-separator="" --umi-tag="RANDOM:BARCODE:DDD" --umi-tag-split=":" --output-stats=${f} --log="$srr_part"LOGFILE -S deduplicated${f}

I unfortunately get the following error:

ValueError: 'a' cannot be empty unless no samples are taken

It seems like the UMIs are not able to be detected by umi tools, as the log file mentions: INFO total_umis 0

Some help would be greatly appreciated!

Many thanks,

Nordin

IanSudbery commented 9 months ago

Hi Nordin,

This error message actually comes from the stats generation part of the code, but it is happening because no UMIs are found.

No UMIs are being found because of the use of --umi-tag. --umi-tag is designed to be used when your UMI is encoded as a SAM tag in the last column of the entry - e.g. UX:x:TGGCATC as in

SRR5335818:1949915:1949915:1 0 1 3133576 0 44M * 0 0 TTTATTTATTTATTTATTATACGTAAGTACACTGTAGCTGTCTT GGGGGGGGHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHH NH:i:5 HI:i:1 AS:i:43 nM:i:0 UX:x:TGGCATC

The alternative to putting the UMI in a SAM tag (and the default behaviour), that the UMI is extracted as the last part of the read name (separated by the UMI separator). For example:

SRR5335818:1949915:1949915:1_TGGCATC 0 1 3133576 0 44M * 0 0 TTTATTTATTTATTTATTATACGTAAGTACACTGTAGCTGTCTT GGGGGGGGHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHH NH:i:5 HI:i:1 AS:i:43 nM:i:0 

What is the geometry of the sequences at the end of your read name? There appear to be three of them after RANDOM:BARCODE:DDD.

NordinZandhuis commented 9 months ago

Thanks a lot for the reply and the explanation. I have restructured the header, making it compatible with umi_tools.