Closed NordinZandhuis closed 9 months ago
Hi Nordin,
This error message actually comes from the stats generation part of the code, but it is happening because no UMIs are found.
No UMIs are being found because of the use of --umi-tag
. --umi-tag
is designed to be used when your UMI is encoded as a SAM tag in the last column of the entry - e.g. UX:x:TGGCATC
as in
SRR5335818:1949915:1949915:1 0 1 3133576 0 44M * 0 0 TTTATTTATTTATTTATTATACGTAAGTACACTGTAGCTGTCTT GGGGGGGGHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHH NH:i:5 HI:i:1 AS:i:43 nM:i:0 UX:x:TGGCATC
The alternative to putting the UMI in a SAM tag (and the default behaviour), that the UMI is extracted as the last part of the read name (separated by the UMI separator). For example:
SRR5335818:1949915:1949915:1_TGGCATC 0 1 3133576 0 44M * 0 0 TTTATTTATTTATTTATTATACGTAAGTACACTGTAGCTGTCTT GGGGGGGGHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHH NH:i:5 HI:i:1 AS:i:43 nM:i:0
What is the geometry of the sequences at the end of your read name? There appear to be three of them after RANDOM:BARCODE:DDD.
Thanks a lot for the reply and the explanation. I have restructured the header, making it compatible with umi_tools.
Hello!
Many thanks for making this package.
I have run into an error while trying to run the dedup function.
My bam files are indexed using Samtools and are structured like this:
SRR5335818:2378778:2378778:1:RANDOM:BARCODE:DDD_ATTGGCA:ATCG:ATA 272 1 3083242 0 18M2S 0 0 TTTTTTTTTTGTTTAACTGT ?EE0?GE00B1D3BGGF1GF NH:i:10 HI:i:4 AS:i:17 nM:i:0 SRR5335818:1873368:1873368:1:RANDOM:BARCODE:DDD_ACTAACA:ATCG:CTT 256 1 3133576 0 39M 0 0 TTTATTTATTTATTTATTATACGTAAGTACACTGTAGCT GGGGGGGGHHHHHHHHHHHHHFHHHHHHHHHHHHHHHHH NH:i:7 HI:i:2 AS:i:38 nM:i:0 SRR5335818:1886451:1886451:1:RANDOM:BARCODE:DDD_ACTAACA:ATCG:CTT 256 1 3133576 0 39M 0 0 TTTATTTATTTATTTATTATACGTAAGTACACTGTAGCT GGGGGGGGHHHHHHHHHHHHHFHHHHHHHHHHHHHHHHH NH:i:7 HI:i:2 AS:i:38 nM:i:0 SRR5335818:1949915:1949915:1:RANDOM:BARCODE:DDD_TGGCATC:ATCG:ATC 0 1 3133576 0 44M 0 0 TTTATTTATTTATTTATTATACGTAAGTACACTGTAGCTGTCTT GGGGGGGGHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHH NH:i:5 HI:i:1 AS:i:43 nM:i:0 SRR5335818:1647744:1647744:1:RANDOM:BARCODE:DDD_AGTTCGT:ATCG:ACT 256 1 3172271 1 26M 0 0 TGAACGATACAGAGAAGATTAGCATG 1BFEFACGFEHHHGFHHGHHG1FHG1 NH:i:3 HI:i:3 AS:i:25 nM:i:0 SRR5335818:1674043:1674043:1:RANDOM:BARCODE:DDD_AGTTCGT:ATCG:ACT 256 1 3172271 1 26M 0 0 TGAACGATACAGAGAAGATTAGCATG 11BGGGGGGGHHHHHHHHHHHGHHHH NH:i:3 HI:i:3 AS:i:25 nM:i:0 SRR5335818:3058822:3058822:1:RANDOM:BARCODE:DDD_TTGTTTC:ATCG:CCA 0 1 3238747 3 39M 0 0 TATTTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA FGFGFDFGHHHHGGGGGGGGGGGGGGGGGGGGGGGGGGG NH:i:2 HI:i:1 AS:i:38 nM:i:0 SRR5335818:201076:201076:1:RANDOM:BARCODE:DDD_TACCCTT:ATCG:CCT 256 1 3269562 3 18M2S 0 0 TTTACAGTGGTTCTAAAAAA EGGGGGGGHHHGHHHHHGHC NH:i:2 HI:i:1 AS:i:17 nM:i:0 SRR5335818:267434:267434:1:RANDOM:BARCODE:DDD_TACCCTT:ATCG:CCT 256 1 3269562 3 18M2S 0 0 TTTACAGTGGTTCTAAAAAA GGGGGGGGHHGGHHHHHHHG NH:i:2 HI:i:1 AS:i:17 nM:i:0 SRR5335818:296023:296023:1:RANDOM:BARCODE:DDD_TACCCTT:ATCG:CCT 256 1 3269562 3 18M2S 0 0 TTTACAGTGGTTCTAAAAAA GGGGGGGGHHGGHHHHHHHG NH:i:2 HI:i:1 AS:i:17 nM:i:0
The UMIs are called RANDOM and are the first 7 nucleotides after: RANDOM:BARCODE:DDD_
I am trying to run dedup on these BAM files using the following code:
for f in .bam do prefix="${f%.bam}" srr_part=$(echo "$prefix" | grep -o -e 'SRR[^.]')
echo "Busy with ${f}"
umitools dedup --stdin=${f} --extract-umi-method=tag --umi-separator="" --umi-tag="RANDOM:BARCODE:DDD" --umi-tag-split=":" --output-stats=${f} --log="$srr_part"LOGFILE -S deduplicated${f}
I unfortunately get the following error:
ValueError: 'a' cannot be empty unless no samples are taken
It seems like the UMIs are not able to be detected by umi tools, as the log file mentions: INFO total_umis 0
Some help would be greatly appreciated!
Many thanks,
Nordin