CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
491 stars 190 forks source link

AssertionError: not all umis are the same length(!): 6 - 7 #507

Closed Matt-Schmitz closed 2 years ago

Matt-Schmitz commented 2 years ago

Hi, I'm completely new to bioinformatical analysis, so I hope this is enough information to diagnose my problem.

I got the following error after running umi_tools dedup -I input.bam --output-stats=deduplicated -S output.bam

2022-01-22 18:22:11,452 INFO total_umis 8084052
2022-01-22 18:22:11,452 INFO #umis 6586011
Traceback (most recent call last):
  File "/home/matt/anaconda3/envs/umitools/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/matt/anaconda3/envs/umitools/lib/python3.7/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/matt/anaconda3/envs/umitools/lib/python3.7/site-packages/umi_tools/dedup.py", line 331, in main
    threshold=options.threshold)
  File "/home/matt/anaconda3/envs/umitools/lib/python3.7/site-packages/umi_tools/network.py", line 419, in __call__
    clusters = self.UMIClusterer(counts, threshold)
  File "/home/matt/anaconda3/envs/umitools/lib/python3.7/site-packages/umi_tools/network.py", line 369, in __call__
    min(len_umis), max(len_umis)))
AssertionError: not all umis are the same length(!):  6 - 7

Head of BAM file (samtools view input.bam | head -n 6):

6073959 0       1       3125394 3       28M     *       0       0       AACAACAACAACAACAACAACGTATCTA    *       nM:i:2  MD:Z:25A1G0     NH:i:2
5614728 0       1       3567528 255     31M     *       0       0       TCTCCCCCCCCCCCCGTTAAAAAATTTATGT *       nM:i:7  MD:Z:18T0T0T2T3G1T0C0   NH:i:1
4404284 272     1       3717890 3       22M     *       0       0       TTTGAATTGATCATGGTAGTTG  *       nM:i:5  MD:Z:0C0C0C0T1C16       NH:i:2
623104  0       1       3741776 255     24M     *       0       0       AGTGGTGGCCATGTTTTTTTAAGG        *       nM:i:4  MD:Z:16G1G1C2A0 NH:i:1
3557856 0       1       3743254 255     46M     *       0       0       AGTGGTTGGTTTTTTTTTATTTTTTTGTTTTTGGTTATGTTGAAAA  *       nM:i:9  MD:Z:23G2C1C4C0C0C1G1G4G1       NH:i:1
3947152 0       1       3819739 255     25M     *       0       0       TATGAGAGAGATAGAGAGGGAAAGA       *       nM:i:3  MD:Z:0A0T0A22   NH:i:1

Head of fastq file:

@NB500937:552:HKNVMAFX2:1:11101:17965:1104_ACGGCC 1:N:0:GATCACAT+NTGGCGGT
CATCAACTTCCCACTGTACACCACCACATCAATCAAATTCTCCTTCATTATTAGCCTCTTAC
+
EEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEE
@NB500937:552:HKNVMAFX2:1:11101:11576:1104_ATGGAA 1:N:0:GATCACAT+NTGGCGGT
CGCGGTTCTATTTTGTTGGTTTTCGGAACTGAGGCCATGATTAAGAGGGACGGCCGGGGGCATTCGTATTGCGCCGCTAGAGGTGAAATTCTTGGACCGGCGCAAGACG
+
EEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEAEEEEAEEE<EEA/A/EE<A/AEE/EEAEEEE/EEEEEEEE/EAAAEEEE<AEEE//AAE<EEAEE<AE
@NB500937:552:HKNVMAFX2:1:11101:7931:1105_AAGAAC 1:N:0:GATCACAT+NTGGCGGT
GAAAGATGAGTGAGGCTGCAGCCAAATGGTCTGCCTTTTATGAAGAACAGACTAAGACTGCACAAAGTTTCTCACTACAAGAAATCCA

Here my UMIs are ACGGCC, ATGGAA, and AAGAAC. I put the first 100 000 lines of the fastq file in a text document and couldn't find any UMIs of length 7. I also don't understand how UMIs are stored in BAM files.

TomSmithCGAT commented 2 years ago

Hi Matthius.

Thanks for the nicely detailed issue 😄

UMI-tools dedup command extracts the UMI from the query name (first field), taking the last element after splitting with '_' as the delimeter. The aligner should use the name from the fastq as the query name, e.g @NB500937:552:HKNVMAFX2:1:11101:17965:1104_ACGGCC -> UMI = ACGGCC for your first read in the fastq above. However, you seem to have numerical query names, e.g 6073959 in the first record in the BAM above.

This is why UMI-tools is complaining about different UMI lengths, as it's using the whole of that query name as the UMI.

What aligner did you use and were there any other steps inbetween alignment and dedup?

Matt-Schmitz commented 2 years ago

Hi Tom,

Thank you for your quick reply! I used the STAR aligner. Do you know how to get STAR to take the UMIs into account in bulk RNA-Seq? Here are the steps I took:

I used a tool called GEDI (https://github.com/erhard-lab/gedi) to prepare the mouse genome for indexing (and the spike-in control (ERCC)).

I created the STAR index: nohup STAR --runMode genomeGenerate --runThreadN 8 --genomeDir mus_ERCC_index --genomeFastaFiles /home/matt/data/test/mus_musculus.101.fasta /home/matt/data/test/ERCC92.fa --sjdbGTFfile mus_ERCC.gtf &

I created a pipeline for the mapping using GEDI and then ran the bash file that GEDI created which contains the following command: STAR --runMode alignReads --runThreadN 8 --genomeDir /home/matt/data/mus_ERCC_index --limitBAMsortRAM 8000000000 --outFilterMismatchNmax 20 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --readFilesIn C01__1_IR.fastq --outSAMmode NoQS --outSAMtype BAM SortedByCoordinate --alignEndsType EndToEnd --outSAMattributes nM MD NH

TomSmithCGAT commented 2 years ago

I can't see any STAR command there that should remove the query name. Are there other commands in the script after STAR? Assuming there is nothing confidential in there, you can share the whole bash script here

Matt-Schmitz commented 2 years ago

start.bash:

#!/bin/bash

echo Starting V01__1_IR
/home/matt/data/./mapping/scripts/V01__1_IR.bash >> /home/matt/data/./mapping/log/V01__1_IR.out 2>> /home/matt/data/./mapping/log/V01__1_IR.err

echo Starting C01__1_IR
/home/matt/data/./mapping/scripts/C01__1_IR.bash >> /home/matt/data/./mapping/log/C01__1_IR.out 2>> /home/matt/data/./mapping/log/C01__1_IR.err

echo Starting V05__1_IR
/home/matt/data/./mapping/scripts/V05__1_IR.bash >> /home/matt/data/./mapping/log/V05__1_IR.out 2>> /home/matt/data/./mapping/log/V05__1_IR.err

echo Starting C05__1_IR
/home/matt/data/./mapping/scripts/C05__1_IR.bash >> /home/matt/data/./mapping/log/C05__1_IR.out 2>> /home/matt/data/./mapping/log/C05__1_IR.err

echo Starting V10__1_IR
/home/matt/data/./mapping/scripts/V10__1_IR.bash >> /home/matt/data/./mapping/log/V10__1_IR.out 2>> /home/matt/data/./mapping/log/V10__1_IR.err

echo Starting C10__1_IR
/home/matt/data/./mapping/scripts/C10__1_IR.bash >> /home/matt/data/./mapping/log/C10__1_IR.out 2>> /home/matt/data/./mapping/log/C10__1_IR.err

echo Starting V25__1_IR
/home/matt/data/./mapping/scripts/V25__1_IR.bash >> /home/matt/data/./mapping/log/V25__1_IR.out 2>> /home/matt/data/./mapping/log/V25__1_IR.err

echo Starting C25__1_IR
/home/matt/data/./mapping/scripts/C25__1_IR.bash >> /home/matt/data/./mapping/log/C25__1_IR.out 2>> /home/matt/data/./mapping/log/C25__1_IR.err

echo Starting C25__2
/home/matt/data/./mapping/scripts/C25__2.bash >> /home/matt/data/./mapping/log/C25__2.out 2>> /home/matt/data/./mapping/log/C25__2.err

echo Starting C01__2
/home/matt/data/./mapping/scripts/C01__2.bash >> /home/matt/data/./mapping/log/C01__2.out 2>> /home/matt/data/./mapping/log/C01__2.err

echo Starting V01__2
/home/matt/data/./mapping/scripts/V01__2.bash >> /home/matt/data/./mapping/log/V01__2.out 2>> /home/matt/data/./mapping/log/V01__2.err

echo Starting C25__3
/home/matt/data/./mapping/scripts/C25__3.bash >> /home/matt/data/./mapping/log/C25__3.out 2>> /home/matt/data/./mapping/log/C25__3.err

echo Starting C05__2
/home/matt/data/./mapping/scripts/C05__2.bash >> /home/matt/data/./mapping/log/C05__2.out 2>> /home/matt/data/./mapping/log/C05__2.err

echo Starting V10__2
/home/matt/data/./mapping/scripts/V10__2.bash >> /home/matt/data/./mapping/log/V10__2.out 2>> /home/matt/data/./mapping/log/V10__2.err

echo Starting C10__2
/home/matt/data/./mapping/scripts/C10__2.bash >> /home/matt/data/./mapping/log/C10__2.out 2>> /home/matt/data/./mapping/log/C10__2.err

echo Starting C10__3
/home/matt/data/./mapping/scripts/C10__3.bash >> /home/matt/data/./mapping/log/C10__3.out 2>> /home/matt/data/./mapping/log/C10__3.err

echo Starting V05__2
/home/matt/data/./mapping/scripts/V05__2.bash >> /home/matt/data/./mapping/log/V05__2.out 2>> /home/matt/data/./mapping/log/V05__2.err

echo Starting C05__3
/home/matt/data/./mapping/scripts/C05__3.bash >> /home/matt/data/./mapping/log/C05__3.out 2>> /home/matt/data/./mapping/log/C05__3.err

echo Starting V25__2
/home/matt/data/./mapping/scripts/V25__2.bash >> /home/matt/data/./mapping/log/V25__2.out 2>> /home/matt/data/./mapping/log/V25__2.err

echo Starting V10__3
/home/matt/data/./mapping/scripts/V10__3.bash >> /home/matt/data/./mapping/log/V10__3.out 2>> /home/matt/data/./mapping/log/V10__3.err

echo Starting V05__3
/home/matt/data/./mapping/scripts/V05__3.bash >> /home/matt/data/./mapping/log/V05__3.out 2>> /home/matt/data/./mapping/log/V05__3.err

echo Starting C01__3
/home/matt/data/./mapping/scripts/C01__3.bash >> /home/matt/data/./mapping/log/C01__3.out 2>> /home/matt/data/./mapping/log/C01__3.err

echo Starting V01__3
/home/matt/data/./mapping/scripts/V01__3.bash >> /home/matt/data/./mapping/log/V01__3.out 2>> /home/matt/data/./mapping/log/V01__3.err

echo Starting V25__3
/home/matt/data/./mapping/scripts/V25__3.bash >> /home/matt/data/./mapping/log/V25__3.out 2>> /home/matt/data/./mapping/log/V25__3.err

echo Starting GF_C05
/home/matt/data/./mapping/scripts/GF_C05.bash >> /home/matt/data/./mapping/log/GF_C05.out 2>> /home/matt/data/./mapping/log/GF_C05.err

echo Starting GF_V05
/home/matt/data/./mapping/scripts/GF_V05.bash >> /home/matt/data/./mapping/log/GF_V05.out 2>> /home/matt/data/./mapping/log/GF_V05.err

echo Starting I09_761
/home/matt/data/./mapping/scripts/I09_761.bash >> /home/matt/data/./mapping/log/I09_761.out 2>> /home/matt/data/./mapping/log/I09_761.err

echo Starting I09_764
/home/matt/data/./mapping/scripts/I09_764.bash >> /home/matt/data/./mapping/log/I09_764.out 2>> /home/matt/data/./mapping/log/I09_764.err

echo Starting U09__2
/home/matt/data/./mapping/scripts/U09__2.bash >> /home/matt/data/./mapping/log/U09__2.out 2>> /home/matt/data/./mapping/log/U09__2.err

echo Starting U09__3
/home/matt/data/./mapping/scripts/U09__3.bash >> /home/matt/data/./mapping/log/U09__3.out 2>> /home/matt/data/./mapping/log/U09__3.err

echo Starting I09_762
/home/matt/data/./mapping/scripts/I09_762.bash >> /home/matt/data/./mapping/log/I09_762.out 2>> /home/matt/data/./mapping/log/I09_762.err

echo Starting I09_757
/home/matt/data/./mapping/scripts/I09_757.bash >> /home/matt/data/./mapping/log/I09_757.out 2>> /home/matt/data/./mapping/log/I09_757.err

echo Starting GF_V10
/home/matt/data/./mapping/scripts/GF_V10.bash >> /home/matt/data/./mapping/log/GF_V10.out 2>> /home/matt/data/./mapping/log/GF_V10.err

echo Starting GF_C10
/home/matt/data/./mapping/scripts/GF_C10.bash >> /home/matt/data/./mapping/log/GF_C10.out 2>> /home/matt/data/./mapping/log/GF_C10.err

echo Starting I09inf_766
/home/matt/data/./mapping/scripts/I09inf_766.bash >> /home/matt/data/./mapping/log/I09inf_766.out 2>> /home/matt/data/./mapping/log/I09inf_766.err

echo Starting I09uninf_766
/home/matt/data/./mapping/scripts/I09uninf_766.bash >> /home/matt/data/./mapping/log/I09uninf_766.out 2>> /home/matt/data/./mapping/log/I09uninf_766.err

echo Starting U09__1
/home/matt/data/./mapping/scripts/U09__1.bash >> /home/matt/data/./mapping/log/U09__1.out 2>> /home/matt/data/./mapping/log/U09__1.err

echo Starting I09_766
/home/matt/data/./mapping/scripts/I09_766.bash >> /home/matt/data/./mapping/log/I09_766.out 2>> /home/matt/data/./mapping/log/I09_766.err

echo Starting GF_V25_IRMS
/home/matt/data/./mapping/scripts/GF_V25_IRMS.bash >> /home/matt/data/./mapping/log/GF_V25_IRMS.out 2>> /home/matt/data/./mapping/log/GF_V25_IRMS.err

echo Starting GF_V25_IR
/home/matt/data/./mapping/scripts/GF_V25_IR.bash >> /home/matt/data/./mapping/log/GF_V25_IR.out 2>> /home/matt/data/./mapping/log/GF_V25_IR.err

echo Starting GF_C25_IR
/home/matt/data/./mapping/scripts/GF_C25_IR.bash >> /home/matt/data/./mapping/log/GF_C25_IR.out 2>> /home/matt/data/./mapping/log/GF_C25_IR.err

echo Starting GF_V01_IR
/home/matt/data/./mapping/scripts/GF_V01_IR.bash >> /home/matt/data/./mapping/log/GF_V01_IR.out 2>> /home/matt/data/./mapping/log/GF_V01_IR.err

echo Starting GF_C01_IR
/home/matt/data/./mapping/scripts/GF_C01_IR.bash >> /home/matt/data/./mapping/log/GF_C01_IR.out 2>> /home/matt/data/./mapping/log/GF_C01_IR.err

echo Starting mapping.merge
gedi -e MergeCIT -c /home/matt/data/./mapping/mapping.cit /home/matt/data/./mapping/V01__1_IR.cit /home/matt/data/./mapping/C01__1_IR.cit /home/matt/data/./mapping/V05__1_IR.cit /home/matt/data/./mapping/C05__1_IR.cit /home/matt/data/./mapping/V10__1_IR.cit /home/matt/data/./mapping/C10__1_IR.cit /home/matt/data/./mapping/V25__1_IR.cit /home/matt/data/./mapping/C25__1_IR.cit /home/matt/data/./mapping/C25__2.cit /home/matt/data/./mapping/C01__2.cit /home/matt/data/./mapping/V01__2.cit /home/matt/data/./mapping/C25__3.cit /home/matt/data/./mapping/C05__2.cit /home/matt/data/./mapping/V10__2.cit /home/matt/data/./mapping/C10__2.cit /home/matt/data/./mapping/C10__3.cit /home/matt/data/./mapping/V05__2.cit /home/matt/data/./mapping/C05__3.cit /home/matt/data/./mapping/V25__2.cit /home/matt/data/./mapping/V10__3.cit /home/matt/data/./mapping/V05__3.cit /home/matt/data/./mapping/C01__3.cit /home/matt/data/./mapping/V01__3.cit /home/matt/data/./mapping/V25__3.cit /home/matt/data/./mapping/GF_C05.cit /home/matt/data/./mapping/GF_V05.cit /home/matt/data/./mapping/I09_761.cit /home/matt/data/./mapping/I09_764.cit /home/matt/data/./mapping/U09__2.cit /home/matt/data/./mapping/U09__3.cit /home/matt/data/./mapping/I09_762.cit /home/matt/data/./mapping/I09_757.cit /home/matt/data/./mapping/GF_V10.cit /home/matt/data/./mapping/GF_C10.cit /home/matt/data/./mapping/I09inf_766.cit /home/matt/data/./mapping/I09uninf_766.cit /home/matt/data/./mapping/U09__1.cit /home/matt/data/./mapping/I09_766.cit /home/matt/data/./mapping/GF_V25_IRMS.cit /home/matt/data/./mapping/GF_V25_IR.cit /home/matt/data/./mapping/GF_C25_IR.cit /home/matt/data/./mapping/GF_V01_IR.cit /home/matt/data/./mapping/GF_C01_IR.cit  >> /home/matt/data/./mapping/log/mapping.merge.out 2>> /home/matt/data/./mapping/log/mapping.merge.err

#!/bin/bash

echo Starting mapping.report
gedi -e Stats -prefix /home/matt/data/./mapping/report/mapping. -g  m_musculus.ens101 ERCC92 /home/matt/data/./mapping/mapping.cit >> /home/matt/data/./mapping/log/mapping.report.out 2>> /home/matt/data/./mapping/log/mapping.report.err
mkdir -p /home/matt/data/./mapping/counts
echo Starting mapping.count
gedi -e Stats -prefix /home/matt/data/./mapping/counts/mapping. -g  m_musculus.ens101 ERCC92 -count /home/matt/data/./mapping/mapping.cit >> /home/matt/data/./mapping/log/mapping.count.out 2>> /home/matt/data/./mapping/log/mapping.count.err

V01__1_IR.bash

#!/bin/bash

mkdir -p /tmp/V01__1_IR
cd /tmp/V01__1_IR

    zcat /home/matt/data/raw/round0/round3/V01_1__3.fq.gz > V01__1_IR.fastq

echo -e "Category\tCount" > V01__1_IR.reads.tsv
echo -ne "All\t" >> V01__1_IR.reads.tsv

L=`wc -l V01__1_IR.fastq | head -n1 | awk '{print $1}'`
leftreads=$((L / 4))
echo $leftreads >> V01__1_IR.reads.tsv

gedi -t . -e FastqFilter -D -overwrite -ld V01__1_IR.readlengths.tsv -min 18 V01__1_IR.fastq

    echo -ne "Trimmed\t" >> V01__1_IR.reads.tsv
    L=`wc -l V01__1_IR.fastq | head -n1 | awk '{print $1}'`
    leftreads=$((L / 4))
    echo $leftreads >> V01__1_IR.reads.tsv

# mapping
STAR --runMode alignReads --runThreadN 8   --genomeDir /home/matt/data/mus_ERCC_index  --limitBAMsortRAM 8000000000 --outFilterMismatchNmax 20 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --readFilesIn V01__1_IR.fastq --outSAMmode NoQS --outSAMtype BAM SortedByCoordinate --alignEndsType EndToEnd --outSAMattributes nM MD NH  
mv Aligned.sortedByCoord.out.bam V01__1_IR.bam
echo -ne "Unique\t" >> V01__1_IR.reads.tsv
grep "Uniquely mapped reads number"  Log.final.out | cut -f2 -d'|' | awk '{ print $1}' >> V01__1_IR.reads.tsv
echo -ne "Multi\t" >> V01__1_IR.reads.tsv
grep "Number of reads mapped to multiple loci"  Log.final.out | cut -f2 -d'|' | awk '{ print $1}' >> V01__1_IR.reads.tsv

mkdir -p /home/matt/data/./mapping/report

samtools index V01__1_IR.bam

echo -ne "Mitochondrial\t" >> V01__1_IR.reads.tsv
gedi -t . -e Bam2CIT  V01__1_IR.cit V01__1_IR.bam | tail -n1 | awk '{ print $2 }'>> V01__1_IR.reads.tsv
gedi -t . -e CorrectCIT V01__1_IR.cit
echo -ne "CIT\t" >> V01__1_IR.reads.tsv
gedi -t . -e ReadCount -g m_musculus.ens101 ERCC92 -m Weight V01__1_IR.cit | tail -n1 | awk '{ print $2 }'>> V01__1_IR.reads.tsv

mv *.readlengths.* /home/matt/data/./mapping/report

mv V01__1_IR.reads.tsv /home/matt/data/./mapping/report

if [ -f V01__1_IR.cit ]; then
   mv V01__1_IR.cit* /home/matt/data/./mapping
   mkdir -p /home/matt/data/./mapping/bams
   mv V01__1_IR.bam* /home/matt/data/./mapping/bams
   cd /home/matt/data/./mapping
   rm -rf /tmp/V01__1_IR
else
   (>&2 echo "There were some errors, did not delete temp directory!")
   cd /home/matt/data/./mapping
fi

Rscript report/V01__1_IR.reads.R
echo '{"plots":[{"section":"Mapping statistics","id":"mappingV01__1_IR","title":"V01__1_IR","description":"How many reads are removed by adapter trimming and rRNA mapping, how many are mapped to which reference and retained overall.","img":"V01__1_IR.reads.png","script":"V01__1_IR.reads.R","csv":"V01__1_IR.reads.tsv"}]}' > report/V01__1_IR.reads.report.json 
TomSmithCGAT commented 2 years ago

Looks like the issue may be pre-STAR.

I've not come across gedi before, but it sounds like their pipeline requires the read queries to be integers: https://github.com/erhard-lab/price/issues/3

I'm not at all proficient in Java, but it from the source code, it appears the FastqFilter tool 'reindexes reads', but can also extract UMIs (-umi) and append these to the read names. If so, you will need to use gedi to perform the UMI extraction in order for the resultant BAMs to be compatible with umi_tools dedup https://github.com/erhard-lab/gedi/blob/cb99bb6622bb14be24dfa229409fd67d74d83657/GediPreprocess/src/executables/FastqFilter.java#L80

I can't find any documentation to confirm this though. Perhaps raise an issue here https://github.com/erhard-lab/gedi and link to this one.

Matt-Schmitz commented 2 years ago

Thank you very much!