Closed Matt-Schmitz closed 2 years ago
Hi Matthius.
Thanks for the nicely detailed issue 😄
UMI-tools dedup
command extracts the UMI from the query name (first field), taking the last element after splitting with '_' as the delimeter. The aligner should use the name from the fastq as the query name, e.g @NB500937:552:HKNVMAFX2:1:11101:17965:1104_ACGGCC
-> UMI = ACGGCC
for your first read in the fastq above. However, you seem to have numerical query names, e.g 6073959
in the first record in the BAM above.
This is why UMI-tools is complaining about different UMI lengths, as it's using the whole of that query name as the UMI.
What aligner did you use and were there any other steps inbetween alignment and dedup
?
Hi Tom,
Thank you for your quick reply! I used the STAR aligner. Do you know how to get STAR to take the UMIs into account in bulk RNA-Seq? Here are the steps I took:
I used a tool called GEDI (https://github.com/erhard-lab/gedi) to prepare the mouse genome for indexing (and the spike-in control (ERCC)).
I created the STAR index:
nohup STAR --runMode genomeGenerate --runThreadN 8 --genomeDir mus_ERCC_index --genomeFastaFiles /home/matt/data/test/mus_musculus.101.fasta /home/matt/data/test/ERCC92.fa --sjdbGTFfile mus_ERCC.gtf &
I created a pipeline for the mapping using GEDI and then ran the bash file that GEDI created which contains the following command:
STAR --runMode alignReads --runThreadN 8 --genomeDir /home/matt/data/mus_ERCC_index --limitBAMsortRAM 8000000000 --outFilterMismatchNmax 20 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --readFilesIn C01__1_IR.fastq --outSAMmode NoQS --outSAMtype BAM SortedByCoordinate --alignEndsType EndToEnd --outSAMattributes nM MD NH
I can't see any STAR command there that should remove the query name. Are there other commands in the script after STAR? Assuming there is nothing confidential in there, you can share the whole bash script here
start.bash:
#!/bin/bash
echo Starting V01__1_IR
/home/matt/data/./mapping/scripts/V01__1_IR.bash >> /home/matt/data/./mapping/log/V01__1_IR.out 2>> /home/matt/data/./mapping/log/V01__1_IR.err
echo Starting C01__1_IR
/home/matt/data/./mapping/scripts/C01__1_IR.bash >> /home/matt/data/./mapping/log/C01__1_IR.out 2>> /home/matt/data/./mapping/log/C01__1_IR.err
echo Starting V05__1_IR
/home/matt/data/./mapping/scripts/V05__1_IR.bash >> /home/matt/data/./mapping/log/V05__1_IR.out 2>> /home/matt/data/./mapping/log/V05__1_IR.err
echo Starting C05__1_IR
/home/matt/data/./mapping/scripts/C05__1_IR.bash >> /home/matt/data/./mapping/log/C05__1_IR.out 2>> /home/matt/data/./mapping/log/C05__1_IR.err
echo Starting V10__1_IR
/home/matt/data/./mapping/scripts/V10__1_IR.bash >> /home/matt/data/./mapping/log/V10__1_IR.out 2>> /home/matt/data/./mapping/log/V10__1_IR.err
echo Starting C10__1_IR
/home/matt/data/./mapping/scripts/C10__1_IR.bash >> /home/matt/data/./mapping/log/C10__1_IR.out 2>> /home/matt/data/./mapping/log/C10__1_IR.err
echo Starting V25__1_IR
/home/matt/data/./mapping/scripts/V25__1_IR.bash >> /home/matt/data/./mapping/log/V25__1_IR.out 2>> /home/matt/data/./mapping/log/V25__1_IR.err
echo Starting C25__1_IR
/home/matt/data/./mapping/scripts/C25__1_IR.bash >> /home/matt/data/./mapping/log/C25__1_IR.out 2>> /home/matt/data/./mapping/log/C25__1_IR.err
echo Starting C25__2
/home/matt/data/./mapping/scripts/C25__2.bash >> /home/matt/data/./mapping/log/C25__2.out 2>> /home/matt/data/./mapping/log/C25__2.err
echo Starting C01__2
/home/matt/data/./mapping/scripts/C01__2.bash >> /home/matt/data/./mapping/log/C01__2.out 2>> /home/matt/data/./mapping/log/C01__2.err
echo Starting V01__2
/home/matt/data/./mapping/scripts/V01__2.bash >> /home/matt/data/./mapping/log/V01__2.out 2>> /home/matt/data/./mapping/log/V01__2.err
echo Starting C25__3
/home/matt/data/./mapping/scripts/C25__3.bash >> /home/matt/data/./mapping/log/C25__3.out 2>> /home/matt/data/./mapping/log/C25__3.err
echo Starting C05__2
/home/matt/data/./mapping/scripts/C05__2.bash >> /home/matt/data/./mapping/log/C05__2.out 2>> /home/matt/data/./mapping/log/C05__2.err
echo Starting V10__2
/home/matt/data/./mapping/scripts/V10__2.bash >> /home/matt/data/./mapping/log/V10__2.out 2>> /home/matt/data/./mapping/log/V10__2.err
echo Starting C10__2
/home/matt/data/./mapping/scripts/C10__2.bash >> /home/matt/data/./mapping/log/C10__2.out 2>> /home/matt/data/./mapping/log/C10__2.err
echo Starting C10__3
/home/matt/data/./mapping/scripts/C10__3.bash >> /home/matt/data/./mapping/log/C10__3.out 2>> /home/matt/data/./mapping/log/C10__3.err
echo Starting V05__2
/home/matt/data/./mapping/scripts/V05__2.bash >> /home/matt/data/./mapping/log/V05__2.out 2>> /home/matt/data/./mapping/log/V05__2.err
echo Starting C05__3
/home/matt/data/./mapping/scripts/C05__3.bash >> /home/matt/data/./mapping/log/C05__3.out 2>> /home/matt/data/./mapping/log/C05__3.err
echo Starting V25__2
/home/matt/data/./mapping/scripts/V25__2.bash >> /home/matt/data/./mapping/log/V25__2.out 2>> /home/matt/data/./mapping/log/V25__2.err
echo Starting V10__3
/home/matt/data/./mapping/scripts/V10__3.bash >> /home/matt/data/./mapping/log/V10__3.out 2>> /home/matt/data/./mapping/log/V10__3.err
echo Starting V05__3
/home/matt/data/./mapping/scripts/V05__3.bash >> /home/matt/data/./mapping/log/V05__3.out 2>> /home/matt/data/./mapping/log/V05__3.err
echo Starting C01__3
/home/matt/data/./mapping/scripts/C01__3.bash >> /home/matt/data/./mapping/log/C01__3.out 2>> /home/matt/data/./mapping/log/C01__3.err
echo Starting V01__3
/home/matt/data/./mapping/scripts/V01__3.bash >> /home/matt/data/./mapping/log/V01__3.out 2>> /home/matt/data/./mapping/log/V01__3.err
echo Starting V25__3
/home/matt/data/./mapping/scripts/V25__3.bash >> /home/matt/data/./mapping/log/V25__3.out 2>> /home/matt/data/./mapping/log/V25__3.err
echo Starting GF_C05
/home/matt/data/./mapping/scripts/GF_C05.bash >> /home/matt/data/./mapping/log/GF_C05.out 2>> /home/matt/data/./mapping/log/GF_C05.err
echo Starting GF_V05
/home/matt/data/./mapping/scripts/GF_V05.bash >> /home/matt/data/./mapping/log/GF_V05.out 2>> /home/matt/data/./mapping/log/GF_V05.err
echo Starting I09_761
/home/matt/data/./mapping/scripts/I09_761.bash >> /home/matt/data/./mapping/log/I09_761.out 2>> /home/matt/data/./mapping/log/I09_761.err
echo Starting I09_764
/home/matt/data/./mapping/scripts/I09_764.bash >> /home/matt/data/./mapping/log/I09_764.out 2>> /home/matt/data/./mapping/log/I09_764.err
echo Starting U09__2
/home/matt/data/./mapping/scripts/U09__2.bash >> /home/matt/data/./mapping/log/U09__2.out 2>> /home/matt/data/./mapping/log/U09__2.err
echo Starting U09__3
/home/matt/data/./mapping/scripts/U09__3.bash >> /home/matt/data/./mapping/log/U09__3.out 2>> /home/matt/data/./mapping/log/U09__3.err
echo Starting I09_762
/home/matt/data/./mapping/scripts/I09_762.bash >> /home/matt/data/./mapping/log/I09_762.out 2>> /home/matt/data/./mapping/log/I09_762.err
echo Starting I09_757
/home/matt/data/./mapping/scripts/I09_757.bash >> /home/matt/data/./mapping/log/I09_757.out 2>> /home/matt/data/./mapping/log/I09_757.err
echo Starting GF_V10
/home/matt/data/./mapping/scripts/GF_V10.bash >> /home/matt/data/./mapping/log/GF_V10.out 2>> /home/matt/data/./mapping/log/GF_V10.err
echo Starting GF_C10
/home/matt/data/./mapping/scripts/GF_C10.bash >> /home/matt/data/./mapping/log/GF_C10.out 2>> /home/matt/data/./mapping/log/GF_C10.err
echo Starting I09inf_766
/home/matt/data/./mapping/scripts/I09inf_766.bash >> /home/matt/data/./mapping/log/I09inf_766.out 2>> /home/matt/data/./mapping/log/I09inf_766.err
echo Starting I09uninf_766
/home/matt/data/./mapping/scripts/I09uninf_766.bash >> /home/matt/data/./mapping/log/I09uninf_766.out 2>> /home/matt/data/./mapping/log/I09uninf_766.err
echo Starting U09__1
/home/matt/data/./mapping/scripts/U09__1.bash >> /home/matt/data/./mapping/log/U09__1.out 2>> /home/matt/data/./mapping/log/U09__1.err
echo Starting I09_766
/home/matt/data/./mapping/scripts/I09_766.bash >> /home/matt/data/./mapping/log/I09_766.out 2>> /home/matt/data/./mapping/log/I09_766.err
echo Starting GF_V25_IRMS
/home/matt/data/./mapping/scripts/GF_V25_IRMS.bash >> /home/matt/data/./mapping/log/GF_V25_IRMS.out 2>> /home/matt/data/./mapping/log/GF_V25_IRMS.err
echo Starting GF_V25_IR
/home/matt/data/./mapping/scripts/GF_V25_IR.bash >> /home/matt/data/./mapping/log/GF_V25_IR.out 2>> /home/matt/data/./mapping/log/GF_V25_IR.err
echo Starting GF_C25_IR
/home/matt/data/./mapping/scripts/GF_C25_IR.bash >> /home/matt/data/./mapping/log/GF_C25_IR.out 2>> /home/matt/data/./mapping/log/GF_C25_IR.err
echo Starting GF_V01_IR
/home/matt/data/./mapping/scripts/GF_V01_IR.bash >> /home/matt/data/./mapping/log/GF_V01_IR.out 2>> /home/matt/data/./mapping/log/GF_V01_IR.err
echo Starting GF_C01_IR
/home/matt/data/./mapping/scripts/GF_C01_IR.bash >> /home/matt/data/./mapping/log/GF_C01_IR.out 2>> /home/matt/data/./mapping/log/GF_C01_IR.err
echo Starting mapping.merge
gedi -e MergeCIT -c /home/matt/data/./mapping/mapping.cit /home/matt/data/./mapping/V01__1_IR.cit /home/matt/data/./mapping/C01__1_IR.cit /home/matt/data/./mapping/V05__1_IR.cit /home/matt/data/./mapping/C05__1_IR.cit /home/matt/data/./mapping/V10__1_IR.cit /home/matt/data/./mapping/C10__1_IR.cit /home/matt/data/./mapping/V25__1_IR.cit /home/matt/data/./mapping/C25__1_IR.cit /home/matt/data/./mapping/C25__2.cit /home/matt/data/./mapping/C01__2.cit /home/matt/data/./mapping/V01__2.cit /home/matt/data/./mapping/C25__3.cit /home/matt/data/./mapping/C05__2.cit /home/matt/data/./mapping/V10__2.cit /home/matt/data/./mapping/C10__2.cit /home/matt/data/./mapping/C10__3.cit /home/matt/data/./mapping/V05__2.cit /home/matt/data/./mapping/C05__3.cit /home/matt/data/./mapping/V25__2.cit /home/matt/data/./mapping/V10__3.cit /home/matt/data/./mapping/V05__3.cit /home/matt/data/./mapping/C01__3.cit /home/matt/data/./mapping/V01__3.cit /home/matt/data/./mapping/V25__3.cit /home/matt/data/./mapping/GF_C05.cit /home/matt/data/./mapping/GF_V05.cit /home/matt/data/./mapping/I09_761.cit /home/matt/data/./mapping/I09_764.cit /home/matt/data/./mapping/U09__2.cit /home/matt/data/./mapping/U09__3.cit /home/matt/data/./mapping/I09_762.cit /home/matt/data/./mapping/I09_757.cit /home/matt/data/./mapping/GF_V10.cit /home/matt/data/./mapping/GF_C10.cit /home/matt/data/./mapping/I09inf_766.cit /home/matt/data/./mapping/I09uninf_766.cit /home/matt/data/./mapping/U09__1.cit /home/matt/data/./mapping/I09_766.cit /home/matt/data/./mapping/GF_V25_IRMS.cit /home/matt/data/./mapping/GF_V25_IR.cit /home/matt/data/./mapping/GF_C25_IR.cit /home/matt/data/./mapping/GF_V01_IR.cit /home/matt/data/./mapping/GF_C01_IR.cit >> /home/matt/data/./mapping/log/mapping.merge.out 2>> /home/matt/data/./mapping/log/mapping.merge.err
#!/bin/bash
echo Starting mapping.report
gedi -e Stats -prefix /home/matt/data/./mapping/report/mapping. -g m_musculus.ens101 ERCC92 /home/matt/data/./mapping/mapping.cit >> /home/matt/data/./mapping/log/mapping.report.out 2>> /home/matt/data/./mapping/log/mapping.report.err
mkdir -p /home/matt/data/./mapping/counts
echo Starting mapping.count
gedi -e Stats -prefix /home/matt/data/./mapping/counts/mapping. -g m_musculus.ens101 ERCC92 -count /home/matt/data/./mapping/mapping.cit >> /home/matt/data/./mapping/log/mapping.count.out 2>> /home/matt/data/./mapping/log/mapping.count.err
V01__1_IR.bash
#!/bin/bash
mkdir -p /tmp/V01__1_IR
cd /tmp/V01__1_IR
zcat /home/matt/data/raw/round0/round3/V01_1__3.fq.gz > V01__1_IR.fastq
echo -e "Category\tCount" > V01__1_IR.reads.tsv
echo -ne "All\t" >> V01__1_IR.reads.tsv
L=`wc -l V01__1_IR.fastq | head -n1 | awk '{print $1}'`
leftreads=$((L / 4))
echo $leftreads >> V01__1_IR.reads.tsv
gedi -t . -e FastqFilter -D -overwrite -ld V01__1_IR.readlengths.tsv -min 18 V01__1_IR.fastq
echo -ne "Trimmed\t" >> V01__1_IR.reads.tsv
L=`wc -l V01__1_IR.fastq | head -n1 | awk '{print $1}'`
leftreads=$((L / 4))
echo $leftreads >> V01__1_IR.reads.tsv
# mapping
STAR --runMode alignReads --runThreadN 8 --genomeDir /home/matt/data/mus_ERCC_index --limitBAMsortRAM 8000000000 --outFilterMismatchNmax 20 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --readFilesIn V01__1_IR.fastq --outSAMmode NoQS --outSAMtype BAM SortedByCoordinate --alignEndsType EndToEnd --outSAMattributes nM MD NH
mv Aligned.sortedByCoord.out.bam V01__1_IR.bam
echo -ne "Unique\t" >> V01__1_IR.reads.tsv
grep "Uniquely mapped reads number" Log.final.out | cut -f2 -d'|' | awk '{ print $1}' >> V01__1_IR.reads.tsv
echo -ne "Multi\t" >> V01__1_IR.reads.tsv
grep "Number of reads mapped to multiple loci" Log.final.out | cut -f2 -d'|' | awk '{ print $1}' >> V01__1_IR.reads.tsv
mkdir -p /home/matt/data/./mapping/report
samtools index V01__1_IR.bam
echo -ne "Mitochondrial\t" >> V01__1_IR.reads.tsv
gedi -t . -e Bam2CIT V01__1_IR.cit V01__1_IR.bam | tail -n1 | awk '{ print $2 }'>> V01__1_IR.reads.tsv
gedi -t . -e CorrectCIT V01__1_IR.cit
echo -ne "CIT\t" >> V01__1_IR.reads.tsv
gedi -t . -e ReadCount -g m_musculus.ens101 ERCC92 -m Weight V01__1_IR.cit | tail -n1 | awk '{ print $2 }'>> V01__1_IR.reads.tsv
mv *.readlengths.* /home/matt/data/./mapping/report
mv V01__1_IR.reads.tsv /home/matt/data/./mapping/report
if [ -f V01__1_IR.cit ]; then
mv V01__1_IR.cit* /home/matt/data/./mapping
mkdir -p /home/matt/data/./mapping/bams
mv V01__1_IR.bam* /home/matt/data/./mapping/bams
cd /home/matt/data/./mapping
rm -rf /tmp/V01__1_IR
else
(>&2 echo "There were some errors, did not delete temp directory!")
cd /home/matt/data/./mapping
fi
Rscript report/V01__1_IR.reads.R
echo '{"plots":[{"section":"Mapping statistics","id":"mappingV01__1_IR","title":"V01__1_IR","description":"How many reads are removed by adapter trimming and rRNA mapping, how many are mapped to which reference and retained overall.","img":"V01__1_IR.reads.png","script":"V01__1_IR.reads.R","csv":"V01__1_IR.reads.tsv"}]}' > report/V01__1_IR.reads.report.json
Looks like the issue may be pre-STAR.
I've not come across gedi
before, but it sounds like their pipeline requires the read queries to be integers:
https://github.com/erhard-lab/price/issues/3
I'm not at all proficient in Java, but it from the source code, it appears the FastqFilter
tool 'reindexes reads', but can also extract UMIs (-umi
) and append these to the read names. If so, you will need to use gedi
to perform the UMI extraction in order for the resultant BAMs to be compatible with umi_tools dedup
https://github.com/erhard-lab/gedi/blob/cb99bb6622bb14be24dfa229409fd67d74d83657/GediPreprocess/src/executables/FastqFilter.java#L80
I can't find any documentation to confirm this though. Perhaps raise an issue here https://github.com/erhard-lab/gedi and link to this one.
Thank you very much!
Hi, I'm completely new to bioinformatical analysis, so I hope this is enough information to diagnose my problem.
I got the following error after running
umi_tools dedup -I input.bam --output-stats=deduplicated -S output.bam
Head of BAM file (
samtools view input.bam | head -n 6
):Head of fastq file:
Here my UMIs are ACGGCC, ATGGAA, and AAGAAC. I put the first 100 000 lines of the fastq file in a text document and couldn't find any UMIs of length 7. I also don't understand how UMIs are stored in BAM files.