fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
309 stars 67 forks source link

{readname} did not have a primary R1 record. (GroupReadsByUmi) #995

Open xiechangxiao opened 2 months ago

xiechangxiao commented 2 months ago

when I use GroupReadsByUmi group reads, this error occurs.@nh13 cmd: java -jar fgbio-2.2.1.jar --tmp-dir=/tmp GroupReadsByUmi --input=test.bam --output=test.umi.group.BAM --strategy=paired --min-map-q=20 --edits=1 --raw-tag=RX --family-size-histogram=test.family.size.histogram

error message: Exception in thread "main" java.lang.IllegalStateException: {readname} did not have a primary R1 record

nh13 commented 2 months ago

Can you tell us how you created the test bam, and the provide a test case (BAM) to debug? Did you sort the reads prior?

xiechangxiao commented 2 months ago

First, mapping with BWA: bwa mem -M -K 10000000 -R "@RG\tID:1\tPL:Illumina\tLB:hg19\tSM:tumor" hg19.fasta Tumor_1.clean.fq.gz Tumor_2.clean.fq.gz |samtools view -b -o Tumor.raw.bam - && samtools sort Tumor.raw.bam -o Tumor.sort.bam -O bam Then, extract MQ and RX: samtools view -h Tumor.sort.bam | mawk '/^@/ {print;next} {N=split($1,n,":");gsub("_", "-", n[N]);print $0 "\tMQ:i:" $5 "\tRX:Z:" n[N]}' |samtools view -b - >test.bam Finally, use GroupReadsByUmi and get errors: Exception in thread "main" java.lang.IllegalStateException: E250006338L1C038R0044055987:GTGAG_GCACG did not have a primary R1 record. But, the flag value of this paired read is 113 and 177.@nh13

nh13 commented 2 months ago

Can you confirm there is a R1 primary record in your test.bam "E250006338L1C038R0044055987:GTGAG_GCACG"? Either your RX/MQ extraction isn't working as intended, or your sort order into GroupReadsByUmi is wrong (try samtools sort --template-coordinate using the latest samtools). Having test.bam would be helpful to debug futher.

To extract the UMI bases, try FastqToBam or other tools in fgbio, which we support, and are used in our best-practices pipeline:

https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md

xiechangxiao commented 2 months ago

I provide a simple bam file for you debug.@nh13 test.zip If you remove "E250006338L1C038R0044055987:GTGAG_GCACG" from this bam file, GroupReadsByUmi will work properly. By the way, I've followed your best practices, but it's time consuming and space consuming. Personally, I think there could be easier way to build a consensus pipeline. For example:

  1. shift UMI to readname with fastp
  2. mapping with BWA
  3. GroupReadsByUmi and CallMolecularConsensusReads