fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
309 stars 67 forks source link

ZipperBams failed. Error: processed all unmapped reads but there are mapped reads remaining to be read. #978

Open teryyoung opened 5 months ago

teryyoung commented 5 months ago

hello! I'm doing with sequencing data with UMI, I followed #fgbio-best-practise-fastq---consensus-pipeline When I run fgbio ZipperBams to merge my consensus.ubam and consensus.mapped.bam, there is an exception: Exception in thread "main" java.lang.IllegalStateException: Error: processed all unmapped reads but there are mapped reads remaining to be read.

and I check my two bam files; they all are queryname sorted, have same number of reads, the difference between them is all of queryname of ubam have suffix "/A". like this: image image I found it is caused by fgbio CallConsensusReads...

I wanna know whether this is the problem caused ZipperBams failed, or how to deal with this problem(ZipperBams)?

my fgbio version is 2.2.2-3a74fd2-SNAPSHOT

nh13 commented 5 months ago

Somehow the /A is getting lost between the uBAM and the mapped BAM. I see your filename has picard in it, but we don't use picard in the best-practices pipeline. Can you provide the commands you used to go from the uBAM to the mapped BAM?

teryyoung commented 4 months ago

I used STAR to map the uBAM, so the code looks weird, 1)First I FastqtoBam the raw reads, and picard sorted it; java -Xmx10g -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar FastqToBam \ --input $r1 $r2 \ --read-structures 5M2S+T 5M2S+T \ --sample ${sample} \ --library ${sample} \ --platform-unit BGIT7 \ --output ./0.1.uBAM/${sample}.uBAM java -Xmx10G -jar /share/apps/softwares/picard-2.26.2/picard.jar SortSam I=0.1.uBAM/${sample}.uBAM \ O=0.1.uBAM/${sample}.picard.sorted.uBAM\ SORT_ORDER=queryname

2)then I extract fastq from uBAM, mapped them using STAR, forgive me that I didn't find ways to mapping one fastq file using STAR. samtools fastq 0.1.uBAM/${sample}.picard.sorted.uBAM -1 tmp/${sample}.1.fq.bak -2 tmp/${sample}.2.fq.bak

awk 'NR%4==1 {print $0"/1"} NR%4 !=1 {print $0}' tmp/${sample}.1.fq.bak >tmp/${sample}.uBAM_1.fq awk 'NR%4==1 {print $0"/2"} NR%4 !=1 {print $0}' tmp/${sample}.2.fq.bak >tmp/${sample}.uBAM_2.fq

ref_dir="~/reference_genome/GRC38_p14/hg38_ucsc/STAR_index/" rm tmp/${sample}/ -rf /share/apps/softwares/STAR-2.7.10b/bin/Linux_x86_64_static/STAR --twopassMode Basic\ --quantMode TranscriptomeSAM GeneCounts\ --runThreadN 6\ --genomeDir ${ref_dir}\ --alignIntronMin 20\ --alignIntronMax 50000\ --outSAMtype BAM Unsorted\ --outSAMattrRGline ID:"${sample}" SM:"${sample}" PL:BGI\ --outFilterMismatchNmax 2\ --outSJfilterReads Unique --outSAMmultNmax 1\ --outFileNamePrefix ./0.1.uBAM/${sample}/${sample}.\ --outSAMmapqUnique 60\ --outSAMunmapped Within KeepPairs\ --readFilesCommand cat\ --readFilesIn tmp/${sample}.uBAM_1.fq tmp/${sample}.uBAM_2.fq\ --outTmpDir ./tmp/${sample} 3)After first mapping, I sorted result bam file using picard, and ZipperBams; java -Xmx10G -jar /share/apps/softwares/picard-2.26.2/picard.jar SortSam I=0.1.uBAM/${sample}/${sample}.Aligned.out.bam\ O=0.1.uBAM/${sample}/${sample}.mapped.picard.sorted.BAM\ SORT_ORDER=queryname java -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar ZipperBams \ --unmapped 0.1.uBAM/${sample}.picard.sorted.uBAM \ --input 0.1.uBAM/${sample}/${sample}.mapped.picard.sorted.BAM \ --ref ${ref}/../hg38.p14.fa \ --output 0.1.uBAM/${sample}.umi.mapped.bam

All commands above runs fine, but After I GroupReadsByUmi, CallMolecularConsensusReads, FilterConsensusReads, and mapping them sencond time, ZipperBams have problem; 4)java -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar GroupReadsByUmi \ --input=${dir01}/1.umi.mapped.bam \ --output=${dir02}/1.umi.group.BAM \ --strategy=paired \ --min-map-q=10 \ --edits=1 \ --raw-tag=RX \ --family-size-histogram=${dir02}/1.umi.tag_family_histogram.txt`

java -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar CallMolecularConsensusReads \ --min-reads=1 \ --min-input-base-quality=20 \ --input=${dir02}/1.umi.group.BAM \ --output=${dir03}/1.consensus.uBAM \ --threads=4 \ --read-group-id=false

java -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar FilterConsensusReads \ --input=${dir03}/1.consensus.uBAM \ --output=${dir03}/1.consensus.filter.uBAM \ --ref="${ref}/../hg38.p14.fa" --min-reads=3 \ --max-read-error-rate=0.2 \ --max-base-error-rate=0.2 \ --min-base-quality=30 \ --max-no-call-fraction=0.20

java -Xmx10G -jar /share/apps/softwares/picard-2.26.2/picard.jar SortSam \ I=${dir03}/1.consensus.filter.uBAM\ O=${dir03}/1.consensus.filter.sorted.uBAM\ SORT_ORDER=queryname

samtools fastq ./${dir03}/1.consensus.filter.sorted.uBAM -1 tmp/1.consensus.uBAM_1.fq.bak -2 tmp/1.consensus.uBAM_2.fq.bak awk 'NR%4 ==1 {print $0"/1"} NR%4 !=1 {print$0}' tmp/1.consensus.uBAM_1.fq.bak >tmp/1.consensus.uBAM_1.fq awk 'NR%4==1 {print $0"/2"} NR%4!=1 {print$0}' tmp/1.consensus.uBAM_2.fq.bak >tmp/1.consensus.uBAM_2.fq

STAR alignment code is like it above, and then picard sorted; then ZipperBams; java -Xmx10G -jar /share/apps/softwares/fgbio/target/scala-2.13/fgbio-2.2.2-3a74fd2-SNAPSHOT.jar ZipperBams \ --unmapped ./${dir03}/1.consensus.filter.sorted.uBAM \ --ref ${ref}/../hg38.p14.fa \ --input ${dir03}/1.consensus.mapped.picard.sorted.bam \ --output ${dir03}/1.consensus.mapped.zipper.bam

Tips: using picard sortBam is because I used to use picard mergeBam to merge unmapped and mapped reads; I know the sort_order need to be the same when using picard mergeBam; (why I didnt use samtools sort is that some BAM files didnt support samtools's sort somehow.)

teryyoung commented 4 months ago

Did you mean that '/A' is necessary for ZipperBams? And that should be retain whatever I run command on it? now I think my fault is misuse the STAR alignment.