broadinstitute / Drop-seq

Java tools for analyzing Drop-seq data
MIT License

polyAtrimming NullPointerException #217

Closed Justgitacc closed 3 years ago

Justgitacc commented 3 years ago


I recently ran into the error below while running PolyATrimmer as described in the cookbook:

org.broadinstitute.dropseqrna.readtrimming.PolyATrimmer done. Elapsed time: 0.00 minutes.
Exception in thread "main" java.lang.NullPointerException

The output file is 0 KB. I've checked recent posts about this issue and made sure that the XC and XM tags are present:

SRR11862674.1 77 0 0 0 0 GTCGGNTGAACCGGAGATCT AAAAA#EEEEEEEEEEEEEE RG:Z:A
SRR11862674.10 77 0 0 0 0 CATGTNGTGTCAGAGCCGAC AAAAA#EEEEEEEAEEEEEE RG:Z:A
SRR11862674.100 77 0 0 0 0 GTGTCGGCTTTGCCTGAGAA AAAAAEEEEEEEEEEEEEEE RG:Z:A
SRR11862674.100 141 0 0 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ################################################################ XC:Z:GTGTCGGCTTTG RG:Z:A XM:Z:CCTGAGAA
SRR11862674.1000 77 0 0 0 0 ACAAAGGCTTGGGTTTGACT /AAAAEEEEEEEEEEEEAEE RG:Z:A
SRR11862674.100000 141 0 0 0 0 GCCTTCCTTCTTCTCTATCGTATCCTCTGGCATCCATGAAACTACATTCAATTGGATGATGAAA /AAA666A//6//6///A///////////6/A/<A/</6A/6</A/6<///6<///66/6//// XC:Z:CAGGGCTCACAT RG:Z:A XM:Z:TTTTGAAC

Before this, I had mistakenly processed the Drop-seq Read1 and Read2 files separately (all the way from the original split FASTQ files produced by fastq-dump --splitt through to DGE), and the counts looked reasonable. After reading the cookbook more closely I realized that was wrong, so I have now combined Read1 (barcode read) and Read2 (biological read) via FastqToSam before running the alignment pipeline.

So I have a few questions regarding this issue:

  1. Am I interpreting the procedure correctly? Downloaded Drop-seq data come as two separate reads. Should they be combined via FastqToSam before following the alignment cookbook, or should the two reads from a single Drop-seq sequence be processed separately until STAR alignment and merged afterwards?

  2. If they should be combined via FastqToSam and processed through the alignment pipeline together, what is causing the error in PolyATrimmer? (I've also tried running PolyATrimmer without USE_NEW_TRIMMER=true, which works but causes another error when running SamToFastq in the subsequent step.)

I apologize for the extended questions, and I greatly appreciate any information.

jamesnemesh commented 3 years ago

When you process the data, the following happens:

  1. You start off with paired reads (2 reads that have the same name). If you started with fastq files, then you’d combine them into a single unmapped BAM via FastqToSam.

  2. You extract the cell and molecular barcode from the shorter read that contains that information, and put it on the other read as tags XC and XM.

  3. As part of extracting the second tag, you remove the shorter read from the BAM. The library is now unpaired reads: 1 read per read name.
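The tagging and discarding steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how TagBamWithReadSequenceExtended is implemented; `TaggedRead`, `base_range` and `tag_and_discard` are hypothetical names, the base ranges 1-12 and 13-20 are the ones used in this thread, and the biological read is a placeholder string.

```python
from typing import NamedTuple


class TaggedRead(NamedTuple):
    """One unpaired record: the biological read plus its barcode tags."""
    name: str
    sequence: str
    tags: dict


def base_range(seq: str, start: int, end: int) -> str:
    """Return bases start..end, 1-based and inclusive (as in BASE_RANGE=1-12)."""
    return seq[start - 1:end]


def tag_and_discard(name: str, barcode_read: str, biological_read: str) -> TaggedRead:
    """Copy the cell (XC) and molecular (XM) barcodes onto the biological read
    and drop the barcode read, leaving one record per read name."""
    tags = {
        "XC": base_range(barcode_read, 1, 12),
        "XM": base_range(barcode_read, 13, 20),
    }
    return TaggedRead(name, biological_read, tags)


# Barcode read taken from the SRR11862674.100 record posted above;
# the biological read here is a placeholder.
read = tag_and_discard("SRR11862674.100", "GTGTCGGCTTTGCCTGAGAA", "N" * 64)
print(read.tags)  # {'XC': 'GTGTCGGCTTTG', 'XM': 'CCTGAGAA'}
```

The result matches the XC and XM tags shown on the second SRR11862674.100 record in the original post, which suggests the tags themselves were extracted correctly.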

I’d double check all of this is true first by looking at your BAM after each step.
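One way to make that check concrete is to count records per read name in the SAM text (for example the output of `samtools view -h`). The sketch below assumes tab-separated SAM lines; `names_with_multiple_records` is a hypothetical helper, and the example records are abbreviated, synthetic versions of the ones posted above.

```python
from collections import Counter


def names_with_multiple_records(sam_lines):
    """Return read names that occur more than once among SAM alignment lines.

    Header lines (starting with '@') and blank lines are skipped.
    """
    counts = Counter(
        line.split("\t", 1)[0]
        for line in sam_lines
        if line.strip() and not line.startswith("@")
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Synthetic, shortened records modelled on the ones in this thread.
example = [
    "@HD\tVN:1.6\tSO:queryname",
    "SRR11862674.1\t77\t*\t0\t0\t*\t*\t0\t0\tGTCGG\tAAAAA\tRG:Z:A",
    "SRR11862674.100\t77\t*\t0\t0\t*\t*\t0\t0\tGTGTC\tAAAAA\tRG:Z:A",
    "SRR11862674.100\t141\t*\t0\t0\t*\t*\t0\t0\tNNNNN\t#####\tXC:Z:GTGTCGGCTTTG\tRG:Z:A\tXM:Z:CCTGAGAA",
]
print(names_with_multiple_records(example))  # ['SRR11862674.100']
```

After the second tagging step, this list should be empty if the barcode read has been discarded.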

If all of that looks right to you, extract the header + the first 10 lines of the BAM and try the polyA trimmer. If it still null pointers, attach that file so we can take a look.



Justgitacc commented 3 years ago

Hi James,

Thanks for the walkthrough. That is indeed how I have processed my data so far, using the commands below.

Combine:

java -jar resources/picard.jar FastqToSam -F1 read1.fastq -F2 read2.fastq -O read.bam -SO queryname -SM MOE

Tagging:

java -jar resources/dropseq.jar TagBamWithReadSequenceExtended INPUT=read.bam OUTPUT=read_Cell.bam SUMMARY=read_Cellular.bam_summary.txt BASE_RANGE=1-12 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=False TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1

java -jar resources/dropseq.jar TagBamWithReadSequenceExtended INPUT=read_Cell.bam OUTPUT=read_Cell_Molecular.bam SUMMARY=read_Molecular.bam_summary.txt BASE_RANGE=13-20 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=False TAG_NAME=XM NUM_BASES_BELOW_QUALITY=1

Trimming:

java -jar resources/dropseq.jar FilterBam TAG_REJECT=XQ INPUT=read_Cell_Molecular.bam OUTPUT=read_filtered.bam

java -jar resources/dropseq.jar TrimStartingSequence INPUT=read_filtered.bam OUTPUT=read_trimmed_smart.bam OUTPUT_SUMMARY=read_trimming_report.txt SEQUENCE=AAGCAGTGGTATCAACGCAGAGTGAATGGG MISMATCHES=0 NUM_BASES=5

I also followed your suggestion: I extracted the header and the first 10 and 100 lines of the BAM file and ran the polyA trimmer on each:

samtools view -b read_trimmed_smart.bam | head -n 10 > test.bam
samtools view -b read_trimmed_smart.bam | head -n 100 > test100.bam

test.bam ran through PolyATrimmer without an error, but test100.bam (the first 100 lines) produced the same NullPointerException. GitHub does not allow .bam files as attachments.
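A note on subsetting: `samtools view -b` emits compressed binary BAM, so piping it through `head` cuts the stream at arbitrary newline bytes and will most likely leave a truncated file rather than the first N records. A safer approach is to subset the SAM text instead. The sketch below shows the filtering logic only; `head_sam` is a hypothetical helper and the example lines are synthetic.

```python
def head_sam(sam_lines, n_records):
    """Keep every header line plus the first n_records alignment lines.

    Assumes the header lines ('@' prefix) come before the alignments,
    as they do in SAM text.
    """
    kept, seen = [], 0
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)
        elif seen < n_records:
            kept.append(line)
            seen += 1
        else:
            break
    return kept


# Synthetic input: two header lines followed by three unmapped records.
example = [
    "@HD\tVN:1.6\tSO:queryname",
    "@RG\tID:A\tSM:MOE",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:A",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:A",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:A",
]
subset = head_sam(example, 2)
print(len(subset))  # 4: two header lines plus two records
```

In practice the input could come from `samtools view -h read_trimmed_smart.bam`, and the filtered text could be converted back to BAM with `samtools view -b`.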

jamesnemesh commented 3 years ago

Looking at your calls, it seems like there’s an error. The cookbook says:

Example Molecular Barcode:

TagBamWithReadSequenceExtended INPUT=unaligned_tagged_Cell.bam OUTPUT=unaligned_tagged_CellMolecular.bam SUMMARY=unaligned_tagged_Molecular.bam_summary.txt BASE_RANGE=13-20


Your call:

java -jar resources/dropseq.jar TagBamWithReadSequenceExtended INPUT=read_Cell.bam OUTPUT=read_Cell_Molecular.bam SUMMARY=read_Molecular.bam_summary.txt BASE_RANGE=13-20 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=False TAG_NAME=XM NUM_BASES_BELOW_QUALITY=1

Notice that the cookbook uses DISCARD_READ=TRUE. You have DISCARD_READ=False, so your reads are still paired when you run the trimming step. That should be an easy enough fix.
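The flag values in the records from the original post (77 and 141) show the same thing. The sketch below decodes a subset of the SAM flag bits; `decode_flag` is a hypothetical helper and only the bits relevant here are listed.

```python
# Subset of the SAM FLAG bits (values from the SAM specification).
SAM_FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x40: "first in pair",
    0x80: "second in pair",
}


def decode_flag(flag):
    """Return the labels of the listed bits that are set in a SAM flag."""
    return [label for bit, label in SAM_FLAG_BITS.items() if flag & bit]


print(decode_flag(77))   # ['paired', 'unmapped', 'mate unmapped', 'first in pair']
print(decode_flag(141))  # ['paired', 'unmapped', 'mate unmapped', 'second in pair']
print(decode_flag(4))    # ['unmapped']
```

Both flags have the "paired" bit set, so the barcode read was still present alongside the tagged read. Under the SAM specification, an unpaired, unmapped read would carry none of the pairing bits (for example flag 4).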


Justgitacc commented 3 years ago

Yes, I caught that error earlier and started the pipeline again. It worked all the way through! Thank you so much.