Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International
147 stars 47 forks source link

TagBamWithReadSequenceExtended: ArrayIndexOutOfBoundsException: 10 #22

Closed jenzopr closed 6 years ago

jenzopr commented 6 years ago

Hi Patrick,

we finally have our first dropSeq data sequenced and I tried to run in through the dropSeqPipe. Unfortunately, TagBamWithReadSequenceExtended terminates after a short while giving:

(dropSeqPipe) jens@KI-V0205:/data/dropTest/dropSeqPipe$ drop-seq-tools-wrapper.sh -m 60g -t ./tmp/ -p TagBamWithReadSequenceExtended SUMMARY=logs/test_CELL_barcode.txt BASE_RANGE=1-12 BASE_QUALITY=30 BARCODED_READ=1 DISCARD_READ=false TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1 INPUT=data/test_unaligned.bam OUTPUT=data/test_BC_tagged_unmapped.bam
+ java -Xmx60g -Djava.io.tmpdir=./tmp/ -jar /mnt/software/x86_64/packages/dropSeqTools/1.13/jar/dropseq.jar TagBamWithReadSequenceExtended SUMMARY=logs/test_CELL_barcode.txt BASE_RANGE=1-12 BASE_QUALITY=30 BARCODED_READ=1 DISCARD_READ=false TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1 INPUT=data/test_unaligned.bam OUTPUT=data/test_BC_tagged_unmapped.bam
[Tue Feb 27 11:17:01 CET 2018] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended INPUT=data/test_unaligned.bam OUTPUT=data/test_BC_tagged_unmapped.bam SUMMARY=logs/test_CELL_barcode.txt BASE_RANGE=1-12 BARCODED_READ=1 DISCARD_READ=false BASE_QUALITY=30 NUM_BASES_BELOW_QUALITY=1 TAG_NAME=XC    TAG_BARCODED_READ=false HARD_CLIP_BASES=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Feb 27 11:17:01 CET 2018] Executing as jens@KI-V0205 on Linux 3.16.0-5-amd64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
[Tue Feb 27 11:17:05 CET 2018] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=1015021568
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10
    at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.scoreBaseQuality(TagBamWithReadSequenceExtended.java:251)
    at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.processReadPair(TagBamWithReadSequenceExtended.java:220)
    at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.doWork(TagBamWithReadSequenceExtended.java:154)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
    at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

I assume that my fastq/bam is at some place malformed (e.g. an R1 read is too short), although I can't imagine. Is there any debug flag I can place to get more hints about what fails? Or can you quickly dig into TagBamWithReadSequenceExtended.java at line 251 and tell me whats happening there?

Best and thanks! Jens

Edit: I can share the data/test_unaligned.bam file, if needed!

Hoohm commented 6 years ago

Can you check the index values of your cell and UMI barcodes?

Your assumption is right, there is a problem with read1. I'm not sure what though.

how long is your read1? Are they all the same length? Did you run a trimming algo on read1?

Maybe check out your fastqc on read1.

The code is in the public folder in dropseqtools

private int scoreBaseQuality(final SAMRecord barcodedRead, final List baseRanges) { int numBasesBelowQuality=0; byte [] qual= barcodedRead.getBaseQualities(); char [] seq = barcodedRead.getReadString().toUpperCase().toCharArray(); for (BaseRange b: baseRanges) for (int i=b.getStart()-1; i<b.getEnd(); i++) { line 251---->>> byte q = qual[i]; char s = seq[i];

            if (q < this.BASE_QUALITY || s=='N')
                numBasesBelowQuality++;
        }

From what I understand, qual[i] is out of range, meaning maybe the quality line is missing some values.

Could you send a few lines of your read1?

2018-02-27 11:23 GMT+01:00 Jens Preußner notifications@github.com:

Hi Patrick,

we finally have our first dropSeq data sequenced and I tried to run in through the dropSeqPipe. Unfortunately, TagBamWithReadSequenceExtended terminates after a short while giving:

(dropSeqPipe) jens@KI-V0205:/data/dropTest/dropSeqPipe$ drop-seq-tools-wrapper.sh -m 60g -t ./tmp/ -p TagBamWithReadSequenceExtended SUMMARY=logs/test_CELL_barcode.txt BASE_RANGE=1-12 BASE_QUALITY=30 BARCODED_READ=1 DISCARD_READ=false TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1 INPUT=data/test_unaligned.bam OUTPUT=data/test_BC_tagged_unmapped.bam

  • java -Xmx60g -Djava.io.tmpdir=./tmp/ -jar /mnt/software/x86_64/packages/dropSeqTools/1.13/jar/dropseq.jar TagBamWithReadSequenceExtended SUMMARY=logs/test_CELL_barcode.txt BASE_RANGE=1-12 BASE_QUALITY=30 BARCODED_READ=1 DISCARD_READ=false TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1 INPUT=data/test_unaligned.bam OUTPUT=data/test_BC_tagged_unmapped.bam [Tue Feb 27 11:17:01 CET 2018] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended INPUT=data/test_unaligned.bam OUTPUT=data/test_BC_tagged_unmapped.bam SUMMARY=logs/test_CELL_barcode.txt BASE_RANGE=1-12 BARCODED_READ=1 DISCARD_READ=false BASE_QUALITY=30 NUM_BASES_BELOW_QUALITY=1 TAG_NAME=XC TAG_BARCODED_READ=false HARD_CLIP_BASES=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json [Tue Feb 27 11:17:01 CET 2018] Executing as jens@KI-V0205 on Linux 3.16.0-5-amd64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033) [Tue Feb 27 11:17:05 CET 2018] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended done. Elapsed time: 0.07 minutes. Runtime.totalMemory()=1015021568 Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10 at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.scoreBaseQuality(TagBamWithReadSequenceExtended.java:251) at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.processReadPair(TagBamWithReadSequenceExtended.java:220) at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.doWork(TagBamWithReadSequenceExtended.java:154) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94) at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

I assume that my fastq/bam is at some place malformed (e.g. an R1 read is too short), although I can't imagine. Is there any debug flag I can place to get more hints about what fails? Or can you quickly dig into TagBamWithReadSequenceExtended.java at line 251 and tell me whats happening there?

Best and thanks! Jens

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Hoohm/dropSeqPipe/issues/22, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNXaOliRCTsjxPThdXg6mgAqRM0h_HAks5tY9eJgaJpZM4SUr4s .

--

Roelli Patrick Division of Animal Physiology and Immunology TUM School of Life Sciences Weihenstephan Technische Universität München Weihenstephaner Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland

https://github.com/Hoohm https://github.com/Hoohm

jenzopr commented 6 years ago

We identified bcl2fastq to be causing problems when giving a Nextera adapter sequence to trim away. It will also trim R1 reads, so our R1 read length distribution is highly variable:

      1 10
     11 13
     69 14
      9 15
     24 16
     48 17
    111 18
  62328 19
 187397 20
      1 8

Filtering reads shorter than 20 bp before putting it into dropSeqPipe helped. Thanks for your help!

Hoohm commented 6 years ago

Sure no problem. Quick question, once you filtered out the reads from R1 with less than 20bp, how do you reconcile it with R2?

Yichel518 commented 5 years ago

Hello,I met a block in my analysis of dropSeq data when I used TagBamWithReadSequenceExtended, and my tools is Drop-seq_tools-2.2.0,and my command is "tuyx@ubuntu:~/Drop_Seq$ /home/tuyx/Drop_Seq/Drop-seq_tools-2.2.0/TagBamWithReadSequenceExtended INPUT=/home/tuyx/Drop_Seq/3-somite_Rep_1.ubam OUTPUT=3-somite_Rep_1_unaligned_tagged_Cell.bam SUMMARY=3-somite_Rep_1_unaligned_tagged_Cell.bam_summary.txt BASE_RANGE=1-12 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=False TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1" and my error is" image 15:12:52.051 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/tuyx/Drop_Seq/Drop-seq_tools-2.2.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so 15:12:52.084 WARN NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory) 15:12:52.085 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/tuyx/Drop_Seq/Drop-seq_tools-2.2.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so 15:12:52.086 WARN NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory) [Tue Jul 02 15:12:52 CST 2019] TagBamWithReadSequenceExtended INPUT=/home/tuyx/Drop_Seq/3-somite_Rep_1.ubam OUTPUT=3-somite_Rep_1_unaligned_tagged_Cell.bam SUMMARY=3-somite_Rep_1_unaligned_tagged_Cell.bam_summary.txt BASE_RANGE=1-12 BARCODED_READ=1 DISCARD_READ=false BASE_QUALITY=10 NUM_BASES_BELOW_QUALITY=1 TAG_NAME=XC TAG_BARCODED_READ=false HARD_CLIP_BASES=false TAG_QUALITY=XQ VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false [Tue Jul 02 15:12:52 CST 2019] Executing as tuyx@ubuntu on Linux 4.4.0-148-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_181-8u181-b13-1ubuntu0.16.04.1-b13; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.2.0(4709ee9_1553016809) [Tue Jul 02 15:12:52 CST 2019] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=1011351552 Exception in thread "main" htsjdk.samtools.SAMException: Cannot read non-existent file: file:///home/tuyx/Drop_Seq/3-somite_Rep_1.ubam at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:430) at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:417) at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.doWork(TagBamWithReadSequenceExtended.java:100) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103) at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42) ". Can you help me ?

Hoohm commented 5 years ago

What's a ubam file??