Cell barcode too short error: trimming ?

colindaven commented 6 years ago

Hi,

this looks to me like I am trying to pick up a cell barcode of 12bp, but the read is shorter than 12bp (possibly due to quality trimming?). At least I see very short reads in the unmapped.bam file.

Does it make sense for drop-seq-pipe to check the length of the sequence to be tagged and avoid any errors with a try catch block, and further to exclude these sequences ?

We're talking a NextSEq 2x75bp run with R1 of 18-20bp and R2 of 60-62 bp.

Of course, I may have interpreted this error wrongly, I'm new to this.

cheers

+ java -Xmx4g -Djava.io.tmpdir=/lager2/rcug/2018/6E77/fastq2/tobias/tmp -jar /lager2/rcug/2018/6E77/fastq2/tobias/Drop-seq_tools-1.13/jar/dropseq.jar TagBamWithReadSequenceExtended SUMMARY=logs/PE_03_S4_CELL_barcode.txt BASE_RANGE=1-12 BASE_QUALITY=25 BARCODED_READ=1 DISCARD_READ=false TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1 INPUT=data/PE_03_S4_unaligned.bam OUTPUT=data/PE_03_S4_BC_tagged_unmapped.bam
Picked up _JAVA_OPTIONS: -Dhttp.proxyHost=172.24.2.50 -Dhttp.proxyPort=8080 -Dhttps.proxyHost=172.24.2.50 -Dhttps.proxyPort=8080
[Thu May 31 17:40:42 CEST 2018] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended INPUT=data/PE_03_S4_unaligned.bam OUTPUT=data/PE_03_S4_BC_tagged_unmapped.bam SUMMARY=logs/PE_03_S4_CELL_barcode.txt BASE_RANGE=1-12 BARCODED_READ=1 DISCARD_READ=false BASE_QUALITY=25 NUM_BASES_BELOW_QUALITY=1 TAG_NAME=XC    TAG_BARCODED_READ=false HARD_CLIP_BASES=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu May 31 17:40:42 CEST 2018] Executing as rcug@hpc-rc03 on Linux 4.4.0-109-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
[Thu May 31 17:40:42 CEST 2018] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 11
        at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.scoreBaseQuality(TagBamWithReadSequenceExtended.java:251)
        at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.processReadPair(TagBamWithReadSequenceExtended.java:220)
        at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.doWork(TagBamWithReadSequenceExtended.java:154)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
        at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
    Error in rule BC_tags:
        jobid: 94
        output: data/PE_03_S4_BC_tagged_unmapped.bam, logs/PE_03_S4_CELL_barcode.txt

Hoohm commented 6 years ago

Hello @colindaven Your R1 should be consistent in terms of length. It's either full (meaning 20?) or nothing. The reason for that is the way cell barcode and UMI are captured.

Basically, the index you give in the config file will be used to get the cell and UMI barcode. If part of the cell + UMI is missing, it will crash.

I would delete all pair of reads that have a non full length R1. You could also not filter at all since this is taken care of later in the pipeline for R2.

Do you need any other information?

colindaven commented 6 years ago

Hi @Hoohm

that's really helpful. Thanks. I have lots of other seq lengths in my R1 actually, from 1-20 bp.

For others having this problem, here's a solution to filter by length then repair the pairs:

filter to length 20bp seqtk seq -L 20 A1_03_S2_R1.fastq.gz > A1_03_S2_2_R1.fastq &

repair the read pairs. repair.sh - from bbmap

repair.sh -Xmx400g in=A1_03_S2_2_R1.fastq in2=A1_03_S2_2_R2.fastq out1=A1_03_S2_2b_R1.fastq out2=A1_03_S2_2b_R2.fastq outs=singletons1.fq overwrite=true

Hoohm / dropSeqPipe

Cell barcode too short error: trimming ? #35