broadinstitute / Drop-seq

Java tools for analyzing Drop-seq data
MIT License

TagBamWithReadSequenceExtended:Base [10] was requested, but the read isn't long enough [CCCCAAATT] #134

Closed Yichel518 closed 5 years ago

Yichel518 commented 5 years ago

Hello,

When I run

Drop-seq_tools-2.3.0/TagBamWithReadSequenceExtended \
INPUT="/home/yxtu/result/ubam/shiels_Rep_1.ubam" OUTPUT=/home/yxtu/result/taggedbam//shiels_Rep_1_unaligned_tagged_Cell.bam \
SUMMARY=/home/yxtu/result/taggedbam//shiels_Rep_1_unaligned_tagged_Cell.bam_summary.txt \
BASE_RANGE=1-12 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=False TAG_NAME=XC \
NUM_BASES_BELOW_QUALITY=1

I get the following warnings and error:

INFO 2019-07-03 08:48:48 TagBamWithReadSequenceExtended

** NOTE: Picard's command line syntax is changing.


** For more information, please see: ** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)


** The command line looks like this in the new syntax:


** TagBamWithReadSequenceExtended -INPUT /home/yxtu/result/ubam/shiels_Rep_1.ubam -OUTPUT /home/yxtu/result/taggedbam//shiels_Rep_1_unaligned_tagged_Cell.bam -SUMMARY /home/yxtu/result/taggedbam//shiels_Rep_1_unaligned_tagged_Cell.bam_summary.txt -BASE_RANGE 1-12 -BASE_QUALITY 10 -BARCODED_READ 1 -DISCARD_READ False -TAG_NAME XC -NUM_BASES_BELOW_QUALITY 1


08:48:48.396 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/yxtu/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
08:48:48.403 WARN NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
08:48:48.404 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/yxtu/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
08:48:48.404 WARN NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
[Wed Jul 03 08:48:48 CST 2019] TagBamWithReadSequenceExtended INPUT=/home/yxtu/result/ubam/shiels_Rep_1.ubam OUTPUT=/home/yxtu/result/taggedbam/shiels_Rep_1_unaligned_tagged_Cell.bam SUMMARY=/home/yxtu/result/taggedbam/shiels_Rep_1_unaligned_tagged_Cell.bam_summary.txt BASE_RANGE=1-12 BARCODED_READ=1 DISCARD_READ=false BASE_QUALITY=10 NUM_BASES_BELOW_QUALITY=1 TAG_NAME=XC TAG_BARCODED_READ=false HARD_CLIP_BASES=false TAG_QUALITY=XQ VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Jul 03 08:48:48 CST 2019] Executing as yxtu@login1 on Linux 3.10.0-327.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_65-b17; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.3.0(34e6572_1555443285)
08:48:48.420 WARN IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
08:48:48.449 WARN IntelDeflaterFactory - IntelDeflater is not supported, using Java.util.zip.Deflater
[Wed Jul 03 08:48:48 CST 2019] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1009254400
Exception in thread "main" org.broadinstitute.dropseqrna.TranscriptomeException: Base [10] was requested, but the read isn't long enough [CCCCAAATT]
    at org.broadinstitute.dropseqrna.utils.BaseQualityFilter.scoreBaseQuality(BaseQualityFilter.java:45)
    at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.processSingleRead(TagBamWithReadSequenceExtended.java:163)
    at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.doWork(TagBamWithReadSequenceExtended.java:132)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

This seems to show not only a problem in the program but also that my input reads are only 10 bases long. What should I do?

Hi @alecw, I know what you mean. I looked at the message carefully and the read really is only 10 bp, but I don't know why. Is it related to how I downloaded the fastq file? I downloaded the data from the article, and the description says it is paired-end data, but I only have one sample per sample when I download the data from EBI: a single fastq file. The sra file also cannot be split into two files by fastq-dump - -split-. What is going on? Can I convert a fastq file directly into a ubam file? What should I do to get this right?

alecw commented 5 years ago

Hi @Yichel518 ,

What data did you download? Is it paired-end? What do you think your read lengths are? What do you mean "I only have one sample per sample?"

-Alec

Yichel518 commented 5 years ago

Hi @alecw, I downloaded a subset of the data from the article "Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis". It is from "Drop-seq analysis of wild-type (TLAB) zebrafish embryos from high to 6-somite stage (12 Timepoints)". I downloaded the fastq file again from EBI, but I found that although it is paired-end data, there is only one fastq file.

alecw commented 5 years ago

Hi @Yichel518 ,

Can you provide a download link for the fastq you downloaded?

Regards, Alec

Yichel518 commented 5 years ago

Hi @alecw, I downloaded from "https://www.ebi.ac.uk/ena/data/view/PRJNA417290", and one of the fastq links is ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR626/007/SRR6261597/SRR6261597.fastq.gz

Many thanks

Yichel518 commented 5 years ago

I have read about some other data sets, and PAIRED data must have two fastq files or a BAM file. I don't understand why I downloaded only one fastq. Have these files already been processed?

jamesnemesh commented 5 years ago

I tried to pull 1 SRA file from ENA and extract the fastq files (yes, they should be paired) so that you could start to reprocess the data.

———————
Download SRA toolkit: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

Run fastq-dump: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

Get your data:
wget ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR626/007/SRR6261577

Generate paired fastq files:
fastq-dump -I --split-files SRR6261577

Look at fastq file:
more SRR6261577_1.fastq
@SRR6261577.1.1 1 length=9
CCCCAAATT
+SRR6261577.1.1 1 length=9
AAAAAEEEE
@SRR6261577.2.1 2 length=9
CCCCAAATT
+SRR6261577.2.1 2 length=9
AAAAAEEEE
@SRR6261577.3.1 3 length=62
CAGTTTTCAGAGATTAATTTCAGTGTTTAATTTTCACTGCTGAAGGTCAAAACAATAGAGAA

————————————————
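The read lengths seen in the fastq above can be tallied for the whole file with a short script. This is only a minimal sketch, not part of Drop-seq tools: the function names `read_lengths` and `too_short` are made up for illustration, it assumes plain 4-line fastq records (no wrapped sequences), and the threshold of 12 is taken from the `BASE_RANGE=1-12` setting in the original command.

```python
import gzip
import sys
from collections import Counter
from typing import Iterable


def read_lengths(fastq_lines: Iterable[str]) -> Counter:
    """Count how many reads have each length, assuming 4 lines per record."""
    lengths: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the second line of each record
            lengths[len(line.strip())] += 1
    return lengths


def too_short(lengths: Counter, min_len: int) -> int:
    """Number of reads with fewer than min_len bases."""
    return sum(n for length, n in lengths.items() if length < min_len)


if __name__ == "__main__":
    path = sys.argv[1]
    min_len = 12  # assumption: last base requested by BASE_RANGE=1-12
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        lengths = read_lengths(handle)
    for length in sorted(lengths):
        print(f"{length}\t{lengths[length]}")
    print(f"reads shorter than {min_len}: {too_short(lengths, min_len)}")
```

Run it as, for example, `python read_lengths.py SRR6261577_1.fastq` (the script name is hypothetical). Any read counted as shorter than 12 bases would be expected to trigger the same "read isn't long enough" error when a 12-base barcode range is requested.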

Here’s the conclusion I’d draw from this:

The fastq files on offer have already been processed. The read lengths are variable because they’ve already been polyA trimmed and adapter trimmed. Unfortunately, when NCBI receives the data, they, for no particularly good reason, get rid of all the BAM tags, which means the cell and molecular barcodes are lost, rendering the data unusable. One of the members of our lab found out that this had happened to a data set they submitted, and they had to contact NCBI to properly recover the data. Supposedly they were able to recover that data in a usable format - at least the BAM files with the biological read AND the cell/molecular barcode tags, so that an expression matrix could be extracted from the data set.
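If a BAM file for such a data set is available, one rough way to see whether the barcode tags survived is to count them in the SAM text output. The sketch below is an illustration only: `tag_counts` is a made-up helper, `XC` is the cell barcode tag name used in the command earlier in this thread, and `XM` as the molecular barcode tag is an assumption based on the usual Drop-seq convention.

```python
import sys
from typing import Dict, Iterable, Sequence, Tuple


def tag_counts(
    sam_lines: Iterable[str], tags: Sequence[str] = ("XC", "XM")
) -> Tuple[int, Dict[str, int]]:
    """Return (number of records, number of records carrying each tag)."""
    counts = {tag: 0 for tag in tags}
    total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        total += 1
        fields = line.rstrip("\n").split("\t")
        # Optional fields start at column 12 and look like TAG:TYPE:VALUE
        present = {field.split(":", 1)[0] for field in fields[11:]}
        for tag in tags:
            if tag in present:
                counts[tag] += 1
    return total, counts


if __name__ == "__main__":
    total, counts = tag_counts(sys.stdin)
    print(f"records: {total}")
    for tag, n in counts.items():
        print(f"{tag}: {n}")
```

It could be fed with something like `samtools view file.bam | python check_tags.py` (both file names are placeholders). Counts of zero for both tags would be consistent with the tags having been stripped.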

It seems like if you want to use this data, you’ll have to contact the Schier Lab so they can talk to NCBI to get those tags restored - until then there’s not much you can do with the data aside from pretending it’s bulk RNASeq data.

-Jim Nemesh

Yichel518 commented 5 years ago

Many thanks @jamesnemesh. My original intention was to learn how to reconstruct the trajectory of cell development. I am not sure whether the Schier Lab can help me, so I would like to ask whether your laboratory has any similar public data that we could use to reconstruct a cell development tree. Regards, Yichel

jamesnemesh commented 5 years ago

You might direct your question to the google group (https://groups.google.com/forum/#!forum/dropseq). This GitHub is strictly for support of our software, and you’re clearly far away from this being a software issue.

-Jim

Yichel518 commented 5 years ago

You are right, thank you very much.