RichardCorbett opened this issue 4 years ago
You seem to have set it up and run it the right way; for some reason, however, umi-tools thinks there are no barcodes in your sample:
ValueError: --set-cell-barcode option specifies more cell barcodes than the number of observed cell barcodes
0 cell barcodes observed from 100000000 parsed reads
umi-tools estimates the number of barcodes from this file: /projects/rcorbettprj2/TechD/dolomite/results/samples/sample4/trimmed_repaired_R1.fastq.gz, and I suspect something is wrong with it.
Could you inspect it, e.g. by looking at the first entries:
zcat trimmed_repaired_R1.fastq.gz | head -n 100
Feel free to share the resulting lines here if you're not sure what to expect.
You can also run umi-tools manually by activating the conda environment to play around with the parameters:
conda activate /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/conda/bf0643d5
umi_tools whitelist --stdin /projects/rcorbettprj2/TechD/dolomite/results/samples/sample4/trimmed_repaired_R1.fastq.gz --bc-pattern='(?P<cell_1>.{12})(?P<umi_1>.{8})' --extract-method=regex --set-cell-number=1200
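If you want to capture the resulting whitelist in a file for inspection rather than printing it to the terminal, you can add --stdout (the output filename here is just an example):
umi_tools whitelist --stdin /projects/rcorbettprj2/TechD/dolomite/results/samples/sample4/trimmed_repaired_R1.fastq.gz --bc-pattern='(?P<cell_1>.{12})(?P<umi_1>.{8})' --extract-method=regex --set-cell-number=1200 --stdout whitelist_sample4.txt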
Thanks @seb-mueller,
I can confirm that trimmed_repaired_R1.fastq.gz is completely empty. I looked through the logs to work out which upstream command creates these files, and I found this command:
java -ea -Xmx120g -cp /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/conda/3e00e686/opt/bbmap-38.22-1/current/ jgi.SplitPairsAndSingles rp -Xmx120g in=/projects/rcorbettprj2/TechD/dolomite/results/samples/sample1/trimmed_R1.fastq.gz in2=/projects/rcorbettprj2/TechD/dolomite/results/samples/sample1/trimmed_R2.fastq.gz out1=/projects/rcorbettprj2/TechD/dolomite/results/samples/sample1/trimmed_repaired_R1.fastq.gz out2=/projects/rcorbettprj2/TechD/dolomite/results/samples/sample1/trimmed_repaired_R2.fastq.gz repair=t threads=4
The trimmed_R*.fastq.gz files do seem to be created by cutadapt in the early steps with what looks like correct-ish results, but they seem to be gone when the run completes.
I suspect there is a problem with my fastq format, which I've seen before when using bbMap. My fastq entry names have the following format, with both a trailing "/2" in the read name and a leading "2" in the second field:
@HS28_337:3:1101:11073:1999/2 2:N:0:TCCTGAGC
I'll try cleaning up the names a little to see if that helps.
I used these commands:
zcat sample1_R1.fastq.gz | sed 's/\/1 / /' | gzip > sample1_R1.fastq_clean.gz
zcat sample1_R2.fastq.gz | sed 's/\/2 / /' | gzip > sample1_R2.fastq_clean.gz
and am now getting non-empty files and the analysis is progressing to the alignment stage.
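For reference, after this cleanup the read names should no longer carry the "/1" or "/2" suffix; e.g. for the header shown above (a quick check, assuming that read is the first record of the R2 file):
zcat sample1_R2.fastq_clean.gz | head -n 1
# @HS28_337:3:1101:11073:1999 2:N:0:TCCTGAGC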
That's impressive bug-fixing work, Richard! Since you have seen this before (and it might therefore occur for others), this might be material for the troubleshooting section. Do you have a link that elaborates on this problem? Also, bbmap might benefit from handling those files, so we could maybe send them a bug report.
I suspect that not many others are getting read names like ours. I think we have that mix of formats for backwards-compatibility reasons in our pipeline, and I ran into a similar error when using a different bbMap tool last year. From my understanding of the standard fastq formats, we are the outlier. That said, Brian Bushnell over at bbMap might have some text stipulating the fastq formats bbmap can handle.
After fixing the read names my run got really close to the end, but it still appears to have hit an error (let me know if you'd prefer a separate issue for this):
Activating conda environment: /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/conda/c06ad25c
INFO 2020-02-07 22:54:17 SingleCellRnaSeqMetricsCollector Accumulating metrics 10,000,000 records. Elapsed time: 00:01:27s. Time for last 1,000,000: 8s. Last read position: MT:1,930
Attaching package: ‘cowplot’
The following object is masked from ‘package:ggplot2’:
ggsave
INFO 2020-02-07 22:54:26 SingleCellRnaSeqMetricsCollector Accumulating metrics 11,000,000 records. Elapsed time: 00:01:36s. Time for last 1,000,000: 9s. Last read position: 7:928,092
[Fri Feb 07 22:54:27 PST 2020] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 11.92 minutes. Runtime.totalMemory()=27037007872
Error in Read10X(file.path(snakemake@wildcards$results_dir, "summary", :
  Directory provided does not exist
Execution halted
[Fri Feb 7 22:54:33 2020]
Error in rule violine_plots:
    jobid: 31
    output: /projects/rcorbettprj2/TechD/dolomite/results_hg38/plots/violinplots_comparison_UMI.pdf, /projects/rcorbettprj2/TechD/dolomite/results_hg38/plots/UMI_vs_counts.pdf, /projects/rcorbettprj2/TechD/dolomite/results_hg38/plots/UMI_vs_gene.pdf, /projects/rcorbettprj2/TechD/dolomite/results_hg38/plots/Count_vs_gene.pdf, /projects/rcorbettprj2/TechD/dolomite/results_hg38/summary/R_Seurat_objects.rdata
    conda-env: /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/conda/c06ad25c
RuleException:
CalledProcessError in line 47 of /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/rules/merge.smk:
Command 'source /home/rcorbett/miniconda3/bin/activate '/projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/conda/c06ad25c'; set -euo pipefail; Rscript --vanilla /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/scripts/tmpv4agpdro.plot_violine.R' returned non-zero exit status 1.
  File "/projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/rules/merge.smk", line 47, in __rule_violine_plots
  File "/home/rcorbett/miniconda3/envs/dropSeqPipe/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Removing temporary output file /projects/rcorbettprj2/TechD/dolomite/results_hg38/samples/sample4_clean/read/expression.tsv.
[Fri Feb 7 22:54:33 2020]
Finished job 78.
106 of 113 steps (94%) done
INFO 2020-02-07 22:54:35 SingleCellRnaSeqMetricsCollector Accumulating metrics 12,000,000 records. Elapsed time: 00:01:45s. Time for last 1,000,000: 8s. Last read position: 16:1,962,091
[... further SingleCellRnaSeqMetricsCollector progress messages up to 34,000,000 records ...]
INFO 2020-02-07 22:57:25 SingleCellRnaSeqMetricsCollector Adding metrics to file. This may take a while, with no progress messages.
[Fri Feb 07 22:57:26 PST 2020] org.broadinstitute.dropseqrna.barnyard.SingleCellRnaSeqMetricsCollector done. Elapsed time: 14.90 minutes. Runtime.totalMemory()=14865137664
[Fri Feb 7 22:57:30 2020]
Finished job 74.
107 of 113 steps (95%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Indeed, if this drags out we might move this to its own issue (not sure how this is handled elsewhere though). Anyway, the key error seems to be here:
Error in Read10X(file.path(snakemake@wildcards$results_dir, "summary", :
Directory provided does not exist
I haven't had this error before, but I suspect it has to do with how the working directory is set. Normally I call snakemake from the cloned repository (containing the Snakefile) and specify the working directory using --directory <absolute-path>. Have you tried it this way?
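For example, an invocation could look something like this (flags other than --directory are illustrative and depend on your setup):
snakemake --use-conda --cores 8 --directory /absolute/path/to/working-dir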
Also, do you have the following in your config.yaml (especially the results bit)?
LOCAL:
temp-directory: ./tmp
memory: 60g
raw_data: data
results: results
Lastly, it might help to set DEBUG: True, which gives you an R object to inspect. I can elaborate if the error persists.
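If you do, it is just another entry in config.yaml; a minimal sketch (the exact placement among the other keys is an assumption on my part):
DEBUG: True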
I get a similar error running Drop-seq data (Macosko 2015). I downloaded the files from GEO and SRA and did some filtering.
SRR1873277_S1_L001_R1_001.fastq
@AH2HM7BGXX:1:11101:10007:10911
TAACGTACAGGAGGAGTACG
+AH2HM7BGXX:1:11101:10007:10911
AA7AAFAFFFAFFFFFAAAF
@AH2HM7BGXX:1:11101:10008:15270
GCATGAAACTTCCACGGGCC
+AH2HM7BGXX:1:11101:10008:15270
AAAAAFF7FFF.F<FFFFFF
@AH2HM7BGXX:1:11101:10016:7370
CTGGATGTTCAAGGGTCTCA
SRR1873277_S1_L001_R2_001.fastq
@AH2HM7BGXX:1:11101:10007:10911
GTACTAACGTACAGGAAAGGACCTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
<..<7..777)<.FF..FF<F.FFAAAA<.7.777<77.<<7..7..7...
@AH2HM7BGXX:1:11101:10008:15270
GAAGTACCAGGCAGTGACAGCCACCCTGGAGGAGAAGAGGAAAGAGAAAGCCAAGATCCA
+
<AAAAFF.FF.F<FAF7FAFFFFFFFFFFF7AAFFFFFFFAF<AFAFF..FF<F7AFAFF
@AH2HM7BGXX:1:11101:10016:7370
TATTAAAAGAATCCAAATTCAAACT
I get this issue running version 0.5 or the current develop branch (acef6df06836).
My understanding is that DigitalExpression or umi_tools no longer accepts empty barcode whitelists. As a workaround, I generated one by permutation and added it to config.yaml:
echo AAAA{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G} | sed 's/ /\n/g' > dropseq_barcode.txt
FILTER:
barcode-whitelist: 'dropseq_barcode.txt'
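For reference, that brace expansion produces every possible 12-mer after the fixed AAAA prefix, i.e. 4^12 = 16,777,216 barcodes; a quick check on the generated file:
wc -l dropseq_barcode.txt
# expect 16777216 dropseq_barcode.txt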
This hangs on the extend_barcode_whitelist rule, presumably because the number of barcodes to compute mismatches for is too large.
After seeing this, I think the FASTQ files may be the problem: I seem to be missing the second field from the headers. Is this required?
In my case, I was able to fix it by adding the second field to the headers:
sed -i "1~4s/$/ 1:N:0:0/g" SRR1873277_S1_L001_R[12]_001.fastq
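For the first record shown above, this turns the header
@AH2HM7BGXX:1:11101:10007:10911
into
@AH2HM7BGXX:1:11101:10007:10911 1:N:0:0
(the sequence and quality lines are untouched).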
With this, get_cell_whitelist, extend_barcode_top, and repair_barcodes were able to run without specifying a barcode whitelist (i.e. with barcode-whitelist: '').
I hope this helps anyone running into similar problems, but it might also be worth checking for this problem in the pipeline (or removing the requirement for compatible headers if they are not needed).
Update
This gives a "Found 0 cell barcodes in file" error.
19 of 35 steps (54%) done
[Fri Jun 12 18:29:16 2020]
rule extract_umi_expression:
input: /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/final.bam, /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/barcodes.csv
output: /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long, /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv
jobid: 41
wildcards: results_dir=/home/tom/cellranger_convert/test-dropseqpipe-dev, sample=SRR1873277_S1_L001
export _JAVA_OPTIONS=-Djava.io.tmpdir=./tmp && DigitalExpression -m 10g I=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/final.bam O=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv EDIT_DISTANCE=1 OUTPUT_LONG_FORMAT=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long STRAND_STRATEGY=SENSE OUTPUT_READS_INSTEAD=false LOCUS_FUNCTION_LIST=null LOCUS_FUNCTION_LIST={CODING,UTR} MIN_BC_READ_THRESHOLD=0 CELL_BC_FILE=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/barcodes.csv
Activating conda environment: /home/tom/local/bin/dropSeqPipe-0.6/.snakemake/conda/f7468c03
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=./tmp
INFO 2020-06-12 18:29:19 DigitalExpression
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** DigitalExpression -I /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/final.bam -O /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv -EDIT_DISTANCE 1 -OUTPUT_LONG_FORMAT /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long -STRAND_STRATEGY SENSE -OUTPUT_READS_INSTEAD false -LOCUS_FUNCTION_LIST null -LOCUS_FUNCTION_LIST CODING -LOCUS_FUNCTION_LIST UTR -MIN_BC_READ_THRESHOLD 0 -CELL_BC_FILE /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/barcodes.csv
**********
18:29:19.810 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/tom/local/bin/dropSeqPipe-0.6/.snakemake/conda/f7468c03/share/dropseq_tools-2.0.0-0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Jun 12 18:29:19 JST 2020] DigitalExpression OUTPUT_READS_INSTEAD=false OUTPUT=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv OUTPUT_LONG_FORMAT=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long INPUT=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/final.bam EDIT_DISTANCE=1 MIN_BC_READ_THRESHOLD=0 CELL_BC_FILE=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/barcodes.csv STRAND_STRATEGY=SENSE LOCUS_FUNCTION_LIST=[CODING, UTR] CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM READ_MQ=10 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Jun 12 18:29:19 JST 2020] Executing as tom@d4-lm3 on Linux 4.15.0-45-generic amd64; OpenJDK 64-Bit Server VM 11.0.1+13-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.0.0(1ef3a59_1539205128)
INFO 2020-06-12 18:29:19 BarcodeListRetrieval *Found 0 cell barcodes in file*
ERROR 2020-06-12 18:29:19 DigitalExpression Running digital expression without somehow selecting a set of barcodes to process no longer supported.
[Fri Jun 12 18:29:19 JST 2020] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
[Fri Jun 12 18:29:19 2020]
Error in rule extract_umi_expression:
jobid: 41
output: /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long, /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv
conda-env: /home/tom/local/bin/dropSeqPipe-0.6/.snakemake/conda/f7468c03
shell:
export _JAVA_OPTIONS=-Djava.io.tmpdir=./tmp && DigitalExpression -m 10g I=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/final.bam O=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv EDIT_DISTANCE=1 OUTPUT_LONG_FORMAT=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long STRAND_STRATEGY=SENSE OUTPUT_READS_INSTEAD=false LOCUS_FUNCTION_LIST=null LOCUS_FUNCTION_LIST={CODING,UTR} MIN_BC_READ_THRESHOLD=0 CELL_BC_FILE=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/barcodes.csv
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
I'm guessing this is because the barcodes are parsed from the header rather than from the first 8 bases of R1. In this case, the command to fix the files is:
perl -0p -i -e "s/@(.*)\n(.{8})/@\1 1:N:0:\2\n\2/g" SRR1873277_S1_L001_R1_001.fastq
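On the first R1 record shown above, this substitution copies the first 8 bases of the read into the header's index field, i.e. it rewrites
@AH2HM7BGXX:1:11101:10007:10911
TAACGTACAGGAGGAGTACG
into
@AH2HM7BGXX:1:11101:10007:10911 1:N:0:TAACGTAC
TAACGTACAGGAGGAGTACG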
Update 2
I can confirm this corrects the input format and runs to completion on the develop branch. I ran into some issues with "rule repair" for version 0.4 (#72) and with "violine_plot" for 0.5 (#91, #94), but I think these issues have already been discussed elsewhere. Version 0.5 seems to call the R instance from the PATH, which has Seurat 3 installed, instead of the conda version with Seurat 2. If anyone runs into this issue with dropSeqPipe version 0.5, a workaround is to add an older R version to the PATH that supports Seurat v2 (R 3.4.0 < version < 3.6.0 seems to work).
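As a rough sketch of that workaround (the install location is hypothetical; any R in that version range with Seurat v2 installed should do):
export PATH=/path/to/R-3.5.1/bin:$PATH   # put the older R first on the PATH
Rscript --version                        # should now report the 3.5.x build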
Hi folks,
I'm trying dropSeqPipe for the first time on some test Nadia data we generated. I am using the example Nadia template config, and I have 4 samples, each with paired-end 125bp reads.
When I start snakemake it seems that fastqc completes successfully, and most of the cutadapt jobs are successful, but things go awry when my first bbmap/whitelist/umi_tools commands are run.
I'll try running a single sample to see if I can isolate the error, but I'd be interested to hear if you have any advice.
Thanks, Richard
Here's a trace...