RichardCorbett opened this issue 4 years ago
You seem to have set it up and run it the right way; for some reason, however, umi-tools thinks there are no barcodes in your sample:
ValueError: --set-cell-barcode option specifies more cell barcodes than the number of observed cell barcodes
0 cell barcodes observed from 100000000 parsed reads
umi-tools estimates the number of barcodes from this file: /projects/rcorbettprj2/TechD/dolomite/results/samples/sample4/trimmed_repaired_R1.fastq.gz, and I suspect something is wrong with it.
Could you inspect it, e.g. by looking at the first entries:
zcat trimmed_repaired_R1.fastq.gz | head -n 100
Feel free to share the resulting lines here if you're not sure what to expect.
You can also run umi-tools manually by activating the conda environment to play around with the parameters:
conda activate /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/conda/bf0643d5
umi_tools whitelist --stdin /projects/rcorbettprj2/TechD/dolomite/results/samples/sample4/trimmed_repaired_R1.fastq.gz --bc-pattern='(?P<cell_1>.{12})(?P<umi_1>.{8})' --extract-method=regex --set-cell-number=1200
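If you want to capture the resulting whitelist in a file for inspection rather than printing it to the terminal, you can add --stdout (the output filename here is just an example):
umi_tools whitelist --stdin /projects/rcorbettprj2/TechD/dolomite/results/samples/sample4/trimmed_repaired_R1.fastq.gz --bc-pattern='(?P<cell_1>.{12})(?P<umi_1>.{8})' --extract-method=regex --set-cell-number=1200 --stdout whitelist_sample4.txt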
Thanks @seb-mueller,
I can confirm that trimmed_repaired_R1.fastq.gz is completely empty. I looked through the logs to work out which upstream command creates these files, and I found this command:
java -ea -Xmx120g -cp /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/conda/3e00e686/opt/bbmap-38.22-1/current/ jgi.SplitPairsAndSingles rp -Xmx120g in=/projects/rcorbettprj2/TechD/dolomite/results/samples/sample1/trimmed_R1.fastq.gz in2=/projects/rcorbettprj2/TechD/dolomite/results/samples/sample1/trimmed_R2.fastq.gz out1=/projects/rcorbettprj2/TechD/dolomite/results/samples/sample1/trimmed_repaired_R1.fastq.gz out2=/projects/rcorbettprj2/TechD/dolomite/results/samples/sample1/trimmed_repaired_R2.fastq.gz repair=t threads=4
The trimmed_R*.fastq.gz files do seem to be created by cutadapt in the early steps with what looks like correct-ish results, but they seem to be gone when the run completes.
I suspect there is a problem with my fastq format, which I've seen before when using bbMap. My fastq entry names have the following format, with both a trailing "/2" in the read name and a leading "2" in the second field:
@HS28_337:3:1101:11073:1999/2 2:N:0:TCCTGAGC
I'll try cleaning up the names a little to see if that helps.
I used these commands:
zcat sample1_R1.fastq.gz | sed 's/\/1 / /' | gzip > sample1_R1.fastq_clean.gz
zcat sample1_R2.fastq.gz | sed 's/\/2 / /' | gzip > sample1_R2.fastq_clean.gz
and am now getting non-empty files and the analysis is progressing to the alignment stage.
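For reference, after this cleanup the read names should no longer carry the "/1" or "/2" suffix; e.g. for the header shown above (a quick check, assuming that read is the first record of the R2 file):
zcat sample1_R2.fastq_clean.gz | head -n 1
# @HS28_337:3:1101:11073:1999 2:N:0:TCCTGAGC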
That's impressive bug-fixing work, Richard! Since you have seen this before (and it might therefore occur for others), this might be material for the troubleshooting section. Do you have a link that elaborates on this problem? Also, bbmap might benefit from handling those files, so we could maybe send them a bug report.
I suspect that not many others are getting read names like ours. I think we have that mix of formats for backwards-compatibility reasons in our pipeline, and I ran into a similar error when using a different bbMap tool last year. From my understanding of the standard fastq formats, we are the outlier. That said, Brian Bushnell over at bbMap might have some text stipulating the fastq formats bbmap can handle.
After fixing the read names my run got really close to the end, but it still appears to have hit an error (let me know if you'd prefer a separate issue for this):
Activating conda environment: /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/conda/c06ad25c
INFO 2020-02-07 22:54:17 SingleCellRnaSeqMetricsCollector Accumulating metrics 10,000,000 records. Elapsed time: 00:01:27s. Time for last 1,000,000: 8s. Last read position: MT:1,930
Attaching package: ‘cowplot’
The following object is masked from ‘package:ggplot2’:
ggsave
INFO 2020-02-07 22:54:26 SingleCellRnaSeqMetricsCollector Accumulating metrics 11,000,000 records. Elapsed time: 00:01:36s. Time for last 1,000,000: 9s. Last read position: 7:928,092
[Fri Feb 07 22:54:27 PST 2020] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 11.92 minutes. Runtime.totalMemory()=27037007872
Error in Read10X(file.path(snakemake@wildcards$results_dir, "summary", :
  Directory provided does not exist
Execution halted
[Fri Feb 7 22:54:33 2020]
Error in rule violine_plots:
    jobid: 31
    output: /projects/rcorbettprj2/TechD/dolomite/results_hg38/plots/violinplots_comparison_UMI.pdf, /projects/rcorbettprj2/TechD/dolomite/results_hg38/plots/UMI_vs_counts.pdf, /projects/rcorbettprj2/TechD/dolomite/results_hg38/plots/UMI_vs_gene.pdf, /projects/rcorbettprj2/TechD/dolomite/results_hg38/plots/Count_vs_gene.pdf, /projects/rcorbettprj2/TechD/dolomite/results_hg38/summary/R_Seurat_objects.rdata
    conda-env: /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/conda/c06ad25c
RuleException:
CalledProcessError in line 47 of /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/rules/merge.smk:
Command 'source /home/rcorbett/miniconda3/bin/activate '/projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/conda/c06ad25c'; set -euo pipefail; Rscript --vanilla /projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/.snakemake/scripts/tmpv4agpdro.plot_violine.R' returned non-zero exit status 1.
  File "/projects/rcorbettprj2/TechD/dolomite/dropSeqPipe/rules/merge.smk", line 47, in __rule_violine_plots
  File "/home/rcorbett/miniconda3/envs/dropSeqPipe/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Removing temporary output file /projects/rcorbettprj2/TechD/dolomite/results_hg38/samples/sample4_clean/read/expression.tsv.
[Fri Feb 7 22:54:33 2020]
Finished job 78.
106 of 113 steps (94%) done
INFO 2020-02-07 22:54:35 SingleCellRnaSeqMetricsCollector Accumulating metrics 12,000,000 records. Elapsed time: 00:01:45s. Time for last 1,000,000: 8s. Last read position: 16:1,962,091
[... further SingleCellRnaSeqMetricsCollector progress messages up to 34,000,000 records ...]
INFO 2020-02-07 22:57:25 SingleCellRnaSeqMetricsCollector Adding metrics to file. This may take a while, with no progress messages.
[Fri Feb 07 22:57:26 PST 2020] org.broadinstitute.dropseqrna.barnyard.SingleCellRnaSeqMetricsCollector done. Elapsed time: 14.90 minutes. Runtime.totalMemory()=14865137664
[Fri Feb 7 22:57:30 2020]
Finished job 74.
107 of 113 steps (95%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Indeed, if this drags out we might move this to its own issue (not sure how this is handled elsewhere though). Anyway, the key error seems to be here:
Error in Read10X(file.path(snakemake@wildcards$results_dir, "summary", :
Directory provided does not exist
I haven't had this error before, but I suspect it has to do with how the working directory is set. Normally I call snakemake from the cloned repository (containing the Snakefile) and specify the working directory using --directory <absolute-path>. Have you tried it this way?
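For example, an invocation could look something like this (flags other than --directory are illustrative and depend on your setup):
snakemake --use-conda --cores 8 --directory /absolute/path/to/working-dir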
Also, do you have the following in your config.yaml (especially the results bit)?
LOCAL:
temp-directory: ./tmp
memory: 60g
raw_data: data
results: results
Lastly, it might help to set DEBUG: True, which gives you an R object to inspect. I can elaborate if the error persists.
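If you do, it is just another entry in config.yaml; a minimal sketch (the exact placement among the other keys is an assumption on my part):
DEBUG: True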
I get a similar error running Drop-seq data (Macosko 2015). I downloaded the files from GEO and SRA and did some filtering.
SRR1873277_S1_L001_R1_001.fastq
@AH2HM7BGXX:1:11101:10007:10911
TAACGTACAGGAGGAGTACG
+AH2HM7BGXX:1:11101:10007:10911
AA7AAFAFFFAFFFFFAAAF
@AH2HM7BGXX:1:11101:10008:15270
GCATGAAACTTCCACGGGCC
+AH2HM7BGXX:1:11101:10008:15270
AAAAAFF7FFF.F<FFFFFF
@AH2HM7BGXX:1:11101:10016:7370
CTGGATGTTCAAGGGTCTCA
SRR1873277_S1_L001_R2_001.fastq
@AH2HM7BGXX:1:11101:10007:10911
GTACTAACGTACAGGAAAGGACCTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
<..<7..777)<.FF..FF<F.FFAAAA<.7.777<77.<<7..7..7...
@AH2HM7BGXX:1:11101:10008:15270
GAAGTACCAGGCAGTGACAGCCACCCTGGAGGAGAAGAGGAAAGAGAAAGCCAAGATCCA
+
<AAAAFF.FF.F<FAF7FAFFFFFFFFFFF7AAFFFFFFFAF<AFAFF..FF<F7AFAFF
@AH2HM7BGXX:1:11101:10016:7370
TATTAAAAGAATCCAAATTCAAACT
I get this issue running version 0.5 or the current develop branch (acef6df06836).
My understanding is that DigitalExpression or umi_tools no longer accepts empty barcode whitelists. As a workaround, I generated one by permutation and added it to config.yaml:
echo AAAA{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G}{A,T,C,G} | sed 's/ /\n/g' > dropseq_barcode.txt
FILTER:
barcode-whitelist: 'dropseq_barcode.txt'
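For reference, that brace expansion produces every possible 12-mer after the fixed AAAA prefix, i.e. 4^12 = 16,777,216 barcodes; a quick check on the generated file:
wc -l dropseq_barcode.txt
# expect 16777216 dropseq_barcode.txt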
This hangs on the extend_barcode_whitelist rule, presumably because the number of barcodes to compute mismatches for is too large.
After seeing this, I think the FASTQ files may be the problem: I seem to be missing the second field from the headers. Is this required?
In my case, I was able to fix it by adding the second field to the headers:
sed -i "1~4s/$/ 1:N:0:0/g" SRR1873277_S1_L001_R[12]_001.fastq
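For the first record shown above, this turns the header
@AH2HM7BGXX:1:11101:10007:10911
into
@AH2HM7BGXX:1:11101:10007:10911 1:N:0:0
(the sequence and quality lines are untouched).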
With this, get_cell_whitelist, extend_barcode_top, and repair_barcodes were able to run without specifying a barcode whitelist (i.e. with barcode-whitelist: '').
I hope this helps anyone running into similar problems, but it might also be worth checking for this problem in the pipeline (or removing the requirement for compatible headers if they are not needed).
Update
This gives a "Found 0 cell barcodes in file" error.
19 of 35 steps (54%) done
[Fri Jun 12 18:29:16 2020]
rule extract_umi_expression:
input: /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/final.bam, /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/barcodes.csv
output: /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long, /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv
jobid: 41
wildcards: results_dir=/home/tom/cellranger_convert/test-dropseqpipe-dev, sample=SRR1873277_S1_L001
export _JAVA_OPTIONS=-Djava.io.tmpdir=./tmp && DigitalExpression -m 10g I=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/final.bam O=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv EDIT_DISTANCE=1 OUTPUT_LONG_FORMAT=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long STRAND_STRATEGY=SENSE OUTPUT_READS_INSTEAD=false LOCUS_FUNCTION_LIST=null LOCUS_FUNCTION_LIST={CODING,UTR} MIN_BC_READ_THRESHOLD=0 CELL_BC_FILE=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/barcodes.csv
Activating conda environment: /home/tom/local/bin/dropSeqPipe-0.6/.snakemake/conda/f7468c03
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=./tmp
INFO 2020-06-12 18:29:19 DigitalExpression
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** DigitalExpression -I /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/final.bam -O /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv -EDIT_DISTANCE 1 -OUTPUT_LONG_FORMAT /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long -STRAND_STRATEGY SENSE -OUTPUT_READS_INSTEAD false -LOCUS_FUNCTION_LIST null -LOCUS_FUNCTION_LIST CODING -LOCUS_FUNCTION_LIST UTR -MIN_BC_READ_THRESHOLD 0 -CELL_BC_FILE /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/barcodes.csv
**********
18:29:19.810 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/tom/local/bin/dropSeqPipe-0.6/.snakemake/conda/f7468c03/share/dropseq_tools-2.0.0-0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Jun 12 18:29:19 JST 2020] DigitalExpression OUTPUT_READS_INSTEAD=false OUTPUT=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv OUTPUT_LONG_FORMAT=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long INPUT=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/final.bam EDIT_DISTANCE=1 MIN_BC_READ_THRESHOLD=0 CELL_BC_FILE=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/barcodes.csv STRAND_STRATEGY=SENSE LOCUS_FUNCTION_LIST=[CODING, UTR] CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM READ_MQ=10 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Jun 12 18:29:19 JST 2020] Executing as tom@d4-lm3 on Linux 4.15.0-45-generic amd64; OpenJDK 64-Bit Server VM 11.0.1+13-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.0.0(1ef3a59_1539205128)
INFO 2020-06-12 18:29:19 BarcodeListRetrieval *Found 0 cell barcodes in file*
ERROR 2020-06-12 18:29:19 DigitalExpression Running digital expression without somehow selecting a set of barcodes to process no longer supported.
[Fri Jun 12 18:29:19 JST 2020] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
[Fri Jun 12 18:29:19 2020]
Error in rule extract_umi_expression:
jobid: 41
output: /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long, /home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv
conda-env: /home/tom/local/bin/dropSeqPipe-0.6/.snakemake/conda/f7468c03
shell:
export _JAVA_OPTIONS=-Djava.io.tmpdir=./tmp && DigitalExpression -m 10g I=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/final.bam O=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.tsv EDIT_DISTANCE=1 OUTPUT_LONG_FORMAT=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/umi/expression.long STRAND_STRATEGY=SENSE OUTPUT_READS_INSTEAD=false LOCUS_FUNCTION_LIST=null LOCUS_FUNCTION_LIST={CODING,UTR} MIN_BC_READ_THRESHOLD=0 CELL_BC_FILE=/home/tom/cellranger_convert/test-dropseqpipe-dev/samples/SRR1873277_S1_L001/barcodes.csv
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
I'm guessing this is because the barcodes are parsed from the header rather than from the first 8 bases of R1. In this case, the command to fix the files is:
perl -0p -i -e "s/@(.*)\n(.{8})/@\1 1:N:0:\2\n\2/g" SRR1873277_S1_L001_R1_001.fastq
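On the first R1 record shown above, this substitution copies the first 8 bases of the read into the header's index field, i.e. it rewrites
@AH2HM7BGXX:1:11101:10007:10911
TAACGTACAGGAGGAGTACG
into
@AH2HM7BGXX:1:11101:10007:10911 1:N:0:TAACGTAC
TAACGTACAGGAGGAGTACG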
Update 2
I can confirm this corrects the input format and runs to completion on the develop branch. I ran into some issues with "rule repair" for version 0.4 (#72) and with "violine_plot" for 0.5 (#91, #94), but I think these issues have already been discussed elsewhere. Version 0.5 seems to call the R instance from the PATH, which has Seurat 3 installed, instead of the conda version with Seurat 2. If anyone runs into this issue with dropSeqPipe version 0.5, a workaround is to add an older R version to the PATH that supports Seurat v2 (R 3.4.0 < version < 3.6.0 seems to work).
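As a rough sketch of that workaround (the install location is hypothetical; any R in that version range with Seurat v2 installed should do):
export PATH=/path/to/R-3.5.1/bin:$PATH   # put the older R first on the PATH
Rscript --version                        # should now report the 3.5.x build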
Hi folks,
I'm trying dropSeqPipe for the first time on some test Nadia data we generated. I am using the example Nadia template config, and I have 4 samples, each with paired-end 125bp reads.
When I start snakemake it seems that fastqc completes successfully, and most of the cutadapt jobs are successful, but things go awry when my first bbmap/whitelist/umi_tools commands are run.
I'll try running a single sample to see if I can isolate the error, but I'd be interested to hear if you have any advice.
Thanks, Richard
Here's a trace...