Error on split_species rule for shipped test data

seb-mueller commented 6 years ago

Not sure if this is due to my setup, but I'm getting regularly an error on the split_species rule.

To make it reproducible, I've tried it on the shipped data in .test:

Running the following in the .test folder of a freshly cloned (including submodules) repository:

~/code/dropSeqPipe/.test  (master *)$ snakemake --use-conda --cores 3 --snakefile ../Snakefile split_species

Gives the following

...
Activating conda environment: /home/sm934/code/dropSeqPipe/.test/.snakemake/conda/f358d867
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/SPECIES_TWO/sample2_dge.summary.txt OUTPUT=summary/SPECIES_TWO/sample2_unfiltered_umi_expression_matrix.tsv INPUT=data/SPECIES_TWO/sample2_unfiltered.bam EDIT_DISTANCE=1 MIN_BC_READ_THRESHOLD=0 NUM_CORE_BARCODES=100    OUTPUT_READS_INSTEAD=false CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS READ_MQ=10 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Aug 14 13:01:43 BST 2018] Executing as sm934@genesilencing56 on Linux 4.15.0-30-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/SPECIES_TWO/sample1_dge.summary.txt OUTPUT=summary/SPECIES_TWO/sample1_unfiltered_umi_expression_matrix.tsv INPUT=data/SPECIES_TWO/sample1_unfiltered.bam EDIT_DISTANCE=1 MIN_BC_READ_THRESHOLD=0 NUM_CORE_BARCODES=100    OUTPUT_READS_INSTEAD=false CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS READ_MQ=10 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/SPECIES_ONE/sample2_dge.summary.txt OUTPUT=summary/SPECIES_ONE/sample2_unfiltered_umi_expression_matrix.tsv INPUT=data/SPECIES_ONE/sample2_unfiltered.bam EDIT_DISTANCE=1 MIN_BC_READ_THRESHOLD=0 NUM_CORE_BARCODES=100    OUTPUT_READS_INSTEAD=false CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS READ_MQ=10 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Aug 14 13:01:43 BST 2018] Executing as sm934@genesilencing56 on Linux 4.15.0-30-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
[Tue Aug 14 13:01:43 BST 2018] Executing as sm934@genesilencing56 on Linux 4.15.0-30-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Looking for the top 100 cell barcodes
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Selected 0 core barcodes
ERROR   2018-08-14 13:01:43     DigitalExpression       Running digital expression without somehow selecting a set of barcodes to process no longer supported.
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=504889344
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Looking for the top 100 cell barcodes
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Selected 0 core barcodes
ERROR   2018-08-14 13:01:43     DigitalExpression       Running digital expression without somehow selecting a set of barcodes to process no longer supported.
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=504889344
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Looking for the top 100 cell barcodes
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Selected 0 core barcodes
ERROR   2018-08-14 13:01:43     DigitalExpression       Running digital expression without somehow selecting a set of barcodes to process no longer supported.
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=504889344
    [Tue Aug 14 13:01:43 2018]
    Error in rule extract_all_umi_expression_species:
        jobid: 8
    [Tue Aug 14 13:01:43 2018]
    [Tue Aug 14 13:01:43 2018]
        output: summary/SPECIES_TWO/sample1_unfiltered_umi_expression_matrix.tsv, summary/SPECIES_TWO/sample1_dge.summary.txt
    Error in rule extract_all_umi_expression_species:
    Error in rule extract_all_umi_expression_species:
        conda-env: /home/sm934/code/dropSeqPipe/.test/.snakemake/conda/f358d867
        jobid: 9
        jobid: 10

        output: summary/SPECIES_ONE/sample2_unfiltered_umi_expression_matrix.tsv, summary/SPECIES_ONE/sample2_dge.summary.txt
        output: summary/SPECIES_TWO/sample2_unfiltered_umi_expression_matrix.tsv, summary/SPECIES_TWO/sample2_dge.summary.txt
        conda-env: /home/sm934/code/dropSeqPipe/.test/.snakemake/conda/f358d867
        conda-env: /home/sm934/code/dropSeqPipe/.test/.snakemake/conda/f358d867
...

Does this work for you? Note, the map and filter rules etc. work just fine.

I suspect this is due to the absence of the second species in the reference files (only chr19). However the data-set seems a mixed_species.

I guess the reference has to be adapted to make the split_species rule testable. Thoughts?

Hoohm commented 6 years ago

Yeah that's probably the problem. I thought of adding other combination of test/refs to the repo. The idea would be to run all possibilities to test the pipeline properly for all cases.

Simple run
Double species
Simple run with barcode whitelist
Double reference with barcode whitelist

seb-mueller commented 6 years ago

Agreed. I suspect it is easy to modify the reference genome to account for 2 species. Either append chr19 from mouse (plus renaming), or just split it in half with the second half being "mouse" genome to save space (but loose scientific accuracy which is fine for technical testing)

Hoohm commented 6 years ago

I think it's a bit more complicated because of the gtf file.

Maybe we should start with a dual reference and use only this for all cases.

Hoohm commented 5 years ago

Working on automating the download and metada generation completely with the new version. Will deal with double species cases

Hoohm commented 5 years ago

With the new version it works as expected. Although I'm not testing this on Travis but only locally. The test data is now actually a sample of a mixed human mouse.

We can implement multiple tests if we want but I'm not sure it makes sense since the mixed experiments are mostly used to test the doublet rates of a platform. I'm not sure people will make PRs for the mixed and that it needs to be tested online.

What do you think?

Hoohm / dropSeqPipe

Error on split_species rule for shipped test data #44