Rerun with changed expected_cells in samples.csv requires remapping

seb-mueller commented 5 years ago

I often need to change the expected_cells in samples.csv and want snakemake to update all result depending on it, but no more than that.

This seems common and I think we need a clean solution for this. It has been raised already in issue #28, but since the pipeline has changed and I think that solution is not valid anymore, I'd rather have a separate issue.

To approach this systematically, I first tried to find out when expected_cells is read by rip-greping the code base:

rg -B 10 -A 3 "wildcards.sample,'expected_cells'"

Results (Note, the lines with hits have a green highlighted number with colon):

rules/extract_expression_single.smk
7-
8-rule extract_umi_expression:
9-    input:
10-        data='{results_dir}/samples/{sample}/final.bam',
11-        barcode_whitelist='{results_dir}/samples/{sample}/barcodes.csv'
12-    output:
13-        long='{results_dir}/samples/{sample}/umi/expression.long',
14-        dense=temp('{results_dir}/samples/{sample}/umi/expression.tsv')
15-    params:
16-        count_per_umi=config['EXTRACTION']['minimum-counts-per-UMI'],
17:        num_cells=lambda wildcards: int(samples.loc[wildcards.sample,'expected_cells']),
18-        umiBarcodeEditDistance=config['EXTRACTION']['UMI-edit-distance'],
19-        temp_directory=config['LOCAL']['temp-directory'],
20-        memory=config['LOCAL']['memory'],
--
35-
36-rule extract_reads_expression:
37-    input:
38-        data='{results_dir}/samples/{sample}/final.bam',
39-        barcode_whitelist='{results_dir}/samples/{sample}/barcodes.csv'
40-    output:
41-        long='{results_dir}/samples/{sample}/read/expression.long',
42-        dense=temp('{results_dir}/samples/{sample}/read/expression.tsv')
43-    params:
44-        count_per_umi=config['EXTRACTION']['minimum-counts-per-UMI'],
45:        num_cells=lambda wildcards: int(samples.loc[wildcards.sample,'expected_cells']),
46-        umiBarcodeEditDistance=config['EXTRACTION']['UMI-edit-distance'],
47-        temp_directory=config['LOCAL']['temp-directory'],
48-        memory=config['LOCAL']['memory'],

rules/split_species.smk
47-
48-
49-rule plot_barnyard:
50-    input:
51-        expand('{results_dir}/samples/{{sample}}/{species}/dge.summary.txt',species=config['META']['species'], results_dir=results_dir)
52-    output: 
53-        genes_pdf='{results_dir}/plots/barnyard/{sample}_genes.pdf',
54-        transcripts_pdf='{results_dir}/plots/barnyard/{sample}_transcripts.pdf',
55-        barcodes_species=expand('{{results_dir}}/samples/{{sample}}/{species}/barcodes.csv', species=species_list)
56-    params:
57:        expected_cells=lambda wildcards: int(samples.loc[wildcards.sample,'expected_cells'])
58-    script: 
59-        '../scripts/plot_species_plot.R'

rules/extract_expression_species.smk
7-
8-rule extract_umi_expression_species:
9-    input:
10-        data='{results_dir}/samples/{sample}/{species}/unfiltered.bam',
11-        barcode_whitelist='{results_dir}/samples/{sample}/{species}/barcodes.csv'
12-    output:
13-        dense=temp('{results_dir}/samples/{sample}/{species}/umi/expression.txt'),
14-        long='{results_dir}/samples/{sample}/{species}/umi/expression.long'
15-    params:
16-        count_per_umi=config['EXTRACTION']['minimum-counts-per-UMI'],
17:        num_cells=lambda wildcards: int(samples.loc[wildcards.sample,'expected_cells']),
18-        umiBarcodeEditDistance=config['EXTRACTION']['UMI-edit-distance'],
19-        temp_directory=config['LOCAL']['temp-directory'],
20-        memory=config['LOCAL']['memory'],
--
33-        MIN_BC_READ_THRESHOLD={params.count_per_umi}\
34-        CELL_BC_FILE={input.barcode_whitelist}"""
35-
36-
37-rule extract_reads_expression_species:
38-    input:
39-        data='{results_dir}/samples/{sample}/{species}/unfiltered.bam',
40-        barcode_whitelist='{results_dir}/samples/{sample}/{species}/barcodes.csv'
41-    params:
42-        count_per_umi=config['EXTRACTION']['minimum-counts-per-UMI'],
43:        num_cells=lambda wildcards: int(samples.loc[wildcards.sample,'expected_cells']),
44-        umiBarcodeEditDistance=config['EXTRACTION']['UMI-edit-distance'],
45-        temp_directory=config['LOCAL']['temp-directory'],
46-        memory=config['LOCAL']['memory'],
--
82-            ref_path=ref_path,
83-            release=release,
84-            species=species,
85-            build=build),
86-        rRNA_intervals=expand("{ref_path}/{species}_{build}_{release}/annotation.rRNA.intervals",
87-            ref_path=ref_path,
88-            release=release,
89-            build=build,
90-            species=species)
91-    params:
92:        cells=lambda wildcards: int(samples.loc[wildcards.sample,'expected_cells']),        
93-        memory=config['LOCAL']['memory'],
94-        temp_directory=config['LOCAL']['temp-directory']
95-    output:

rules/cell_barcodes.smk
21-
22-rule get_top_barcodes:
23-    input:
24-        '{results_dir}/samples/{sample}/trimmmed_repaired_R1.fastq.gz'
25-    output:
26-        '{results_dir}/samples/{sample}/top_barcodes.csv'
27-    conda: '../envs/umi_tools.yaml'
28-    params:
29-        cell_barcode_length=(config['FILTER']['cell-barcode']['end'] - config['FILTER']['cell-barcode']['start'] + 1),
30-        umi_barcode_length=(config['FILTER']['UMI-barcode']['end'] - config['FILTER']['UMI-barcode']['start'] + 1),
31:        num_cells=lambda wildcards: round(int(samples.loc[wildcards.sample,'expected_cells'])*1.2),
32-    shell:
33-        """umi_tools whitelist\
34-        --stdin {input}\

rules/map.smk
157-        """
158-
159-rule bead_errors_metrics:
160-    input:
161-        '{results_dir}/samples/{sample}/gene_exon_tagged_bead_sub.bam'
162-    output:
163-        '{results_dir}/samples/{sample}/final.bam'
164-    params:
165-        out_stats='{results_dir}/logs/dropseq_tools/{sample}_synthesis_stats.txt',
166-        summary='{results_dir}/logs/dropseq_tools/{sample}_synthesis_stats_summary.txt',
167:        barcodes=lambda wildcards: int(samples.loc[wildcards.sample,'expected_cells']) * 2,
168-        memory =config['LOCAL']['memory'],
169-        SmartAdapter=config['FILTER']['5-prime-smart-adapter'],
170-        temp_directory=config['LOCAL']['temp-directory']
--
216-        pdf='{results_dir}/plots/yield.pdf'
217-    script:
218-        '../scripts/plot_yield.R'
219-
220-
221-rule plot_knee_plot:
222-    input:
223-        data='{results_dir}/logs/dropseq_tools/{sample}_hist_out_cell.txt',
224-        barcodes='{results_dir}/samples/{sample}/barcodes.csv'
225-    params:
226:        cells=lambda wildcards: int(samples.loc[wildcards.sample,'expected_cells'])
227-    conda: '../envs/plots.yaml'
228-    output:
229-        pdf='{results_dir}/plots/knee_plots/{sample}_knee_plot.pdf'

Checking the DAG, I found that rule get_top_barcodes was the highest up, so figured if I delete it's output, this should be rerun and consequently everything down the line.

https://github.com/Hoohm/dropSeqPipe/blob/622be9464bf141d0b6803f85178e653ecca5e3ea/rules/cell_barcodes.smk#L31

This in fact worked well by deleting '{results_dir}/samples/{sample}/top_barcodes.csv' and rerunning snakemake :

snakemake --use-conda -rp --conda-prefix ~/.conda/myevns --cores 32 --directory ~/analysis/dropseq/data/project

Note, at this point I've modified the Snakefile to have '{results_dir}/samples/{sample}/top_barcodes.csv' in the target rules!

Job counts:
        count   jobs
        1       DetectBeadSubstitutionErrors
        1       MergeBamAlignment
        1       STAR_align
        2       SingleCellRnaSeqMetricsCollector_species
        1       TagReadWithGeneExon
        1       all
        1       bam_hist
        1       bead_errors_metrics
        4       convert_long_to_mtx_species
        1       extend_barcode_top
        2       extract_all_umi_expression
        2       extract_reads_expression_species
        2       extract_umi_expression_species
        1       get_cell_whitelist
        1       get_top_barcodes
        1       multiqc_star
        1       pigz_unmapped
        1       plot_barnyard
        1       plot_knee_plot
        2       plot_rna_metrics_species
        1       plot_yield
        1       repair_barcodes
        2       split_bam_species
        32

But STAR-mapping seems to be triggered again! Surely that's a wasteful way to do it. I worked out this is because of the repair_barcodes rule: https://github.com/Hoohm/dropSeqPipe/blob/622be9464bf141d0b6803f85178e653ecca5e3ea/rules/cell_barcodes.smk#L60 which relies on expected_cells, barcode information as well but also on temp('{results_dir}/samples/{sample}/Aligned.repaired.bam') which is only generated temporary! That's why the mapping gets probably kicked of to regenerated.

To overcome this, I've created a branch where this should be fixed by taking out the temp bit, I'm testing at the moment and get back. Hope I didn't miss anything really simple :)

Note

seb-mueller commented 5 years ago

Addition: I've created a view of the DAG and circled the rules that depend the expected_cells parameter:

dag

seb-mueller commented 5 years ago

Added this branch which seems to fix the issue: https://github.com/Hoohm/dropSeqPipe/tree/feature/split_species_fix

Suggest workflow:

change expected_cells in samples.csv
remove or backup all top_barcodes.csv In zsh this can be done like this ls **/*top_barcodes.csv | xargs -t -I{} -p mv {} {}.bak
run snakemake snakemake --use-conda -rp --conda-prefix ~/.conda/myevns --cores 32 --directory ~/analysis/dropseq/data/project

seb-mueller commented 5 years ago

Update, @Hoohm mentioned the --notemp flag of snakemake, which can be used to prevent deleting temporay files. In that case I could undo this line https://github.com/Hoohm/dropSeqPipe/blob/63b3dc520fa4fb54f37c539220fa554de03437dd/rules/map.smk#L99 marking it as temp again.

However all the other changes are still valid in my mind, @Hoohm, could you check this?

Hoohm commented 4 years ago

Yep, I'm closing this

Hoohm / dropSeqPipe

Rerun with changed expected_cells in samples.csv requires remapping #77