CDPHE-bioinformatics / CDPHE-SARS-CoV-2

Workflows and scripts for the assembly and analysis of SARS-CoV-2 whole genome tiled amplicon sequencing.
https://cdphe-bioinformatics.github.io/CDPHE-SARS-CoV-2/
GNU General Public License v3.0

Rename WDL tasks to be more specific, standardized, or otherwise improved #22

Open danpolanco opened 7 months ago

danpolanco commented 7 months ago

Feature Request

This issue is a solicitation for feedback on an idea.

I've been working on the RSV pipeline and reusing tasks from this repo (i.e. CDPHE-SARS-CoV-2). I haven't renamed any of the tasks because I don't want to break consistency. I do, however, believe we could improve our task names.

Solution

There are a lot of possible task names, so I put together a very rough diagram as a starting point for discussion:

graph TD
    subgraph Summary Workflow
    A3(Assembly Files) --> B3
    B3["Concatenate Fastas
        Current: concatenate
        Possibility: concatenate_fastas"] --> G3
    A3 --> D3
    D3["Concatenate Clades
        Current: concatentate_nextclade
        Possibility: concatenate_clades
        Possibility: summarize_clades"] --> G3
    A3 --> F3
    F3["Make Results Files
        Current: results_table
        Possibility: make_results_file
        Possibility: summarize_results"] --> G3
    G3{{Summary Files}} --> H3
    H3["Transfer Summary Files
        Current: transfer_summary
        Possibility: transfer_summary_files"] --> I3
    I3{{Cloud Bucket}}
    end

    subgraph Assembly Workflow
    A1(Raw Reads) --> B1
    A1 --> C1
    B1["Clean Reads
        Current: seqyclean
        Possibility: clean_reads"] --> D1
    C1["QC Reads
        Current: fastqc
        Possibility: qc_reads"] --> N1
    D1["Align Reads
        Current: align_reads"] --> E1
    E1["Trim Reads
        Current: ivar_trim
        Possibility: trim_reads"] --> F1
    E1 --> G1
    E1 --> H1
    F1["Call Variants
        Current: ivar_var
        Possibility: call_variants"] --> N1
    G1["Make Consensus
        Current: ivar_consensus
        Possibility: make_consensus"] --> I1
    H1["Make BAM Stats
        Current: bam_stats
        Possibility: make_bam_stats"] --> N1
    I1["Rename Consensus
        Current: rename_fasta
        Possibility: rename_consensus"] --> J1
    I1 --> K1
    J1["Calculate Percent Coverage
        Current: calc_percent_cvg
        Possibility: make_percent_coverage
        Possibility: summarize_percent_coverage"] --> N1
    K1["Call Clades
        Current: nextclade
        Possibility: call_clades"] --> L1
    L1["Parse Clades
        Current: parse_nextclade
        Possibility: parse_clades"] --> N1
    N1(Assembly Files) --> O1
    O1("Transfer Assembly Files
        Current: transfer_assembly
        Possibility: transfer_assembly_files") --> P1
    P1{{Cloud Bucket}}
    end

Note that I put this diagram together quickly, and it doesn't reflect the current SARS-CoV-2 pipeline.

One point to consider, suggested by @arianna-smith, is to keep the tool name in the task name. For example, instead of just clean_reads some possibilities are:

Some other considerations are how we are:

All the examples given above are to generate discussion rather than suggest a hard requirement.

Upstream effects

I don't believe changing the task names will have any upstream effects.

Downstream effects

I don't believe changing the task names will have any downstream effects.

molly-hetheringtonrauth commented 7 months ago

We will add to the New version release milestone and tackle this post Silver Pancake.

danpolanco commented 5 months ago

As a group we decided on some changes we'd like to make.

SC2_illumina_pe_assembly.wdl

Variable Names

| original | new |
| --- | --- |
| sample_name | sample_name |
| fastq_1 | fastq_1 |
| fastq_2 | fastq_2 |
| primer_bed | primer_bed |
| adapters_and_contaminants | contam_fasta |
| covid_genome | sc2_ref_fasta |
| covid_gff | sc2_ref_gff |
| scrub_reads | scrub_reads |
| scrub_genome_index | scrub_genome_index |
| project_name | project_name |
| out_dir | out_dir |
| seq_method | seq_platform |
| s_gene_amplicons | sc2_s_gene_amplicons_bed |
| calc_percent_coverage_py | calc_percent_coverage_py |
| version_capture_py | version_capture_py |

Tasks

| original | new |
| --- | --- |
| hostile_task | scrub_reads_hostile |
| seqyclean | filter_reads_seqyclean |
| fastqc | assess_quality_fastqc |
| align_reads | align_reads_bwa |
| ivar_trim | trim_primers_ivar |
| ivar_var | call_variants_ivar |
| ivar_consensus | call_consensus_ivar |
| bam_stats | calc_bam_stats_samtools |
| rename_fasta | rename_fasta |
| calc_percent_cvg | calc_percent_coverage |
| version_capture | capture_versions |
| transfer | transfer_outputs |
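
If it helps, here is a rough sketch of how the renames in the task table above could be applied to a WDL file mechanically. It is only an illustration: `rename_tasks` and `ILLUMINA_TASK_RENAMES` are hypothetical names, and the sketch deliberately touches only `task <name>` definitions and `call <name>` / `call <namespace>.<name>` statements, so tool invocations inside `command` blocks (e.g. `fastqc`, `seqyclean`) are left alone. References to call outputs elsewhere in the workflow (e.g. `fastqc.some_output`) are not rewritten and would still need a manual pass.

```python
import re

# Old -> new task names, copied from the SC2_illumina_pe_assembly.wdl task table.
# rename_fasta is omitted because its name does not change.
ILLUMINA_TASK_RENAMES = {
    "hostile_task": "scrub_reads_hostile",
    "seqyclean": "filter_reads_seqyclean",
    "fastqc": "assess_quality_fastqc",
    "align_reads": "align_reads_bwa",
    "ivar_trim": "trim_primers_ivar",
    "ivar_var": "call_variants_ivar",
    "ivar_consensus": "call_consensus_ivar",
    "bam_stats": "calc_bam_stats_samtools",
    "calc_percent_cvg": "calc_percent_coverage",
    "version_capture": "capture_versions",
    "transfer": "transfer_outputs",
}

# Matches "task NAME", "call NAME", and "call namespace.NAME".
CALL_OR_TASK = re.compile(
    r"\b(task|call)(\s+)((?:[A-Za-z_]\w*\.)*)([A-Za-z_]\w*)"
)


def rename_tasks(wdl_text, renames):
    """Rename tasks in task definitions and call statements only."""

    def _swap(match):
        keyword, space, namespace, name = match.groups()
        # Names without an entry in the mapping are kept unchanged.
        return keyword + space + namespace + renames.get(name, name)

    return CALL_OR_TASK.sub(_swap, wdl_text)
```

Existing call aliases (`call tasks.fastqc as fastqc_raw`) are kept as written, so anything that refers to the alias keeps working after the rename.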

SC2_ont_assembly.wdl

Tasks

| original | new |
| --- | --- |
| ListFastqFiles | list_fastqs |
| Demultiplex | demultiplex_guppy |
| concatenate_fastqs | concatenate_fastqs |
| Read_Filtering | filter_reads_guppyplex |
| Medaka | call_artic_minion_medaka |
| exit_wdl | exit_wdl |
| Bam_stats | calc_bam_stats_samtools |
| Scaffold | scaffold_pyscaf |
| rename_fasta | rename_fasta |
| calc_percent_cvg | calc_percent_coverage |
| get_primer_site_variants | get_primer_variants_bcftools |
| transfer | transfer_outputs |
| hostile_task | scrub_reads_hostile |
| version_capture | capture_versions |

Variables

| original | new |
| --- | --- |
| gcs_fastq_dir | fastq_dir |
| sample_name | sample_name |
| index_1_id | barcode_id |
| primer_set | Remove and create max_read_length variable with default set to 700 |
| barcode_kit | barcode_kit |
| medaka_model | medaka_model |
| scrub_reads | scrub_reads |
| Scrub_genome_index | scrub_genome_index |
| covid_genome | sc2_ref_fasta |
| primer_bed | primer_bed |
| s_gene_primer_bed | sc2_s_gene_amplicons_bed |
| s_gene_amplicons | sc2_s_gene_amplicons_tsv |
| project_name | project_name |
| calc_percent_coverage_py | calc_percent_coverage_py |
| version_capture_py | version_capture_py |
| out_dir | out_dir |

SC2_lineage_calling_and_results.wdl

Tasks

| original | new |
| --- | --- |
| concatentate | concatenate_fastas |
| pangolin | assign_lineage_pangolin |
| nextclade | assign_clade_nextclade |
| version_capture | capture_workflow_versions |
| parse_nextclade | parse_nextclade |
| results_table | summarize_results |
| create_version_capture_file | capture_task_versions |
| transfer | transfer_outputs |

Variables

Staying the same as before.
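
Since several tasks appear in more than one workflow, it may be worth checking that a shared original name maps to the same new name everywhere. The sketch below uses only an excerpt of the task tables above (the shared tasks), and `RENAMES_BY_WORKFLOW` and `divergent_renames` are hypothetical names. Original names are compared case-insensitively so that `Bam_stats` and `bam_stats` count as the same task, which is an assumption on my part.

```python
from collections import defaultdict

# Excerpt of the task tables: only tasks that occur in more than one workflow.
RENAMES_BY_WORKFLOW = {
    "SC2_illumina_pe_assembly.wdl": {
        "hostile_task": "scrub_reads_hostile",
        "bam_stats": "calc_bam_stats_samtools",
        "version_capture": "capture_versions",
        "transfer": "transfer_outputs",
    },
    "SC2_ont_assembly.wdl": {
        "hostile_task": "scrub_reads_hostile",
        "Bam_stats": "calc_bam_stats_samtools",
        "version_capture": "capture_versions",
        "transfer": "transfer_outputs",
    },
    "SC2_lineage_calling_and_results.wdl": {
        "version_capture": "capture_workflow_versions",
        "transfer": "transfer_outputs",
    },
}


def divergent_renames(renames_by_workflow):
    """Return original task names that map to more than one new name."""
    seen = defaultdict(set)
    for renames in renames_by_workflow.values():
        for old, new in renames.items():
            # Assumption: original names differing only in case are the same task.
            seen[old.lower()].add(new)
    return {old: sorted(news) for old, news in seen.items() if len(news) > 1}
```

With the excerpt above, the only name flagged is `version_capture`, which becomes `capture_versions` in the two assembly workflows and `capture_workflow_versions` in the lineage workflow. That may well be intentional, given that the lineage workflow also has `capture_task_versions`.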

danpolanco commented 5 months ago

We should also consider how we organize tasks. For example, if we use transfer_outputs for both assembly and summary, we can't keep those two tasks in the same file (e.g. transfer_tasks.wdl) because the names conflict.
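
As a quick way to catch this kind of conflict before it bites, a small check along these lines could flag task names that are defined more than once in a single WDL file. This is only a sketch: `duplicate_task_names` is a hypothetical helper, and it relies on a simple regex over `task <name>` lines rather than a real WDL parser, so it could be fooled by a `task` line inside a command block or comment.

```python
import re
from collections import Counter

# A task definition is assumed to start a line (after optional whitespace).
TASK_DEF = re.compile(r"^\s*task\s+([A-Za-z_]\w*)", re.MULTILINE)


def duplicate_task_names(wdl_text):
    """Return the sorted task names defined more than once in wdl_text."""
    counts = Counter(TASK_DEF.findall(wdl_text))
    return sorted(name for name, n in counts.items() if n > 1)
```

Running it on a combined transfer_tasks.wdl that holds both the assembly and summary versions of transfer_outputs would return that name, which is the situation described above.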

I'm trialing the following in CDPHE-bioinformatics/CDPHE-RSV#4:

Additionally, using call_ as a prefix might be confusing, since miniwdl, at least, already uses call- as a prefix:

image

But that seems fairly minor?