EnzymeFunctionInitiative / EST

Programs for creating Sequence Similarity Networks and wrappers for pipeline submission on a torque cluster
GNU General Public License v3.0

Add pipeline DAGs to docs #73

Open 1ndy opened 3 months ago

1ndy commented 3 months ago

Nextflow has the ability to create a visual representation of a pipeline. This might help someone understand how the tool operates. Generate a DAG diagram for each pipeline and include it with the Sphinx documentation. There are a variety of output formats to choose from; try the mmd output option with the sphinx plugin for mermaid diagrams and if it does not work well, see if the HTML version can be included in the docs. If neither of those options work well, render an SVG or PNG version of the DAG. Pick an appropriate location to store these files.

Within the docs, these images should fit well on the index page for their respective pipelines, but feel free to explore other options.
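If the mmd route works, wiring the sphinxcontrib-mermaid plugin into the Sphinx build could be a minimal sketch like the following (the extension name is from that plugin; the exact position in our existing `extensions` list is an assumption):

```python
# conf.py -- sketch only: enable the mermaid extension
# (requires: pip install sphinxcontrib-mermaid)
extensions = [
    # ... existing extensions ...
    "sphinxcontrib.mermaid",
]
```

The plugin's `mermaid` directive accepts an external file argument, so an index page could then pull in the generated diagram with something like `.. mermaid:: ../assets/est.mmd` (the path is a placeholder).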

An example command for generating the DAG would be:

nextflow -C conf/est/docker.config run pipelines/est/est.nf -preview -params-file params.yml -with-dag assets/est.mmd

Current pipelines to render images for:

Finally, consider adding to the docs-html Makefile target a command which renders these pipeline diagrams every time the documentation is built. This will ensure that they are always up-to-date.
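A sketch of that Makefile idea, using the example command above (the `docs-html` recipe body and the pipeline/config paths are assumptions and would need to match the actual repo layout):

```make
# Sketch only: regenerate pipeline DAGs before building the HTML docs,
# so the diagrams are always in sync with the pipeline code.
PIPELINES := est

docs-html: dags
	$(MAKE) -C docs html

dags:
	for p in $(PIPELINES); do \
	    nextflow -C conf/$$p/docker.config run pipelines/$$p/$$p.nf \
	        -preview -params-file params.yml -with-dag assets/$$p.mmd; \
	done
```

If SVG/PNG output is needed instead, the mermaid CLI (`mmdc -i assets/est.mmd -o docs/_static/est.svg`, from the `@mermaid-js/mermaid-cli` npm package) could be added as a step in the `dags` recipe.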

rbdavid commented 3 weeks ago

A quick implementation of the -with-dag parameter for module file 01_est_sequence_blast.sh. As far as I can tell, the resulting mermaid graph is not immediately portable into the documentation. See below:

flowchart TB
    v0([get_sequence_ids])
    subgraph " "
    v1[" "]
    v2[" "]
    v3[" "]
    v4[" "]
    v18[" "]
    v19[" "]
    v21[" "]
    end
    v5([split_sequence_ids])
    v7([get_sequences])
    v9([cat_fasta_files])
    v10([create_blast_db])
    v11([blastreduce_transcode_fasta])
    v12([split_fasta])
    v14([all_by_all_blast])
    v16([blastreduce])
    v17([compute_stats])
    v20([visualize])
    v6(( ))
    v8(( ))
    v13(( ))
    v15(( ))
    v0 --> v5
    v0 --> v4
    v0 --> v3
    v0 --> v2
    v0 --> v1
    v5 --> v6
    v6 --> v7
    v7 --> v8
    v8 --> v9
    v9 --> v10
    v9 --> v11
    v9 --> v12
    v10 --> v14
    v11 --> v16
    v11 --> v17
    v12 --> v13
    v13 --> v14
    v14 --> v15
    v15 --> v16
    v16 --> v17
    v17 --> v20
    v17 --> v19
    v17 --> v18
    v20 --> v21

By default, the file type written by -with-dag is an HTML file (great! I can copy the body into this message. very nice). BUT, since the single test only exercises one branch of the est.nf pipeline, the DAG doesn't include the other branches' flow(s). So we can't automatically port one test's DAG into the docs, or at least not as I imagined.

Also, these DAGs don't carry much information for human readers; branches in the workflows aren't labeled (e.g. sequence blast vs fasta vs family vs accession, and the respective A, B, C, D notation used by devs). Outputs from the workflows are just empty boxes, which is zero or negative information content. Much is left to be desired.

rbdavid commented 3 weeks ago

Oh, on visual inspection of the actual steps in the above DAG, nextflow is not creating a correct graph for the "blast" branch of the est.nf file. I'm not sure if an automated parsing of nextflow-created DAGs for each branch would even be worthwhile since inaccuracies would be hard to detect.

rbdavid commented 3 weeks ago

---
title: EST
---
flowchart TB
    v0([Input parameters defined in params.yml])
    v0a((if params.import_mode == 'fasta'))
    v0b((else))
    v1([import_fasta])
    v0 --> v0a
    v0 --> v0b
    subgraph " "
    v0a --> v1
    end
    subgraph " "
    v2([get_sequence_ids])
    v3([split_sequence_ids])
    v4([get_sequences])
    v5([cat_fasta_files])
    v0b --> v2
    v2 --> v3
    v3 --> v4
    v4 --> v5
    end
    v6((if params.multiplex))
    v7([multiplex])
    v8([create_blast_db])
    v1 --> v6
    v1 --> v8
    v5 --> v6
    v6 --> v7
    v7 --> v8
    v5 --> v8
    v9([blastreduce_transcode_fasta])
    v8 --> v9
    v10([split_fasta])
    v11([all_by_all_blast])
    v12([blastreduce])
    v9 --> v10
    v10 --> v11
    v11 --> v12
    v13((if params.multiplex))
    v14([demultiplex])
    v15([compute_stats])
    v12 --> v13
    v13 --> v14
    v14 --> v15
    v12 --> v15
    v16([visualize])
    v15 --> v16

Created this by hand. Pretty easy to edit now that the backbone is ready.

rbdavid commented 3 weeks ago

I'm doing a bit of fine-tuning. This is certainly not something that the automated nextflow -with-dag parameter will output.

rbdavid commented 2 weeks ago

Here's the updated visualization for the est.nf pipeline:

---
config:
    look: classic
    theme: forest
---
flowchart TB
    start((start))
    v0[\Input parameters defined in params.yml\]
    start --> v0
    subgraph " "
    v0a{if params.import_mode == 'fasta'}
    v1(import_fasta)
    v2(get_sequence_ids)
    v3(split_sequence_ids)
    v4(get_sequences)
    v5(cat_fasta_files)
    v0 --> v0a
    v0a -->|true| v1
    v0a -->|false| v2
    v2 --> v3
    v3 --> v4
    v4 --> v5
    end
    subgraph " "
    v6{if params.multiplex}
    v7(multiplex)
    v1 --> v6
    v5 --> v6
    v6 -->|true| v7
    end
    subgraph " "
    v8(create_blast_db)
    v9(blastreduce_transcode_fasta)
    v7 --> v8
    v6 -->|false| v8
    v8 --> v9
    end
    subgraph " "
    v10(split_fasta)
    v11(all_by_all_blast)
    v12(blastreduce)
    v9 --> v10
    v10 --> v11
    v11 --> v12
    end
    subgraph " "
    v13{if params.multiplex}
    v14(demultiplex)
    v12 --> v13
    v13 -->|true| v14
    end
    v15(compute_stats)
    v16(visualize)
    v14 --> v15
    v13 -->|false| v15
    v15 --> v16

    subgraph " "
    v17[/Graphs:
    pid vs aln score
    aln len vs aln score
    .../]
    v18[/1.out.parquet/]
    v19[/boxplot_stats.parquet
    evalue.tab
    acc_counts.json/]
    end
    v16 --> v17
    v14 --> v18
    v12 --> v18
    v15 --> v19

I'm fairly sure I haven't listed all of the output files from the steps. I'll get back to finishing this and making diagrams for the other pipelines once I've got a handle on my KBase tasks.