fairagro / M4.4_UC6_ARC

UC6 workflow in ARC proof-of-concept

CWLMake #9

Closed hvwaldow closed 1 month ago

hvwaldow commented 3 months ago
JensKrumsieck commented 3 months ago

Prior art: besides https://github.com/tom-tan/zatsu-cwl-generator (last commit 2 years ago), there is also:

Rabix Composer, a tool for visual editing of CWL files (last commit 3 years ago), which uses CWL-SVG (last commit 1 year ago). Composer is now developed as a closed-source tool for sbgenomics.

Various libraries for Python, JS/TS, C++, C#/F#, D, Java and R, most of them auto-generated from the specs.

JensKrumsieck commented 3 months ago

Comparison between workflow languages executing a simple python script with a string parameter and writing back an output file:

CWL:

```cwl
cwlVersion: v1.2
class: CommandLineTool

requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: print.py
        entry:
          $include: ./print.py

inputs:
  message:
    type: string
    default: Hello World
    inputBinding:
      position: 1

outputs:
  output_file:
    type: File
    outputBinding:
      glob: helloworld.txt

baseCommand: [python3, ./print.py]
```

CWL (alternative):

```cwl
cwlVersion: v1.2
class: CommandLineTool

inputs:
  file:
    type: File
    default:
      class: File
      path: ./print.py
    inputBinding:
      position: 1
  message:
    type: string
    default: Hello World
    inputBinding:
      position: 2

outputs:
  output_file:
    type: File
    outputBinding:
      glob: helloworld.txt

baseCommand: python3
```

Nextflow:

```nextflow
process helloworld {
    input:
    val greeting

    output:
    file 'helloworld.txt'

    script:
    """
    python3 ${projectDir}/print.py \"${greeting}\"
    """
}

params.greeting = "Hello World"

workflow {
    helloworld(params.greeting)
}
```

SnakeMake:

```snakemake
rule HelloWorld:
    input:
        thefile="input.txt"
    output:
        "helloworld.txt"
    shell:
        "greeting=\"$(cat {input.thefile})\" && "
        "python3 ./print.py \"$greeting\""
```

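The print.py script that these definitions call is not shown in this thread. A minimal sketch of what it might look like, assuming it takes the greeting as its first positional argument and writes it to helloworld.txt (the function names and the fallback default are hypothetical):

```python
"""Hypothetical stand-in for print.py; the real script is not shown in this thread."""
import sys
from pathlib import Path


def write_message(message: str, path: str = "helloworld.txt") -> Path:
    """Write the message to the output file and return its path."""
    out = Path(path)
    out.write_text(message + "\n", encoding="utf-8")
    return out


def main(argv: list[str]) -> None:
    # The workflow definitions pass the greeting as the first positional argument.
    # Falling back to "Hello World" is an assumption mirroring the CWL default.
    message = argv[0] if argv else "Hello World"
    write_message(message)


if __name__ == "__main__":
    main(sys.argv[1:])
```

The output file name matches the `glob: helloworld.txt` and `file 'helloworld.txt'` declarations, which is what lets each runner collect the result.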
hvwaldow commented 2 months ago
JensKrumsieck commented 2 months ago

Currently the CommandLineTools are created by a set of Python scripts that run regular expressions over the R files, combined with some hints in the comments of those files, which was the interim solution to complete
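To illustrate the general idea of regex-plus-comment-hints, here is a minimal sketch. It is not the project's actual scripts: the hint syntax (`# cwl:input`, `# cwl:output`), the function name `tool_from_r_source` and the choice to call `Rscript` directly are all assumptions made for this example.

```python
import json
import re

# Hypothetical hint syntax inside R comments (an assumption, not the project's format):
#   # cwl:input <name> <type>
#   # cwl:output <name> <glob>
HINT = re.compile(
    r"^#[ \t]*cwl:(input|output)[ \t]+(\S+)[ \t]+(\S+)[ \t]*$", re.MULTILINE
)


def tool_from_r_source(script_name: str, r_source: str) -> dict:
    """Build a CWL CommandLineTool description from comment hints in an R script."""
    inputs, outputs = {}, {}
    for kind, name, value in HINT.findall(r_source):
        if kind == "input":
            # Inputs are passed positionally in the order the hints appear.
            position = len(inputs) + 1
            inputs[name] = {"type": value, "inputBinding": {"position": position}}
        else:
            outputs[name] = {"type": "File", "outputBinding": {"glob": value}}
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": ["Rscript", script_name],
        "inputs": inputs,
        "outputs": outputs,
    }


if __name__ == "__main__":
    example = "# cwl:input message string\n# cwl:output result helloworld.txt\n"
    print(json.dumps(tool_from_r_source("print.R", example), indent=2))
```

The sketch prints JSON only to avoid a YAML dependency; since JSON is a subset of YAML, the result should be convertible to the usual CWL notation without loss.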

JensKrumsieck commented 2 months ago

One could also use libraries such as BaklavaJS as a node graph frontend to create Workflows, similar to the deprecated Composer app ...

I built a quick (local) prototype yesterday. It loads CWL CommandLineTools from disk (using the FileSystemHandle API) and adds them as nodes. I think the graph could be exported as a Workflow, but this is not implemented yet.
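The missing export step could look roughly like the following. This is a sketch only, written in Python rather than the prototype's JavaScript; the `Node` and `Edge` classes are a hypothetical graph model, and all port types are set to the placeholder `Any` instead of being copied from the tool definitions.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One graph node: a CommandLineTool file plus its port names (hypothetical model)."""
    step_id: str
    tool_path: str
    inputs: list[str]
    outputs: list[str]


@dataclass
class Edge:
    """Connects an output port of one node to an input port of another."""
    source_step: str
    source_port: str
    target_step: str
    target_port: str


def export_workflow(nodes: list[Node], edges: list[Edge]) -> dict:
    """Turn the node graph into a CWL Workflow description."""
    wired = {(e.target_step, e.target_port): f"{e.source_step}/{e.source_port}" for e in edges}
    consumed = {(e.source_step, e.source_port) for e in edges}
    wf_inputs, wf_outputs, steps = {}, {}, {}
    for node in nodes:
        step_in = {}
        for port in node.inputs:
            key = (node.step_id, port)
            if key in wired:
                step_in[port] = wired[key]
            else:
                # Unconnected input ports become workflow-level inputs.
                wf_name = f"{node.step_id}_{port}"
                wf_inputs[wf_name] = "Any"
                step_in[port] = wf_name
        for port in node.outputs:
            if (node.step_id, port) not in consumed:
                # Output ports nobody consumes become workflow-level outputs.
                wf_outputs[f"{node.step_id}_{port}"] = {
                    "type": "Any",
                    "outputSource": f"{node.step_id}/{port}",
                }
        steps[node.step_id] = {"run": node.tool_path, "in": step_in, "out": list(node.outputs)}
    return {
        "cwlVersion": "v1.2",
        "class": "Workflow",
        "inputs": wf_inputs,
        "outputs": wf_outputs,
        "steps": steps,
    }
```

Referencing each tool by path in `run` keeps the CommandLineTools as separate, reusable files.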

There is also CWL-SVG, which is used by the Composer app, but issues from 2018 are still open without an answer, and the standalone sample no longer works because a package registry it depends on no longer exists.

JensKrumsieck commented 2 months ago

See

JensKrumsieck commented 2 months ago

There is also WDL (Workflow Description Language), for which a converter to CWL already seems to exist (https://github.com/common-workflow-lab/wdl-cwl-translator, last commit yesterday), and a huge number of tools is available on Dockstore. Both CWL and WDL are supported by the Toil runner: https://toil.readthedocs.io/en/latest/ OpenWDL released version 1.2 of their spec earlier this year. The syntax looks as if CWL and Nextflow had children ^^ 🤔

```WDL
version 1.2

task hello_task {
  input {
    File infile
    String pattern
  }

  command <<<
    grep -E '~{pattern}' '~{infile}'
  >>>

  requirements {
    container: "ubuntu:latest"
  }

  output {
    Array[String] matches = read_lines(stdout())
  }
}

workflow hello {
  input {
    File infile
    String pattern
  }

  call hello_task {
    infile, pattern
  }

  output {
    Array[String] matches = hello_task.matches
  }
}
```

hvwaldow commented 2 months ago

WDL as an intermediate language?

https://github.com/common-workflow-lab/wdl-cwl-translator

hvwaldow commented 2 months ago

WDL or Nextflow?

Jens will draw up a pro and contra list. The decision should also be made before the next meeting.

JensKrumsieck commented 1 month ago

CWL vs WDL vs Nextflow

tl;dr: CWL is [suboptimal/verbose/not used] but seems to be the best tool for our use case

Numbers

| | CWL | WDL | Nextflow |
|---|---|---|---|
| **GitHub** | | | |
| # of GitHub repos | 1k | 1k | 5k |
| # of GitHub users | 163 | 249 | 1k |
| # of GitHub stars | 1.4k | 759 | 2.6k |
| # of contributors | 65 | 51 | 170 |
| Last commit to main spec repo | last year (2 weeks to spec 1.2 repo) | 3 months | 2 days |
| License | Apache 2.0 | BSD 3-Clause | Apache 2.0 |
| **Entries on...** | | | |
| ... WorkflowHub | 81 | 12 | 129 |
| ... Dockstore | 226 | 3245 | 129 |
| ... nf-core | 0 | 0 | 97 |

CWL has common BioTools at https://github.com/common-workflow-library/bio-cwl-tools

Who?

CWL: Community driven with a governance committee (members from Arvados, Seven Bridges Genomics, University of Manchester, ... + 1 Galaxy & 1 WDL member)
WDL: Community driven with a governance committee (members from Chan Zuckerberg Initiative, Microsoft, Amazon, Broad Institute, DNAStack, ...)
Nextflow: Seqera Labs, Centre for Genomic Regulation (funding: Chan Zuckerberg Initiative, Seqera)

Hello World Workflow (Syntax comparison)

CWL:

```cwl
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    default: "Hello World"
    inputBinding:
      position: 1
outputs: []
```

WDL:

```WDL
version 1.0

workflow HelloWorld {
  call WriteGreeting
}

task WriteGreeting {
  command {
    echo "Hello World"
  }
  output {
    File output_greeting = stdout()
  }
}
```

Nextflow:

```Nextflow
params.str = 'Hello World'

process greeting {
    input:
    val greeting

    output:
    stdout

    """
    echo ${greeting}
    """
}

workflow {
    greeting(params.str)
}
```

Calling a Script (Syntax comparison)

CWL:

```cwl
cwlVersion: v1.2
class: CommandLineTool

requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: print.py
        entry:
          $include: ./print.py

inputs:
  message:
    type: string
    default: Hello World
    inputBinding:
      position: 1

outputs:
  messages:
    type: stdout

baseCommand: [python3, ./print.py]
```

WDL:

```WDL
version 1.1

task greeting {
  input {
    String the_input
    File the_file
  }
  command {
    python ~{the_file} ~{the_input}
  }
  output {
    File result = stdout()
  }
  runtime {
    container: "python:latest"
  }
}

workflow HelloWF {
  input {
    String the_input
    File the_file = "print.py"
  }
  call greeting {
    input:
      the_input = the_input,
      the_file = the_file
  }
  output {
  }
}
```

Nextflow:

```Nextflow
process helloworld {
    input:
    val greeting

    output:
    file 'helloworld.txt'

    script:
    """
    python3 ${projectDir}/print.py \"${greeting}\"
    """
}

params.greeting = "Hello World"

workflow {
    helloworld(params.greeting)
}
```

The local runner "miniWDL" does not support version 1.2 as of now! ⛔

As far as I can see, you cannot use a local file without passing it as a parameter, unless it is part of the container. This is true for both WDL and Nextflow and could be a dealbreaker! ⛔

To make it work with the parameter set to a default value, one has to add a config file (miniWDL throws an error otherwise): ⛔

[file_io] 
allow_any_input = true

The WDL2CWL translator works and outputs the file below, which looks suboptimal but works. However, this is a very simple WDL script! The translator repo contains some test cases that are more complicated...

WDL:

```WDL
version 1.1

task greeting {
  input {
    String the_input
    File the_file
  }
  command {
    python ~{the_file} ~{the_input}
  }
  output {
    File result = stdout()
  }
  runtime {
    container: "python:latest"
  }
}

workflow HelloWF {
  input {
    String the_input
    File the_file = "print.py"
  }
  call greeting {
    input:
      the_input = the_input,
      the_file = the_file
  }
  output {
  }
}
```

CWL:

```CWL
cwlVersion: v1.2
id: HelloWF
class: Workflow
requirements:
  - class: InlineJavascriptRequirement
inputs:
  - id: the_input
    type: string
  - id: the_file
    default:
      class: File
      path: print.py
    type: File
steps:
  - id: greeting
    in:
      - id: the_input
        source: the_input
      - id: the_file
        source: the_file
    out:
      - id: result
    run:
      class: CommandLineTool
      id: greeting
      inputs:
        - id: the_input
          type: string
        - id: the_file
          type: File
      outputs:
        - id: result
          type: stdout
      requirements:
        - class: InitialWorkDirRequirement
          listing:
            - entryname: script.bash
              entry: |4
                  python $(inputs.the_file.path) $(inputs.the_input)
        - class: InlineJavascriptRequirement
        - class: NetworkAccess
          networkAccess: true
      hints:
        - class: ResourceRequirement
          outdirMin: 1024
      cwlVersion: v1.2
      baseCommand:
        - bash
        - script.bash
outputs: []
```

I also asked ChatGPT to wrap the code into a CWL CommandLineTool using Docker, which in this case worked surprisingly well... The only issue: the file "output.txt" is not copied back to the local directory when used like this, so I had to change a very small bit.

ChatGPT:

```cwl
cwlVersion: v1.0
class: CommandLineTool
inputs:
  input_string:
    type: string
    inputBinding:
      position: 1
outputs: []
stdout: output.txt
baseCommand: python
arguments:
  - -c
  - |
    import sys
    print(sys.argv[1])
hints:
  DockerRequirement:
    dockerPull: python:3.9
```

Works as expected:

```cwl
cwlVersion: v1.0
class: CommandLineTool
inputs:
  input_string:
    type: string
    inputBinding:
      position: 1
outputs:
  output.txt:
    type: stdout
baseCommand: python
arguments:
  - -c
  - |
    import sys
    print(sys.argv[1])
hints:
  DockerRequirement:
    dockerPull: python:3.9
```

Original file from above (not using Docker):

```cwl
cwlVersion: v1.2
class: CommandLineTool

requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: print.py
        entry:
          $include: ./print.py

inputs:
  message:
    type: string
    default: Hello World
    inputBinding:
      position: 1

outputs:
  messages:
    type: stdout

baseCommand: [python3, ./print.py]
```

The target format should still be CWL, as this is what is accepted in the community. However, since the translator for WDL is available, one could encourage users to also write WDL. A drawback is that the translator produces a single file, so individual CWL CommandLineTools cannot be mixed and matched without manually splitting the file. A Nextflow converter is not available, as Nextflow is far more powerful. One could implement one for a subset of features, since the typical use case seems to be script execution and there are no widespread tools as there are in the bioinformatics field.
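Splitting the translator's single-file output could in principle be automated. The following is a minimal sketch, assuming the Workflow has already been loaded into a Python dictionary (for example with a YAML parser) and uses the list form of `steps` seen in the translated HelloWF file above. The function name `split_embedded_tools` and the `<step id>.cwl` file naming are hypothetical.

```python
import copy


def split_embedded_tools(workflow: dict) -> tuple[dict, dict[str, dict]]:
    """Replace embedded CommandLineTools with file references.

    Returns a modified copy of the workflow and a mapping of
    file name -> extracted tool document. The input is left untouched.
    """
    result = copy.deepcopy(workflow)
    tools: dict[str, dict] = {}
    for step in result.get("steps", []):
        run = step.get("run")
        if isinstance(run, dict) and run.get("class") == "CommandLineTool":
            file_name = f"{step['id']}.cwl"
            # A standalone tool file needs its own cwlVersion.
            run.setdefault("cwlVersion", result.get("cwlVersion", "v1.2"))
            tools[file_name] = run
            step["run"] = file_name
    return result, tools
```

Writing each entry of the returned mapping to disk next to the workflow file would yield individually reusable CommandLineTools; steps that already reference a file are passed through unchanged.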

Looking at the r/bioinformatics subreddit, Nextflow seems to be the only one of these three languages that is adopted widely enough to have recent threads about it.

One could still consider GUI tools like Rabix Composer for CWL (which is deprecated), as this is what HELIPORT seems to use, judging from screenshots, or use templates for these special use cases.

Other Consortia

NFDI4Biodiv, however, plans to use Nextflow according to their latest proposal. This is what is supported by CloWM (developed by NFDI4Microbiota), whereas DataPLANT uses/wants CWL for ARCs.

Note: There is a requirements document (Requirements on workflow tools) from NFDI4Ing available: https://nfdi4ingscientificworkflowrequirements.readthedocs.io/en/latest/docs/requirements.html#evaluation

Opinionated Pro/Contra List

CWL WDL Nextflow
General:
Verbosity CWL is mega verbose
Documentation 🔘 🔘 overall ok
Script Execution in Docker works best in CWL "requirements" can be cool
Working with containers works with all, is default in WDL
Output into filesystem CWL outputs req. files, others spam logs into fs
Speed 🔘 🔘 🔘 all about the same
Parsable 🔘 🔘 🔘 = Grammar available, CWL=YAML
Simplicity:
official GUI Galaxy has one
Ease of first use 🔘 🔘
Metadata Only CWL supports annotation
Conversion:
Convertible to CWL
Convertible to WDL 🔘/❌ CWL: outdated tool
Convertible to Nf 🔘/❌ CWL: outdated tool
Community:
Size of Community 🔘/🔘 🔘/✅ ✅/🔘 by # of Repos and Tools
Forum/StackOverflow/Reddit Nextflow most active

✅ good 🔘 ok ❌ bad ◽ invalid

However simple bash scripts would win most categories 😜

After all the testing I did, CWL still seems to be the right choice for our use cases (most likely: executing an existing script inside a container). CWL might be verbose, but it gives users the highest level of control. For bioinformatics, Nextflow and WDL are most likely the best picks, as all tools are already available and it is a small set of command-line tools that can easily be installed. However, both expect scripts (like the BL ones) as part of their inputs, which makes the scripts overridable. One would just have to wrap "Rscript" as a tool, which is OK, but the README graph, for example, would then say "Rscript" for each step...

Opinions?

JensKrumsieck commented 1 month ago

superseded by https://github.com/fairagro/m4.4_concept/issues/12