common-workflow-language / cwltool

Common Workflow Language reference implementation
https://cwltool.readthedocs.io/
Apache License 2.0

too slow validating some packed workflows #1063

Open mr-c opened 5 years ago

mr-c commented 5 years ago

The packed version of https://raw.githubusercontent.com/genome/analysis-workflows/fa0bf2a51b72cd0869253943b67aa8e271633945/definitions/pipelines/gathered_somatic_exome.cwl takes almost three minutes to validate on my laptop

mr-c commented 5 years ago
cwltool --pack https://raw.githubusercontent.com/NCI-GDC/gdc-dnaseq-cwl/3cb464a3a5c39cc060cd23d9c60918bc9ffb169b/workflows/bamfastq_align/etl.cwl > packed.cwl
/usr/bin/time cwltool --validate packed.cwl

The above clocks in at 5 minutes 40 seconds for me

mr-c commented 5 years ago

Current worst-case real-world CWL

cwltool --validate https://gist.github.com/mr-c/8d8597c76cdb9ae11ff8931792d489a8/raw/1107fcec6ac6f814b7d117d6a90b4fa7f7451038/etl.packed.cwl

[Images: "profile", "leaves profile"]

Comments from the external reviewer

Enclosed is the best way I found to summarize the application's profile. Basically it is a tree. The leaves seem to recursively execute the same function. I also included the detail of some leaves to illustrate that. The main concern is that the application is purely sequential. We could surely optimize some small parts of the existing app, but the real lever for reaching your target is parallelism.

tetron commented 5 years ago

How does the time to validate packed compare to the time to validate the original workflow split into multiple files?

Getting good parallelism is going to be hard with the Python GIL.

I would expect codegen to be faster because it combines the resolution and validation steps into a single pass instead of two passes, and it doesn't have to constantly refer back to the schema because everything is inlined.

Alternatively, we could probably come up with a better validation algorithm that runs in linear time, instead of the current recursive descent validation, which relies on expensive backtracking.
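
To make the backtracking cost concrete, here is a toy sketch (not cwltool or schema-salad code; every name in it is hypothetical). It assumes a union of three record types whose mismatch is only noticed after the children have been validated, and compares that with one possible non-backtracking approach: reading a `class` discriminator first and validating each node once.

```python
# Toy sketch only: these names are hypothetical, not cwltool or schema-salad APIs.
RECORD_CLASSES = ["A", "B", "C"]


def validate_backtracking(doc, counter):
    """Try each record type in turn. A mismatch is only detected after the
    children were validated, so that work is discarded and repeated."""
    for cls in RECORD_CLASSES:
        counter[0] += 1  # one validation attempt for this node
        children_ok = all(
            validate_backtracking(child, counter) for child in doc["children"]
        )
        if children_ok and doc["class"] == cls:
            return True
    return False


def validate_dispatch(doc, counter):
    """Check the discriminator first, then validate each child exactly once."""
    counter[0] += 1  # one validation attempt for this node
    if doc["class"] not in RECORD_CLASSES:
        return False
    return all(validate_dispatch(child, counter) for child in doc["children"])


def chain(depth, cls="C"):
    """Build a nested document: one child per level, every node of class `cls`."""
    doc = {"class": cls, "children": []}
    for _ in range(depth - 1):
        doc = {"class": cls, "children": [doc]}
    return doc


doc = chain(6)
backtracking_counter, dispatch_counter = [0], [0]
assert validate_backtracking(doc, backtracking_counter)
assert validate_dispatch(doc, dispatch_counter)
print(backtracking_counter[0], dispatch_counter[0])  # 1092 6
```

In this toy, the backtracking count roughly triples with every extra nesting level, while the dispatch count equals the number of nodes. Whether the real validator degrades in the same way is something the profile would have to confirm.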

mr-c commented 5 years ago

https://github.com/mr-c/gdc-dnaseq-cwl/blob/validation_speedup_testing/workflows/bamfastq_align/etl.cwl

Separate files (35 seconds "wall" time)

35.52user 0.82system 0:35.56elapsed 102%CPU (0avgtext+0avgdata 321908maxresident)k
0inputs+8outputs (0major+669503minor)pagefaults 0swaps

vs. 2 minutes 37 seconds for the packed file

$ wget https://gist.github.com/mr-c/8d8597c76cdb9ae11ff8931792d489a8/raw/1107fcec6ac6f814b7d117d6a90b4fa7f7451038/etl.packed.cwl
$ cwltool --validate etl.packed.cwl
157.62user 0.99system 2:37.77elapsed 100%CPU (0avgtext+0avgdata 366736maxresident)k
0inputs+8outputs (0major+680178minor)pagefaults 0swaps
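
For reference, the ratio between the two runs can be worked out from the `elapsed` fields above. This is a small sketch; the helper name `elapsed_seconds` is made up for illustration, and it assumes the elapsed field has the `m:ss.ss` (or `h:mm:ss`) shape shown in the output.

```python
def elapsed_seconds(elapsed: str) -> float:
    """Convert an elapsed field such as "2:37.77" (or "1:02:03") to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


separate = elapsed_seconds("0:35.56")  # separate files
packed = elapsed_seconds("2:37.77")    # packed file
slowdown = packed / separate
print(f"separate: {separate:.2f}s, packed: {packed:.2f}s, slowdown: {slowdown:.2f}x")
# separate: 35.56s, packed: 157.77s, slowdown: 4.44x
```

So for this workflow the packed file takes roughly 4.4 times as long to validate as the separate files.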