common-workflow-language / cwltool

Common Workflow Language reference implementation
https://www.commonwl.org
Apache License 2.0

CWLProv: add option to exclude raw copies of the data #1586

Open mr-c opened 2 years ago

mr-c commented 2 years ago

https://matrix.to/#/!RQMxrGNGkeDmWHOaEs:gitter.im/$AJGFCdt6jVAn3aR5lQ0PK3_0SGgvFrubf5SMClsOgGA (a.k.a https://gitter.im/common-workflow-language/common-workflow-language?at=61d6a7bfbfe2f54b2e04661d )

jjkoehorst commented 2 years ago

Thanks for creating the ticket. For me personally, only the metadata (RDF files, workflow files) is needed, as the input and output files are preserved on a cloud store.

mr-c commented 2 years ago

Areas to investigate (add a flag to skip the copying, but still calculate and store the checksums):

https://github.com/common-workflow-language/cwltool/blob/a1e3449560b964d90818b2f1bfeb9b411415a786/cwltool/provenance.py#L790 https://github.com/common-workflow-language/cwltool/blob/a1e3449560b964d90818b2f1bfeb9b411415a786/cwltool/provenance.py#L929

https://github.com/common-workflow-language/cwltool/blob/a1e3449560b964d90818b2f1bfeb9b411415a786/cwltool/provenance.py#L741 called from https://github.com/common-workflow-language/cwltool/blob/a1e3449560b964d90818b2f1bfeb9b411415a786/cwltool/main.py#L1413

https://github.com/common-workflow-language/cwltool/blob/a1e3449560b964d90818b2f1bfeb9b411415a786/cwltool/provenance.py#L894
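
To illustrate the idea of skipping the copy while still recording the checksum, here is a minimal sketch. It is not cwltool's actual code: the function name `checksum_and_maybe_copy`, the `data_dir` parameter, the choice of SHA-1, and the `<first two hex characters>/<checksum>` layout are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional


def checksum_and_maybe_copy(
    source: Path, data_dir: Optional[Path], block_size: int = 1024 * 1024
) -> str:
    """Return the SHA-1 hex digest of `source` (hypothetical helper).

    If `data_dir` is given, a copy named after the digest is also stored.
    If `data_dir` is None, only the checksum is computed and no raw copy is kept.
    """
    digest = hashlib.sha1()
    # Stream the file in blocks so large inputs are never held in memory.
    with source.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    checksum = digest.hexdigest()

    if data_dir is not None:
        # Assumed layout: <data_dir>/<first two hex chars>/<full checksum>
        target = data_dir / checksum[:2] / checksum
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():
            shutil.copyfile(source, target)
    return checksum
```

Calling `checksum_and_maybe_copy(path, None)` would correspond to the proposed "no raw copies" mode: the digest can still be written into the provenance metadata, but nothing is duplicated on disk.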

jjkoehorst commented 2 years ago

An update on this: when Directory or File inputs are provided, cwltool copies their entire content to /tmp. Solution for now is to use Strings instead of Directory when possible.

mr-c commented 2 years ago

> Solution for now is to use Strings instead of Directory when possible.

FYI, while that may work for now, it will break multi-node execution of the workflow.