broadinstitute / cromwell

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

caching results across engines #4616

Open notestaff opened 5 years ago

notestaff commented 5 years ago

Feature request: allow call caching to work across different engines. Currently, only calls run on the same engine can be reused. But if the task inputs and the docker image hash are the same, results should be reusable across engines. This might require copying files across filesystems; but if there is a Local filesystem, any file can be copied to it first and then on to the target filesystem.
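The idea of an engine-independent cache hit can be sketched as follows. This is a hypothetical illustration, not Cromwell's actual hashing code (Cromwell's real call-caching hash also covers the command template, runtime attributes, and per-backend settings): the key is derived only from the docker image digest and a canonical rendering of the task inputs, so two engines running the same task would derive the same key.

```python
import hashlib
import json

def cache_key(docker_digest: str, inputs: dict) -> str:
    """Hypothetical backend-independent call-cache key.

    Hashes the docker image digest together with a canonical
    (sorted-key) JSON rendering of the task inputs, so the key
    does not depend on which engine/backend ran the task.
    """
    canonical = json.dumps(inputs, sort_keys=True)
    h = hashlib.sha256()
    h.update(docker_digest.encode("utf-8"))
    h.update(canonical.encode("utf-8"))
    return h.hexdigest()
```

With a key like this, a hit found under any engine could in principle be served to any other, modulo copying the cached outputs between filesystems.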

geoffjentry commented 5 years ago

Hi @notestaff - that's not universally true. We've also heard users say that they don't want this to happen for various reasons, e.g. not wanting to pay egress charges when copying across cloud filesystems.

It could potentially become an optional flag (we recently relaxed the rules here for a similar reason), but I don't think this behavior should ever be the global default.

notestaff commented 5 years ago

@geoffjentry I understand re: egress charges. In my use case these aren't an issue, so a flag would still help. Alternatively, make egress cost a config option of each filesystem, and only reuse a cached result if its egress cost would fall under a user-specified limit. You could also drop the requirement that one engine/filesystem serve all tasks: a cached result could be returned from any filesystem where it exists, without copying it to a target filesystem. Workflow inputs could then point to files on different filesystems, and the engine for each job could be chosen automatically based on where its inputs live.
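A config-driven version of this proposal might look like the fragment below. To be clear, none of these keys exist in Cromwell today; this is only a sketch of what a per-filesystem egress cost plus a reuse threshold could look like in Cromwell's HOCON configuration (Cromwell does have a real `call-caching { enabled = ... }` stanza, but everything else here is hypothetical).

```hocon
call-caching {
  enabled = true
  # Hypothetical: allow cache hits produced on other engines/backends.
  cross-engine = true
  # Hypothetical: skip cross-filesystem reuse when the estimated cost
  # of copying the cached outputs would exceed this limit.
  max-egress-cost-usd = 5.0
}

filesystems {
  gcs {
    # Hypothetical: per-GB egress cost used for the estimate above.
    egress-cost-per-gb-usd = 0.12
  }
}
```

Under a scheme like this, the engine would estimate output size times the source filesystem's per-GB cost and only reuse the result when the estimate stays under the configured limit.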