Open mlin opened 2 years ago
I noticed wdl-packager rejects top-level WDLs that import from outside their own directory e.g. import "../../../foo/bar/bas.wdl". I'm concerned that several of the important Broad WDL repos (warp, viral-pipelines, etc.) have settled on that approach (example), so we should try to support it even if we organize our own source code differently...
I am all supporting a wide variety of use cases. But I am not for supporting use cases that are inheritantly unpackagable. There is no intuitive way to solve these problems, so the only solution is one where wdl-packager would behave unintuitively. Having unintuitive behavior causes a lot of friction for developers and users, so that is a big sacrifice to make in my opinion.
As for the broad case. Instead of importing ../../../mytask.wdl
. They could also do import mytask.wdl
. And then ln -s ../../../mytask.wdl
. That fixes their WDL and WDL-packager will simply convert the link into a file. Problem solved.
Maybe we can provide this sort of information in a packaging guide? That would be more helpful than coding around all possible package problems that might occur.
Hmm, you're making me reconsider whether to merge my custom file format! :sweat_smile:
Hmm, maybe wdl-packager can do some rewriting of import statements? i.e. ../../../wdltask.wdl
to wdltask.wdl
if there are no collissions?
That would be useful code to for https:// imports. Those can also be written to a file as well, ensuring reproducibility.
Would that be a solution? It sounds like something WDL-packager can support. Also I can add warnings.warn(WhatAreYouDoingWarning(".. imports in WDL,seriously?!?!?"))
to the wdl-packager code. That sounds like fun :wink:!
Although uncomfortable to change the source, I think it's slightly better than my approach that needs special logic in the source loader to "reinterpret" the import paths. Despite miniwdl not supporting full code generation, we could do a good enough job on the constrained problem of rewriting import
statements based on miniwdl's DocImport
object that records the line/column ranges in the source code.
That would be doable indeed. It would just be an ad verbatim copy except the import statement.
That is if, we think wdl-packager got WDL packaging right the first try. We might want to think about more about that. In principle I think the zip file approach is correct. That is also what Java uses for JAR files, so it can't be that bad. Python packages also, are simply archive files. I prefer zip here, because it can access files randomly from the archive (tar needs complete unpacking). We might want to add some sort of metadata format file which is mandatory, that contains paths to all sorts of other metadata in the package (LICENSE files etc.).
The only downside of ZIP is that it doesn't achieve the best possible compression, because it compresses each file separately without sharing a "dictionary." Usually the difference would be negligible, except that the reason I just got interested in packaging (after letting https://github.com/chanzuckerberg/miniwdl/issues/292 languish for years!) was to send the package in an AWS Batch SubmitJob environment variable, which has a 30 KiB total size limit. It'd be nice not to have to arrange for some S3 bucket or public web server to host the package. Anyway, I admit this is a weird case, it's just factually what got me onto this topic :smile:
wdl-packager now uses uncompressed zip files. Because they are still relatively small compared to checking out an entire repo including test files.
I suppose WDL packager could also use tar files. It's just packing an archive.
-rw-r--r-- 1 rhpvorderman rhpvorderman 43K 31 jan 11:38 germline_v4.1.0.tar.xz
-rw-r--r-- 1 rhpvorderman rhpvorderman 98K 31 jan 11:38 germline_v4.1.0.tar.gz
-rw-r--r-- 1 rhpvorderman rhpvorderman 149K 31 jan 11:38 germline_v4.1.0.deflate9.zip
-rw-r--r-- 1 rhpvorderman rhpvorderman 680K 31 jan 11:30 germline_v4.1.0.zip
-rw-r--r-- 1 rhpvorderman rhpvorderman 720K 31 jan 11:38 germline_v4.1.0.tar
Sorted on size. Tar uncompressed is bigger than zip. zip deflated is already quite small. But tar is indeed 33% smaller with the same deflate compression (level 9). XZ -9 works very well. Less than half the size of gzip.
EDIT: I uses zip initially because that is what cromwell supports. But as soon as tar.xz gets into the spec wdl-packager can also support that. I prefer to support only formats that can be parsed by Python's stdlib. (gzip, bzip2, lzma). Bzip2 is not a suitable candidate in my opinion as decompression speeds are very slow. Lzma works much better in that regard. Also I see that XZ format is supported in the java world as well.
The thought just occurred to me, since my need for maximum compression is kind of a corner case, for that purpose I can simply use XZ on the uncompressed ZIP file. It's probably better to stick with ZIP for general WDL packaging, which will be less surprising to the community.
If it helps I can add a compression level flag to wdl-packager. That should shrink the size somewhat.
The reason the zipfiles are uncompressed by default.
@mlin while writing the package specification I thought of another way to handle ..
style imports.
Given a file called my.wdl
, which lives in /home/jdoe/projects/wdlworkflow
that imports ../wdltasks/mytask.wdl
:
my.wdl
as wdlworkflow/my.wdl
in the zip../wdltasks/mytask.wdl
as wdltasks/mytask.wdl
in the zipmainWorkflowURL
to wdlworkflow/my.wdl
in MANIFEST.json
. Done, and no need to rewrite the source. It won't work for http style imports, but I hope this helps a bit with doing a minimum of import rewriting.
@rhpvorderman I had some similar thoughts early on in looking at this,
It seems possible in principle to embed in the ZIP file, the minimum directory subtree needed to make the relative paths work. The problem with that is the "main" WDL file gets buried in some subdirectory which isn't straightforward to find upon opening the ZIP file.
We could add a manifest file to the ZIP, with the equivalent of Main-Class: from a JAR manifest and that also could hold default inputs if desired. I could make miniwdl read that, but obviously no control over Cromwell. Come to think of it, I wonder how are repos with the above structure sent into Cromwell/Terra anyhow? (Probably by public URL instead of ZIP file?)
I'd still be concerned about producing a zip that wouldn't readily work with Cromwell.
Another thing I was unsure about was where to put the default input JSON (and "additional files" as wdl-packager supports). Do they go in the root or alongside the main WDL? Depending on that, what kind of relative paths does the input JSON use to refer to the additional files? By putting the main WDL in the root, I at least avoided having to make those decisions. :sweat_smile:
Thanks for the great PR on miniwdl btw, I'm a bit tied up the next few days but will send comments asarp.
I'd still be concerned about producing a zip that wouldn't readily work with Cromwell.
Ah, yes that is a very good reason. I can't believe we overlooked that.
Thanks for the great PR on miniwdl btw, I'm a bit tied up the next few days but will send comments asarp.
Take your time. I wrote a spec for WDL packaging too and put it up for discussion on the OpenWDL repo. Packaging is mainly a BioWDL issue, since we have all these modular repositories, so we had a discussion with the team and decided to push for this ourselves. With my scala and python skills I can implement any necessary features for supporting the packaging myself. So my PR on miniwdl is part of that effort. We also plan on hosting the initiial packages.openwdl.org repository that will be a result of this in the long run. Since we will be the primary benefactors.
@rhpvorderman @mlin is this still open for discussion? I would like to make this a standard we can put in OpenWDL
@vsmalladi Sure! The reason this has not had much activity is that currently this would only benefit the LUMC and we also have other things we do besides working on WDL pipelines. So these not very pressing concerns tend to get stuck on the backlog. It would be very nice if more people get involved.
@vsmalladi Also FYI- this repo & thread were the original inspiration for miniwdl zip
which @rhpvorderman also subsequently improved. It does rewrite import statements as needed to match the layout of what it stores in the zip file (it prints a warning if it has to do this)
@vsmalladi I already started on a discussion about packaging: https://github.com/openwdl/wdl/discussions/499
Hi @rhpvorderman @DavyCats
I noticed wdl-packager rejects top-level WDLs that import from outside their own directory e.g.
import "../../../foo/bar/bas.wdl"
. I'm concerned that several of the important Broad WDL repos (warp, viral-pipelines, etc.) have settled on that approach (example), so we should try to support it even if we organize our own source code differently...It seems possible in principle to embed in the ZIP file, the minimum directory subtree needed to make the relative paths work. The problem with that is the "main" WDL file gets buried in some subdirectory which isn't straightforward to find upon opening the ZIP file.
We could add a manifest file to the ZIP, with the equivalent of
Main-Class:
from a JAR manifest and that also could hold default inputs if desired. I could make miniwdl read that, but obviously no control over Cromwell. Come to think of it, I wonder how are repos with the above structure sent into Cromwell/Terra anyhow? (Probably by public URL instead of ZIP file?)Have you thought about this, any other approaches?
(I spent this past weekend educating myself about the packaging problem by implementing an alternative approach that inlines all the WDL source in a YAML file along with a manifest, and relies on dedicated logic in miniwdl's source loader to resolve the paths. However, after finishing it, I think I should just go back to ZIP files.)