jtratner opened this issue 5 years ago
(Another optimization would be to only look for (id, name, property) when looking in the current folder, though that may already be happening:
Found 7 workflows in project-Fb6Pxx00J4fJ38kjJqbpZ0ZK folder=/workflows/jtratner/<snip> (16349 millisec))
For context, I think our standard compiled workflow now has 56 applets and 7 workflows per build. We are creating builds more and more frequently (over the past 6 months we have accumulated 2.6K workflows and 8K applets, and that's ramping up), and we hope to keep using the same project indefinitely for builds, to leverage reuse.
I dug into this yesterday, with the help of the platform team. There are two separate problems, both on the platform side, not in dxWDL.
The platform team has filed bugs for these, and will work on them.
The dxWDL part of this issue has been fixed; optimizations have been implemented for the find-data-objects queries. Therefore, I am closing it here. The remaining work is on the platform side.
I actually meant a slightly different point here. Currently, my understanding is that the reuse component of workflow compilation works as follows: (1) query all the data objects in the project and build an ObjectDirectory from the results, then (2) consult that directory for each applet/workflow being compiled. Now that we have more than 2K applets in our project, step (1) is going to start causing some issues.
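If that's right, step (1) is essentially folding one big paginated query into a checksum-to-object map. A rough Python sketch of that fold (`build_object_directory` is a hypothetical helper for illustration, not dxWDL's actual Scala ObjectDirectory):

```python
# Hypothetical sketch of step (1): fold the results of one project-wide
# find into a checksum -> (object id, name) map. In practice the hits
# would come from something like:
#   dxpy.find_data_objects(project=project_id,
#                          properties={"dxWDL_checksum": True},
#                          describe={"fields": {"name": True, "properties": True}})
def build_object_directory(hits):
    directory = {}
    for hit in hits:
        desc = hit["describe"]
        checksum = desc["properties"]["dxWDL_checksum"]
        directory[checksum] = (hit["id"], desc["name"])
    return directory
```

The cost here scales with the total number of objects in the project, which is exactly the concern.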
I was thinking of a different strategy for projectWideReuse that would potentially be more performant (or at least require less data across the wire) and would work, without pagination, regardless of the size of the project.
properties: {"$or": [{"dxWDL_checksum": "<hash1>"}, {"dxWDL_checksum": "<hash2>"}, ...]}
, etc. Use that to construct the ObjectDirectory. The good part is that this makes the lookup time relative to the size of the workflow rather than the size of the project, and (I believe) the lookups should be relatively quick, because the backend will query for the property first and only then describe the found IDs (my understanding is that there is an index like (project, property, dxID)). This means each request should be (relatively!) small.
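A minimal sketch of the proposed query, built as a /system/findDataObjects payload (the helper name and the exact describe fields are my assumptions; the property syntax follows the API's compound-query form):

```python
# Hypothetical helper: one find scoped to exactly this build's checksums,
# so the result size tracks the workflow, not the project.
def build_checksum_query(project_id, checksums):
    return {
        "scope": {"project": project_id, "recurse": True},
        "properties": {"$or": [{"dxWDL_checksum": h} for h in checksums]},
        "describe": {"fields": {"name": True, "properties": True}},
    }

# The payload would then be sent with something like:
#   dxpy.DXHTTPRequest("/system/findDataObjects", build_checksum_query(...))
```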
I asked the back-end team, and this is worth a try.
It turns out that this isn't so simple to do, because the checksum cannot be computed just from the IR. It also covers referenced data-objects. It can only be calculated incrementally, while building a complex workflow (from the bottom up).
There are two ways I could think of to do this: 1) Query the entire project and build a big map from checksum to data object; this requires one large query at the beginning. 2) As the workflow is built, query the platform for every new applet/sub-workflow generated; this requires many small queries.
Both approaches are suboptimal, and I am not sure which is better. In the meantime, I limited the number of results returned by adding a constraint on the data-object name: it has to be one of the names we are generating, which is known after the IR phase.
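A sketch of that name constraint, assuming (my assumption, not confirmed) that the API's name filter takes a single regexp, so the known generated names get folded into one anchored alternation:

```python
import re

# Hypothetical helper: build a "name" filter for findDataObjects that only
# matches the names this build will generate (known after the IR phase).
def name_constraint(generated_names):
    pattern = "^(" + "|".join(re.escape(n) for n in generated_names) + ")$"
    return {"regexp": pattern}
```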
Let's see if 1.18 is sufficient.
@jtratner, is this version better?
What specifically can't be computed just from the IR? Any data objects should be resolvable prior to compilation, right?
Right. But if you create a new applet, workflow, or data object, it has an unpredictable ID. Let's say that you have a workflow that compiles into applet B, which depends on applet A. B's checksum requires the ID of A. You have to first create A, and then create B.
I'm suggesting that checksums should not include the IDs of the dependencies compiled from the workflow. You could walk the entire IR, looking at dependencies first, and generate each checksum based on the checksums of its dependencies, etc. Then you could use those to define the dxWDL_checksum property.
At each step, you’d check for the value of that property and create only if it does not exist.
No reason to hash in the opaque identifier from DNAnexus.
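A toy sketch of this bottom-up scheme (my own illustration, not dxWDL code): each node's checksum covers its source text and its dependencies' checksums, never a platform-assigned ID, so identical IR yields identical checksums across builds:

```python
import hashlib

def ir_checksum(node, cache=None):
    """Checksum an IR node from its source text plus its dependencies'
    checksums, walking dependencies first (bottom up)."""
    cache = {} if cache is None else cache
    if node["name"] not in cache:
        h = hashlib.sha256(node["source"].encode())
        for dep in node.get("dependencies", []):
            h.update(ir_checksum(dep, cache).encode())
        cache[node["name"]] = h.hexdigest()
    return cache[node["name"]]
```

At each step you would look up the resulting value as a dxWDL_checksum property and create the object only on a miss, as suggested above.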
More recent versions of dxWDL, as well as dxCompiler, constrain the search by the applet names (which are deterministic). Hopefully this has sped up the query.
Right now I still notice that queries for existing applets and files can take a varying amount of time, sometimes quite long. My guess is that some of this has to do with system performance when rendering workflows or applets with long specifications.
Would it make sense to use an in query to limit the response to a smaller set of files? E.g., pseudocode-wise, rather than:
instead do:
I think the complex property filter will still cause the backend to search through all objects, but I think it will limit how many of the found objects get described, thus speeding up the response.
(This is based on my guesses about the overall implementation.)
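To make the idea concrete, a hedged sketch of the kind of combined payload I have in mind (the helper is hypothetical, and I'm assuming the name filter takes a single regexp): the property filter narrows the candidates, while the name filter bounds how many of them the backend has to describe and return.

```python
import re

# Hypothetical combined query for findDataObjects: property filter plus a
# name filter covering only this build's expected names.
def reuse_query(project_id, checksums, expected_names):
    name_re = "^(" + "|".join(re.escape(n) for n in expected_names) + ")$"
    return {
        "scope": {"project": project_id, "recurse": True},
        "name": {"regexp": name_re},
        "properties": {"$or": [{"dxWDL_checksum": h} for h in checksums]},
        "describe": {"fields": {"name": True, "properties": True}},
    }
```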