dnanexus / dxWDL

Workflow Description Language compiler for the DNAnexus platform
Apache License 2.0

Faster lookups for reuse leveraging IR #287

Open jtratner opened 5 years ago

jtratner commented 5 years ago

Right now I still notice that queries for existing applets and files can take a varying amount of time, sometimes quite long. My guess is that some of this has to do with system performance when rendering workflows or applets with long specifications.

Would it make sense to use an `in`-style query to limit the response to a smaller set of objects? E.g., pseudocode-wise, rather than:

applets := findDataObjects(class=applet)
workflows := findDataObjects(class=workflow)

instead do:

applet_checksums := ['C56E42263E2AD139AC92BF6AE0AF4CDA', ...]
applets := findDataObjects(class=applet, properties=[{"dxWDL_checksum": checksum} for checksum in applet_checksums])
workflows := findDataObjects(class=workflow, properties=...)

I think the complex property filter will still cause the backend to search through all objects, but it should limit how many of the found objects get described, thus speeding up response time.

(this is based on my guesses of overall implementation)
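Pseudocode-wise again, the request body for such a property-constrained query might look roughly like the sketch below (in Python for readability; the field names follow the shape of the `/system/findDataObjects` API as used in this thread, and the `describe` fields are an assumption about how to avoid shipping full workflow specs back):

```python
def build_reuse_query(object_class, checksums, project, folder):
    """Hypothetical sketch: a findDataObjects request body that only
    matches objects whose dxWDL_checksum property is one of the given
    values, and describes only the fields needed for the lookup table."""
    return {
        "class": object_class,
        "scope": {"project": project, "folder": folder, "recurse": True},
        # Match any of the known checksums, instead of every object.
        "properties": {"$or": [{"dxWDL_checksum": c} for c in checksums]},
        # Ask only for id/name/properties, not the full (large) spec.
        "describe": {"fields": {"id": True, "name": True, "properties": True}},
    }

query = build_reuse_query(
    "applet",
    ["C56E42263E2AD139AC92BF6AE0AF4CDA"],
    "project-Fb6Pxx00J4fJ38kjJqbpZ0ZK",
    "/",
)
```

This is a sketch of the idea, not the platform's actual schema; the point is that the property filter bounds the result set before any describing happens.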

jtratner commented 5 years ago

(Another optimization would be to only look for (id, name, property) when looking in the current folder, though that may already be happening: Found 7 workflows in project-Fb6Pxx00J4fJ38kjJqbpZ0ZK folder=/workflows/jtratner/<snip> (16349 millisec))

jtratner commented 5 years ago

For context, I think our standard compiled workflow now has 56 applets and 7 workflows per build. We are creating builds more and more frequently (over the past 6 months we have accumulated 2.6K workflows and 8K applets, and that's ramping up), and we hope, in general, to keep using the same project indefinitely for builds to leverage reuse.

orodeh commented 5 years ago

I dug into this yesterday, with the help of the platform team. There are two separate problems, both on the platform side, not in dxWDL.

  1. If the folder you are searching in is "/", the project root folder, the database search will recurse into the entire project.
  2. The workflows generated by dxWDL are big, much larger than was envisioned a few years ago, when all workflows were written by hand. We think that the queries for workflows return the entire metadata on the back-end, not just the specified fields. Naturally, this is slow for large workflows. If you have lots of large workflows, this is even worse.

The platform team has filed bugs for these, and will work on them.

orodeh commented 5 years ago

The dxWDL part of this issue has been fixed; optimizations have been implemented for the find-data-objects queries. Therefore, I am closing it here. The platform-side problems remain open with the platform team.

jtratner commented 5 years ago

I actually meant a slightly different point here. Currently, my understanding is that the reuse component of workflow compilation functions as follows:

  1. Compiles WDL to an internal IR
  2. Grabs the first 1000 applets and first 1000 workflows that have a dxWDL_checksum property and creates a big table of digest => executable ID (ObjectDirectory)
  3. Iterates through the IR, calculating the checksum for each applet or workflow on the fly and then seeing if it's present in the object directory.

Now that we have more than 2K applets in our project, step (2) is going to start causing issues.

I was thinking of a different strategy for projectWideReuse that would potentially be more performant (or at least require less data across the wire) and would work, without pagination, regardless of the size of the project.

  1. Compile WDL to internal IR
  2. Walk through IR, calculate digests for all items, collect digests into an array for workflows and array for applets.
  3. Decide on some batching size (maybe 10?) and make requests to findDataObjects with properties: {"$or": [{"dxWDL_checksum": "<hash1>"}, {"dxWDL_checksum": "<hash2>"}, ...]}. Use that to construct the ObjectDirectory.
  4. Only rebuild applets or workflows not found.

The good part is that this makes the lookup time relative to the size of the workflow rather than the size of the project, and (I believe) the lookups should be relatively quick, because the backend will query for the property first and only then describe the found IDs (my understanding is there's an index like (project, property, dxID)). That means each request should be (relatively!) small.
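The batched strategy above could be sketched like this (Python for brevity; `find_by_checksums` is a hypothetical stand-in for a findDataObjects call constrained by dxWDL_checksum properties, returning a mapping from digest to executable ID for the digests it finds):

```python
BATCH_SIZE = 10  # the batching size suggested above

def build_object_directory(digests, find_by_checksums):
    """Resolve digests to executable IDs in small batches, so each
    request stays bounded by the workflow size, not the project size."""
    directory = {}
    for i in range(0, len(digests), BATCH_SIZE):
        batch = digests[i : i + BATCH_SIZE]
        directory.update(find_by_checksums(batch))
    return directory

def plan_builds(ir_digests, directory):
    """Only rebuild applets/workflows whose digest was not found."""
    return [d for d in ir_digests if d not in directory]
```

The names and shapes here are illustrative, not dxWDL's actual ObjectDirectory code; the point is that the number of requests scales with the compiled workflow, and no pagination over the whole project is needed.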

orodeh commented 5 years ago

I asked the back-end team, and this is worth a try.

orodeh commented 5 years ago

It turns out that this isn't so simple to do, because the checksum cannot be computed just from the IR. It also covers referenced data-objects. It can only be calculated incrementally, while building a complex workflow (from the bottom up).

There are two ways I can think of to do this: 1) Query the entire project and build a big map from checksum to data object; this requires one large query at the beginning. 2) As the workflow is built, query the platform for every new applet/sub-workflow generated; this requires many small queries.

Both approaches are suboptimal, and I am not sure which is better. In the meantime, I limited the number of results returned by adding a constraint on the data-object name: it has to be one of the names we are generating, which is known after the IR phase.
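That name constraint could be sketched as follows: collect the (deterministic) names produced by the IR phase and turn them into a single filter for the query. The regular-expression filter shape is an assumption here, shown only to illustrate the idea:

```python
import re

def name_constraint(generated_names):
    """Hypothetical sketch: build a name filter matching exactly the
    applet/workflow names emitted by the IR phase, so the query never
    touches unrelated objects in the project."""
    escaped = sorted(re.escape(n) for n in generated_names)
    return {"regexp": "^(" + "|".join(escaped) + ")$"}
```

The filter grows with the compiled workflow, not the project, which is what bounds the result set.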

Let's see if 1.18 is sufficient.

orodeh commented 5 years ago

@jtratner, is this version better?

jtratner commented 4 years ago

What specifically can't be computed just from the IR? Any data objects should be resolvable prior to compilation, right?

orodeh commented 4 years ago

Right. But if you create a new applet, workflow, or data object, it has an unpredictable ID. Say you have a workflow that compiles into applet B, which depends on applet A. B's checksum includes the ID of A, so you have to create A first, and only then create B.

jtratner commented 4 years ago

I'm suggesting that checksums should not include the IDs of the dependencies compiled from the workflow. You could walk the entire IR, visiting dependents first, and generate each checksum from the checksums of its dependents, and so on. Then you could use those values to set the dxWDL_checksum property.

At each step, you’d check for the value of that property and create only if it does not exist.

No reason to hash in the opaque identifier from DNAnexus
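The scheme being proposed is essentially a Merkle-style digest: each node's checksum is derived from its own definition plus the checksums (not the IDs) of its dependents, so everything is computable from the IR alone, before anything is created on the platform. A minimal sketch, where the hashing details and field names are illustrative rather than dxWDL's actual digest:

```python
import hashlib

def checksum(node, dep_checksums):
    """Digest of a node from its own spec plus its dependents' digests."""
    h = hashlib.md5()
    h.update(node["spec"].encode())
    for c in sorted(dep_checksums):  # order-independent over dependents
        h.update(c.encode())
    return h.hexdigest().upper()

def walk_ir(nodes, deps):
    """Compute checksums bottom-up, dependents first.

    `nodes` maps name -> {"spec": ...}; `deps` maps name -> [dep names].
    Assumes the dependency graph is acyclic.
    """
    done = {}
    def visit(name):
        if name not in done:
            child_sums = [visit(d) for d in deps.get(name, [])]
            done[name] = checksum(nodes[name], child_sums)
        return done[name]
    for name in nodes:
        visit(name)
    return done
```

A change to applet A's spec changes A's checksum, which changes the checksum of any workflow B that depends on it, so reuse decisions stay correct without ever hashing a platform ID.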


jdidion commented 3 years ago

More recent versions of dxWDL, as well as dxCompiler, constrain the search by the applet names (which are deterministic). Hopefully this has sped up the query.