danbills commented 6 years ago

What happens

When a large workflow is queried for metadata, cromwell spends a considerable amount of time preparing the repsonse. This usually results in a timeout for the caller. In some cases, the preparation is so expensive that Cromwell either runs out of memory or enters a zombie-like state(#4105).

What should happen

The caller should receive a timely response, and Cromwell should not be endangered by operations on large workflows.

Speculation: Construction of result

The result is constructed in a two-phase manner: gather all the data, then produce a structured response.

This is done for two reasons:

Unstructured metadata is difficult for a human to understand.
There are possibly many duplicates due to the way restarts are handled.

Recommendation

~Stream results (using doobie SQL library?) and construct response while gathering data. This should mean that a large pool of data is never present in memory, only the current result set and the partial response.~

Not streaming for now. Instead going to foldMap large sequence into Map monoid, then combine all those maps together into a final result.

There is some manipulation to be done after combining a result.

Sort calls by time
Prune duplicates by taking the most recent. This has some special cases that need to be considered.

Speculation: Database table

The metadata table is currently an unindexed monster, comprising 10^6 - 10^9 rows and between 2-3 TB of data. The query has historically been surprisingly performant but is likely going to degrade over time.

Recommendation

punt on DB changes

Believe to be related to #4093 and #4105

Horneth commented 6 years ago

FWIW Slick also supports streaming

geoffjentry commented 6 years ago

@Horneth be careful with getting excited about that. The few times I thought that it should solve some problem I had it turned out the answer was "Slick streaming doesn't work like that". Not to say that'll happen here, but my success rate has been low here :)

Horneth commented 6 years ago

I didn't know that qualified as me being excited but good to know 😄

geoffjentry commented 6 years ago

As I thought about it more my use cases were all different than the simple "make a query and stream the results" which is likely what we'd want here so my skepticism is probably unfounded :)

Horneth commented 6 years ago

Investigation update

This branch contains code attempting to stream metadata events from the database and build the json as events arrive. It does not stream the json itself back to the endpoint. The whole json is still built in memory and then returned (see the end for thoughts on that).

Results:

The good

Streaming the data from the database has a positive effect on memory (left CPU, right heap) The below graphs represent Cromwell's activity when it's building a metadata of around 2.8M metadata events.

Building metadata without streaming: screen shot 2018-10-17 at 11 16 32 am

We can see that memory builds up throughout the process of generating the JSON, with a larger burst towards the end. CPU activity is inexistent until the very end where a lot of CPU resources are needed to go through all the events and build the json.

Building metadata with streaming: screen shot 2018-10-17 at 10 08 08 am

In contrast, here there is moderate CPU activity throughout the process, as well as lots of a much more sawtooth-looking heap graph, indicating that objects are getting GCed a lot. The max memory used is also smaller than for the non streaming version.

Using a streaming approach allows the stream to be stopped at any point in time (say if we ran over the endpoint timeout). Note that even without streaming data from the database, we can still build the json from the strict set of events using an fs2 stream and stop that if/when needed.

Another graph where Cromwell was asked to build several large metadata jsons:

screen shot 2018-10-19 at 1 17 28 pm Red is non streaming, blue is streaming.

The main takeaway is that when under memory pressure (i.e when available memory is insufficient to build the requested metadata), streaming makes a significant difference on relieving the heap usage for medium to large (> 100K) metadata.

The less good

Response time is not as good

The use cases above were specifically targeted towards trying to build large to very large metadata. However when used in a more realistic scenario with lots of small sized metadata and few large ones, the overall response time is increasing significantly. If Cromwell has sufficient memory to sustain the load then streaming does not give any real improvement. The graph below shows memory usage with (v1s) and without streaming (v1) when Cromwell has enough memory to build all requests (in MB). memory-v1-v1s

The graph below shows the average response time of the metadata endpoint with and without streaming (in ms). metadata-200-v1-v1s

A plausible explanation of the response time increase is that the connection to the DB needs to remain open (and can't be re-used) for as long as the stream is not closed. This includes time spent pulling data out of the database AND building the JSON. Whereas in the non streaming version, the connection can be re-used for another query as soon as all the data has been pulled and Cromwell is building the metadata. The extra time spent with the connection used in the streaming version can then delay subsequent requests when lots of metadata requests are being made. We also see that the graph spans longer on the X axis for the streaming version, which means the test took longer to complete. The test consists of sending a lot of metadata requests to Cromwell.

Thoughts, possible next steps and/or things to try

I think the fs2 stream model is still interesting as it allows for a clean interruption of building of the metadata (with or without streaming from the database).
It might be possible to choose between streaming and non streaming depending on the size of the metadata to build (would require a COUNT(*) beforehand)
It might be possible to order the database request (for instance if the query was sorted by call fqn, metadata key and timestamp) in such a way that the json can be built: 1) Directly, i.e without need for the wrapping MetadataComponent object to maintain information about indices in the list and CRDT (which would reuse memory usage and possible build time) 2) Piece by piece and be streamed back to the requester
Building the metadata and storing it is always a possibility. Things to consider are
- When are we sure it's complete ?
- We'd still want the ability to use includeKey and excludeKey query parameters

davidangb commented 5 years ago

you've done a bunch of investigation on this already, but adding for posterity: the metadata endpoint is particularly vulnerable to the joint-calling use case. In this use case (and in similar workflows), calls can scatter widely, and each call can have many inputs, each of which is a substantial value. So, calls scatter inputs * value length makes for a lot of data.

broadinstitute / cromwell

Slow, expensive metadata endpoint #4124

What happens

What should happen

Speculation: Construction of result

Recommendation

Speculation: Database table

Recommendation

Investigation update

The good

The less good

Thoughts, possible next steps and/or things to try