Research running queries in parallel across cores

Moocar commented 6 years ago

Summary

Suggestion for speeding up query execution using multiple cores.

The Problem(s)

The Query execution phase of a Gatsby build is often the slowest for a variety of reasons:

Plugins can add fields to types that have custom resolvers (gatsby-source-filesystem)
- Resolvers might perform slow side effects like http requests or fs operations (e.g source-filessystem publicURL() resolver)
- plugin field values are lazy. So when supplying args to a graphql query, we must resolve all nodes of their type (more below)
There is usually a query per page. So sites with lots of pages have many queries to execute
sift query algorithm is naiive and iterates through entire node collection until result is found (TODO: Create issue to research using in-memory DB)

Solution

Run queries across multiple cores. Queries do not depend on each other, so there is potential to run them in parallel. In a sense, we already do this by using better-queues's concurrency option. But this only achieves async concurrency. We're still locked to a single core because node.js is single threaded. But libraries like jest-worker make spinning up extra node processes easier.

The solution would involve creating a number of jest workers before we start consuming from the query-queue.js. Each worker would have its own copy of the redux nodes namespace and so in theory would have everything it needed to execute queries. Then, the implementation of query-runner.js would be handled by these jest workers who would be given batches of queries to execute. Very similar to how HTML generation works now.

Complications

Cached plugin field resolver's values

Fields provided by plugins during setFieldsOnGraphQLNodeType must be resolved before the query can be run. Since this would be infeasibly slow to run for every query, aggressive caching is performed. Since we can't share memory, then each worker will have to keep its own cache. Another option would be to use a centralized cache in the main process, or in something like redis. The three caches used in run-sift.js are:

resolvedNodesCache - All fully resolved nodes. key = (type, nodes.length, query filter args)
enhancedNodeCache :: A single fully resolved (recursively finished) node. key = (node, query filter args)
enhancedNodesPromiseCache :: A current unfinished promise that's still executing (since it's not in enhancedNodeCache yet). key = (node, query filter args)

Increased Memory Footprint

For large sites, the nodes redux namespace is the largest. Each time we start a worker, we have to copy it to the new process. We could very quickly run out of memory. And it will only get worse the bigger the site is, which annoyingly is the exact use case we're trying optimize for.

This is amplified by the caching problem above. If we have to start maintaining multiple copies of caches too, memory usage will only increase.

There are many possible solutions to this. One would be to use heuristics to determine how many workers to create. For a site that has many queries, but not too many nodes, then multiple workers will work great. Whereas a site with a huge nodes namespace but fewer queries won't see a major increase in performance and will run out of memory quicker. We may be able to use these metrics to figure out when to run queries in parallel.

Resolvers record page dependencies

Almost every Graphql resolver records a page dependency if a result is found. This results in an action being reduced into redux componentDataDependencies namespace. If query workers are in separate processes, they won't have access to that redux store. Some solutions might be:

1. Return dependencies that need to be recorded with result

Rather than recording the dependency within the resolver gqlType, create a data structure that is somehow returned with the query result. Then in the main process, we can iterate through that list and save the new dependencies. I'm not sure how we would communicate that datastructure up through all the layers of graphql. We could also expose a queue or datastructure in the context that the resolvers can directly edit directly.

2. Figure out how to move dependency recording out of resolvers

Every resolver must explicitly record a page dependency if a result is found. This occurs all throughout the code base. And plugins authors must remember to do this too. Instead, there might be a way to record the dependency by iterating through the query result (no idea if it's possible).

Node Tracking

In run-sift.js, we resolve all query args (to realize custom plugin fields). Whenever a new field value is resolved, we track its dependency back to its root source node. This root node tracking is stored in the node-tracking.js:rootNodeMap global var. Out of process workers will not have access to this var.

A solution to this would be similar to what we figure out above for recording page dependencies.

Sharp Processing is still high cost

The sharp plugin resolver is one of the heaviest users of CPU and is already optimized to use all cores. So sites whose query phase is mostly taken up by image processing wouldn't see a noticeable increase from parallel query workers.

Global state in custom plugin fields

Plugins that perform heavy processing (e.g sharp, squip, markdown) have their own global state. Most are local caches and in memory queues. These caches will be built from scratch in each worker which is not ideal.

This becomes a problem if the same node field gets resolved over and over again. Firstly, the caches on each worker will contain copies of the same data, leading to increased memory. But the bigger issue is that cached promises won't be shared. Imagine a plugin's field resolver takes 2s to execute. And 4 queries reference that same node. If all 4 of those queries execute at the same time and are split across 4 workers, then each core will repeat the work. Whereas right now, they can reference the same shared memory to check if an operation is already executing.

custom plugin fields can call actions

transformer-sharp calls actions.createJob() which updates redux to track a job. This won't work if called from a separate process. Any custom plugin field resolver can call arbitrary functions.

One possible solution here is to change how actions are bound so that their datastructures are sent back to the central process via Inter Process Communication (IPC) and there the actions take over. But we'd have to make sure that transactionability was still maintained.

KyleAMathews commented 6 years ago

@m-allanson @pieh @DSchau @rase- and I met to investigate this issue this morning.

We identified some low-hanging fruit @DSchau will work on today: https://github.com/gatsbyjs/gatsby/issues/7373#issuecomment-423589800

Next step is to investigate why graphql query running slows dramatically for certain sites e.g. https://github.com/herotc/bfa.herodamage.com. There's probably some easy ways to speed them up.

Then next, we discussed your research @Moocar and it seems that before we can go multi-process, we need to figure out a shared cache strategy as otherwise, copying data around + the duplication of work + caches would eat up most of the gains from going multi-process.

Moocar commented 6 years ago

Good stuff. I agree that a shared cache strategy is needed. There's so much ad hoc caching going on in Gatsby.

I'm currently researching what would be required to use an in-memory DB. I think there will be significant gains there.

CanRau commented 6 years ago

Might my issue here somehow be related https://github.com/gatsbyjs/gatsby/issues/8330 ? Cause the error is mentioning jest-worker or is it used elsewhere? From this comment it doesn't even seem to be integrated yet though^^

Moocar commented 6 years ago

@CanRau this is just research for now. I haven't written any code around this yet so it won't be related to your issue.

CanRau commented 6 years ago

yeah that's what I meant in my last line^^ but it still sounded related^^

gatsbot[bot] commented 5 years ago

Old issues will be closed after 30 days of inactivity. This issue has been quiet for 20 days and is being marked as stale. Reply here or add the label "not stale" to keep this issue open!

gatsbyjs / gatsby