Discovery: test content build with 48k nodes on Jenkins

department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:

https://depo-platform-documentation.scrollhelp.site/index.html

283 stars 204 forks source link

Discovery: test content build with 48k nodes on Jenkins #24022

Closed alexandec closed 3 years ago

alexandec commented 3 years ago

Details

Previous testing with 48k nodes was done locally. For the most realistic test we need to run the content build with 48k nodes on Jenkins. Run the build multiple times to surface intermittent errors. Some areas to note:

Peak memory usage
Total build time
Tinyliquid timeouts, if any
Any warnings or failures

Acceptance Criteria

[x] Content build is tested on Jenkins with 48k+ nodes
[x] Memory use, build time, and any errors are documented here

alexandec commented 3 years ago

Was encountering consistent tinyliquid timeouts which prevented the build from completing. Determined that timeout was still at 2 minutes even though we were attempting to set it to 20 minutes. After diving into the tinyliquid and metalsmith-layouts code, I arrived at a new approach which works like this:

const originalRun = liquid.run;
  liquid.run = (astList, context, callback) => {
    // eslint-disable-next-line no-param-reassign
    context.options.timeout = 1200000;
    originalRun(astList, context, callback);
  };

This successfully sets the timeout to 20 minutes. Now I can test and benchmark with 48k nodes on Jenkins.

alexandec commented 3 years ago

Testing with 49k nodes. I'm seeing more Jenkins errors with this new test data but I haven't figured out why. Errors include:

[2021-05-05T21:38:03.616Z] Cannot contact EC2 (jenkins-cloud) - vetsgov-general-purpose-1b (sir-r2g1b3xh): java.lang.InterruptedException

And also:

[2021-05-05T23:47:51.843Z] wrapper script does not seem to be touching the log file in /home/jenkins/workspace/_24022-jenkins-48k-nodes-testing@tmp/durable-9bce82a0
[2021-05-05T23:47:51.843Z] (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

And finally this, which seems to indicate an out of memory error:

[2021-05-05T23:15:23.853Z] npm ERR! errno 137

I'm seeing a lot of virtual memory use in the local build, which I'm guessing is causing the problems on Jenkins. The local build gets killed near the end:

Step 47 of 55 start: Rewrite VA domains for the buildtype
error Command failed with signal "SIGKILL".

At that point the local build is using about 3gb of real memory and about 50gb of virtual memory. So I think VM is the issue.

Need to figure out why this is happening with the new 49k node test data. It was not occurring with the old 48k node test data.

alexandec commented 3 years ago

I tried using a couple new flags which increase garbage collection frequency:

NODE_TLS_REJECT_UNAUTHORIZED=0 node --max-old-space-size=8192 --trace_gc --optimize_for_size --gc_interval=100  script/build-content.js

This successfully reduced peak memory use, but also increased build time by about 2x. Based on the --trace_gc output there was just too much scavenging happening. But I think by increasing gc_interval and/or setting --max-old-space-size to a lower value, we can find a better balance of garbage collection and build time, while preventing Jenkins from running out of memory. I'll experiment further with these changes.

alexandec commented 3 years ago

Currently experimenting with just setting --max-old-space-size lower. Too low, and the build fails because it runs out of heap space. Too high, and the build fails because the EC2 instance runs out of memory and kills the process.

If I can find just the right setting for the amount of memory available, then V8 will perform only minimal garbage collection most of the time. As memory use nears the limit, GC frequency will automatically increase, dropping memory use back down. That should be a lot more efficient than forcing frequent garbage collection all the time with --gc_interval.

alexandec commented 3 years ago

Talked with ops to see if they had any ideas. They provided a Prometheus graph which shows available memory percentage for the Jenkins nodes. I also added some code to log memory statistics in more detail:

 const logProcMeminfo = () => {
  const command = 'cat /proc/meminfo';
  exec(command, (err, stdout, stderr) => {
    if (err) {
      console.log('error: ', err);
      return;
    }

    console.log(`${command}:\n${stdout}`);
    if (stderr) console.log(`stderr: ${stderr}`);
  });
};

Comparing this output to the Prometheus graph of the Jenkins node free memory, I could see that they matched. When build starts, Jenkins reports about 80% of its memory as free (about 25gb of 32gb total memory). But when the Drupal data is loaded into Metalsmith, this drops to about 20% free (about 6gb). As the build continues, free memory fluctuates but eventually drops to around 10% free (3gb). At this point the build fails with an error like this:

Step 45 of 54 start: Apply layouts 0-9.-_~!$&'()*+,;=:@
Error: spawn ENOMEM
    at ChildProcess.spawn (internal/child_process.js:403:11)
    at spawn (child_process.js:553:9)

Or an error like this:

Killed
npm ERR! code ELIFECYCLE
npm ERR! errno 137

Both errors indicate insufficient memory (either physical or virtual). The issue appears to be that with 3 builds running simultaneously (dev, staging, and prod), each build eventually uses almost 10gb of memory (about 30gb total), leaving only 2-3gb left on the EC2 instance which runs Jenkins. Jenkins then kills one or two of the builds to free up memory, leaving the remaining build(s) to finish normally. Here's what the graph looks like:

The confusing part is why so much memory is used by the builds. There's only 1gb total of Drupal query data. The Drupal assets probably add to this. But we shouldn't see 10gb of memory use per build. The number should be more like 3-4gb based on the JS heap size diagnostics I've been seeing. My attempts to improve garbage collection and limit heap size have not been able to fix the problem. I want to investigate further to determine the cause of all this extra memory use.

alexandec commented 3 years ago

I added some new memory diagnostics to better understand why memory is running out. heapUsed is the amount of the V8 heap currently in use. This is the memory stat we were already displaying with the --verbose option. heapTotal is the total amount of V8 heap allocated (it may not all be currently used). rss is Resident Set Size which is the total amount of RAM currently used by the nodejs process, including the heap but also the stack and any other code in use.

Near the beginning of the build we have:

[2021-05-12T19:09:41.840Z] heapUsed: 41.2mb
[2021-05-12T19:09:41.840Z] heapTotal: 62.84mb
[2021-05-12T19:09:41.840Z] rss: 106.44mb

No issues here, though you can see that heapUsed doesn't tell you the whole story. rss (total memory used) is more than double the heapUsed value. After drupal data is piped into metalsmith we have:

[2021-05-12T19:17:49.509Z] heapUsed: 2185.29mb
[2021-05-12T19:17:49.509Z] heapTotal: 3730.55mb
[2021-05-12T19:17:49.509Z] rss: 4928.78mb

We've fetched about 1gb of JSON data, placed it in an object, and started metalsmith running on that object. The heap used (2.2gb) is about twice the amount of actual data. rss, in turn, has already grown to more than twice the heap size. At this point we are consuming 15gb of RAM (5gb x 3 separate builds) on the Jenkins EC2 instance, which is about half the available RAM.

When the build completes we have:

[2021-05-12T20:13:00.331Z] heapUsed: 2425.47mb
[2021-05-12T20:13:00.331Z] heapTotal: 3031.91mb
[2021-05-12T20:13:00.331Z] rss: 10695.59mb

The heap is still at a reasonable size. However total memory used (rss) for all 3 builds would be slightly over 32gb which is the entire amount of RAM available on the EC2 instance. By this point Jenkins has already killed one or two of the builds due to memory exhaustion.

Next step is to investigate what is being stored in non-heap memory. This will determine how to address 7gb difference between heapTotal and rss size.

alexandec commented 3 years ago

I added a couple new memory diagnostics and displayed them automatically at the end of each step, like this:

[2021-05-13T20:31:24.120Z] Step 46 of 54 start: Rewrite VA domains for the buildtype
[2021-05-13T20:32:20.303Z] Step 46 end [54860ms]: Rewrite VA domains for the buildtype
[2021-05-13T20:32:20.303Z] Starting memory: 2036.38mB
[2021-05-13T20:32:20.303Z] Ending memory: 2001.7mB
[2021-05-13T20:32:20.303Z] Delta: -34.68mB
[2021-05-13T20:32:20.303Z] Memory stats:
[2021-05-13T20:32:20.303Z] - rss: 8985.42mb
[2021-05-13T20:32:20.303Z] - heapTotal: 3157.96mb
[2021-05-13T20:32:20.303Z] - heapUsed: 2001.74mb
[2021-05-13T20:32:20.303Z] - external: 2943.66mb
[2021-05-13T20:32:20.303Z] - arrayBuffers: 73.55mb

RSS is the number that's most important since that's the actual memory use in the EC2 instance. I tried commenting out two steps:

Rewrite VA domains for the buildtype
Parse a virtual DOM from every .html file and perform a variety of DOM sub-operations on each file

and the build succeeded with a peak of about 20gb total memory used (3 builds * 6.5gb used per build). With those two steps in place, rss increased to >9gb per build and the instance ran out of memory. In particular the Rewrite VA domains step increased rss by about 2gb, and it did not decrease after the step completed. This step is pretty straightforward and should not result in the 2gb per build increase we are seeing. I'll investigate it further.

alexandec commented 3 years ago

I implemented manual garbage collection every 10 seconds like this:

setInterval(() => {
  global.gc();
}, 10 * 1000);

This forces a mark-sweep which stops execution briefly and cleans up all objects not referenced anywhere. It's slow but thorough and it keeps garbage from accumulating. However I found that this gets paused temporarily when the code is executing in a really tight loop, so I added additional GC in the Rewrite VA domains and Parse a Virtual DOM steps like this:

let itemsSinceCall = 0;
Object.keys(files).forEach(fileName => {
  // Code to perform replacements in file contents goes here
  // ...
  itemsSinceCall++;
  if (itemsSinceCall > 100) {
    itemsSinceCall = 0;
    global.gc();
  }
}

These additional garbage collection methods freed up enough memory for the build to complete without Jenkins killing it.

Here was the command line I used:

NODE_TLS_REJECT_UNAUTHORIZED=0 node --max-old-space-size=5000 --expose-gc --trace_gc script/build-content.js

With all the above in place, Jenkins successfully completed runs for all three environments simultaneously. Here are the stats:

Total content build time: 61 minutes
Peak memory use: 26.7gb (8.9gb per build * 3 simultaneous builds)

I'll write up additional findings and recommendations in a separate comment.

alexandec commented 3 years ago

Buffer garbage collection

Even with the increased manual garbage collection described above, we are still seeing memory use which is greater than expected. I created a small piece of code to investigate buffer garbage collection:

const showMemory = msg => console.log(msg, process.memoryUsage().rss / (1024 * 1024));

function run() {
  showMemory('before');
  for (let i = 1; i <= 25000; i++) {
    Buffer.from('*'.repeat(100000));
  }
  showMemory('after');
}

run();

This just runs Buffer.from() repeatedly without assigning the result to anything. In theory there should be no effect on memory use. However the results show that RSS (total memory usage) increases significantly (values in mb):

before 18.87890625
after 2475.80078125

It appears that garbage collection is not happening so the unused buffer allocations persist and grow over time. I tried introducing a 1ms wait every 100 iterations to give the garbage collector a chance to run:

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
async function run() {
  showMemory('before');
  for (let i = 1; i <= 25000; i++) {
    Buffer.from('*'.repeat(100000));
    if (i % 100 === 0) await sleep(1);
  }
  showMemory('after');
}

The result is a huge improvement in memory use:

before 18.7578125
after 27.23828125

I added the same 1ms wait after 100 iterations to the "Rewrite VA Domains" and "Parse a virtual DOM" steps. This reduced peak memory use by about 2gb per build.

alexandec commented 3 years ago

Benchmarks (Jenkins, 48k nodes)

With manual garbage collection in place, downloaded files not stored, and debug info turned off (see branch):

Total content build time: 58 minutes
Total Jenkins run time including build, test, prearchive, and archive steps: 67 minutes
Peak memory use: 8gb per build
Total peak memory use (3 simultaneous builds + OS): 24.5gb.

Actual memory usage and build time may vary a bit with full implementation as described below, but these numbers give us a good approximation of what to expect at 48k nodes.

Here's a Prometheus graph of total free memory throughout the build process, as observed on the Jenkins node (EC2 instance):

The V8 heap varies between about 2gb and 4.5gb throughout the build process. Additional memory use is mostly accounted for by buffer allocations, which are not stored in the heap. Metalsmith stores all file contents in buffers, like:

files[path].contents = Buffer.from('<!DOCTYPE html>...');

This mean non-heap memory use is greater than heap use in our case. The heap only stores the data fetched from GraphQL in one huge files object. Buffers store templates populated with that data, including additional text, sidebars, headers/footers, etc. Buffer use grows continually as file contents are added during the "Apply layouts" steps. This is visible as a continual downward trend in the graph, with some fluctuations as the heap grows temporarily, then is reduced to baseline by garbage collection.

At peak use in this graph we have about 7.5gb free, which is 2.5gb free per build. Usable free memory is a bit less in practice because Jenkins or the OS would kill one or more builds before reaching zero free memory.

GraphQL

Total GraphQL query time with a warm cache was about 1 minute. The above benchmarks were performed with a warm cache. With a cold cache, total query time was about 25 minutes. Server used was test.dev.cms.va.gov so query times may not be representative of a production server with 48k nodes. But GraphQL query time may become an issue at 48k nodes.

General notes

We are pushing the limits with both build time and total memory use. We need build times under an hour to allow hourly content releases. And the total memory available on our Jenkins EC2 instances is 32gb. Some of this memory is used by the OS (about 700mb). With three builds running simultaneously, we have about 10gb to spare at peak, or roughly 3gb per build with debug information removed. Adding back the debug information may increase peak usage (debug information was removed during this testing).

Implementing manual garbage collection reduced peak memory use, allowing the builds to complete on Jenkins. However this also increased execution time by about 10 minutes. There are limits to trading off build time for reduced memory use, since we are close to our 1-hour max already.

To reach node counts beyond 48k we will need to aggressively target both memory use and build time for potential improvements. The most obvious improvements are already in place, so future improvements will be more difficult to find and implement. Some changes may involve targeting Metalsmith itself, or Metalsmith plugins like metalsmith-layouts and metalsmith-permalinks. Another area of focus could be replacing DOM-based changes with simpler text replacements.

Recommendations

To enable 48k node content builds on Jenkins, we should implement the following:

Increase tinyliquid timeout to avoid timeouts with large amounts of data (#23667)
Add manual garbage collection to keep heap size and total memory use in check (#25074)
Stop storing downloaded file contents in the files object -- store contents on disk only (#25088)
Add debug info at the last possible step in the build process to keep memory use small throughout the build (#25100)

drorva commented 3 years ago

Lots of good work and I love the level of detail you have as you make progress including the code samples. Nothing too surprising, yes manipulating GC can provide gains, but there's a price :-). G8 as a rule is very optimized and I expect the GC to be generally very good though for specialized cases you can get gains as you found out above.

Questions and suggestions

How do 50k nodes = 1GB? Are nodes really 20k on average or are there additional artifacts?
We are using multiple cores on the machine? That can make a huge difference.
It might sound counter-intuitive, but part of the problem is that metalsmith is doing everything in memory. The good news is that the code is tiny so it's pretty easy to determine that it uses the "ware" package and it should seems like it shouldn't be too complicated (famous last words :-)) to change ware to use a file name instead of the actual object and save and read to file between steps. I suspect that if you do that, you'll find the memory footprint shrinking, but the time going up, possibly significantly. In that case you might want to experiment with keep some steps in memory and only saving and reading from disk at specific steps.
When all else fails, we can look at using a beefier machine. The problem with that is it's a limited solution. We can only scale it so much.

mchelen-gov commented 3 years ago

@drorva yeah i think it's true about metalsmith loading all the data into memory? maybe https://github.com/axa-ch/metalsmith-incremental would help with this though idk what the implementation difficulty is

alexandec commented 3 years ago

Thank you @drorva! Glad all the information was useful. I like your suggestions! Here are some responses:

How do 50k nodes = 1GB? Are nodes really 20k on average or are there additional artifacts?

It looks like some nodes are small but some are quite large. For example, the GraphQL data for https://staging.va.gov/butler-health-care/ is 364k in size. It contains 564 separate entities. You can see all of them by visiting that page and viewing the console. One question is whether all those entities are actually needed. Maybe there's some optimization we can do there.

Another interesting comparison is between the current GraphQL data (13k nodes, 81mb) and the test data (48k nodes, 1000mb). It seems like the test data is larger than expected. However I talked a bit with @indytechcook about this and my recollection is that the test data is basically created by cloning existing VA facilities multiple times, so it should be representative of what we'll actually see at 48k nodes.

We are using multiple cores on the machine? That can make a huge difference.

I'm not sure about this, but I can check if there's a way to view CPU usage with Prometheus. I'm also not sure to what degree Metalsmith is parallelizing its execution. But I like your thinking here. Maybe there are some build steps where we could add/increase parallelization to improve performance.

One thing we are doing is running three builds in parallel. But beyond that, maybe there are still some cores sitting idle.

It might sound counter-intuitive, but part of the problem is that metalsmith is doing everything in memory. The good news is that the code is tiny so it's pretty easy to determine that it uses the "ware" package and it should seems like it shouldn't be too complicated (famous last words :-)) to change ware to use a file name instead of the actual object and save and read to file between steps. I suspect that if you do that, you'll find the memory footprint shrinking, but the time going up, possibly significantly. In that case you might want to experiment with keep some steps in memory and only saving and reading from disk at specific steps.

Yes good point! This would be really interesting to investigate and test. I'm not sure exactly how the EC2 instances access their disk storage (or what type of disks they have) but this could potentially save a lot of memory without too much of a performance hit. At least that might be true for some steps, as you mention.

If the files are all written to disk, It might also be possible to "farm out" some string replacements to command line tools like sed. It might not work in all cases, but for simple replacements it could be faster.

When all else fails, we can look at using a beefier machine. The problem with that is it's a limited solution. We can only scale it so much.

Agreed, I think we should keep that as an option, but we could hit a dead end if we rely on it too soon. My intuition is that Metalsmith's approach, and some of our current build steps, could still be optimized further. That will probably require deeper digging than we've done in the past but it still seems feasible.

alexandec commented 3 years ago

@mchelen-gov that metalsmith-incremental plugin looks interesting, thanks! I wish the README was a bit clearer on what it's actually doing. But if it's caching the files object to disk, then for future runs checking to see what changed and only running the metalsmith build only on what changed, that could help us a lot. I'll look into it more.

timwright12 commented 3 years ago

We did investigate metalsmith-incremental at one point, it wasn't a drop-in fix for us, but there's likely a good amount of guidance we can take from it to build something custom. I like all the recommendations, esp. "Stop storing downloaded file contents in the files object" - seems like we could get a big gain there. I'm curious to see the build time after we implement these.

Ultimately, 1hr is for sure a problem, but we'll probably start hitting workflow issues around 30mins and increased pipeline times as tests and nodes are added.

drorva commented 3 years ago

The biggest quickest win is probably the multiple core/thread question. Node by default runs in a single core. Using multiple cores for something like a http server is easy using third party products. I suspect in our case, though, just enabling multiple cores is not enough and you'll probably need to do some surgery, while keeping in mind concurrency issues that can get hairy quickly. Looks like https://nodejs.org/api/cluster.html#cluster_cluster is what you need. Off the top of my head:

I would experiment with breaking the work into n threads (start with 2 and go from there?).
Allocate a node to a thread
Allocate a node to the next thread
when thread is done with that node, allocate a new node to the thread
rinse, repeat

Looking at what I wrote above, I'm questioning the "quickest" part, but it still feels like the best return on investment. I would guess that you'll end up with using slightly more memory, but running a lot faster specifically once you start using all the cores/threads. Oh, and the library will report how many threads you have.

alexandec commented 3 years ago

@timwright12 great, we can include investigation of metalsmith-incremental in our list of future areas to explore, even if we can't use it as is. I suspect one of the challenges will be knowing what needs to be rebuilt based on changes in the GraphQL queries. If the entity for an individual page changes, we know that page has to be rebuilt. But things like header/footer changes, sidebar changes, etc could be more challenging.

@drorva thanks for those pointers! Cluster does look promising. One thing that will be interesting to see is, to what degree does the templating engine require access to a variety of properties in the files object in order to render a single HTML page? And if a lot of access is needed, is there an easy way to share the files object between multiple threads? In any case, I agree that some significant performance gains are possible with this approach.

drorva commented 3 years ago

@alexandec good point. The approach that I described is dependent on the processing of each node being independent of the processing of other nodes. At some point in the pipeline, that approach probably breaks, and from that point on, the handling can't be parallelized.

alexandec commented 3 years ago

Areas to investigate

Our plan is first to implement the changes under Recommendations above. After that we'll move into another investigation phase where we'll determine what to tackle in the next implementation round. Some potential areas of investigation include:

Use multiple threads (see Cluster and Dror's comments above)
Instead of keeping the files object in memory, store files on disk throughout the build process (see Dror's comment above)
Perform an overall performance profile of the build to find hot spots
Implement incremental builds (we could possibly learn from the metalsmith-incremental plugin here, see Mike and Tim's comments above). If we can do incremental builds, that represents a large scalability benefit since it would enable fast content releases regardless of total build size. Note however that full builds will still be required in some cases.
Investigate the metalsmith-layouts and metalsmith-permalinks plugins for inefficiencies (these are slow build steps)
Investigate the liquid filters to see if any of them can be optimized. They are run many times when HTML pages are being rendered so optimization could have benefits to build time and memory use
See if we can avoid parsing all HTML files with Cheerio in the final step (the transformations we perform are fast, but parsing is slow)
Increase processing power and/or available memory (not our first choice because as Dror notes above, it has limits)
Test the 48k node content build on GitHub Actions (important since we are in the process of migrating from Jenkins to GHA)

I'm currently looking into testing the 48k content build on our GHA self-hosted-runner, which will require a different runner type to provide enough memory (at least 25gb is needed). This testing is important in order to surface issues early, if there are any. I created an Ops ticket to increase available memory.

alexandec commented 3 years ago

Findings so far running the content build on our GitHub Actions self-hosted runners:

On Jenkins all 3 builds (dev, staging, prod) run on the same EC2 instance in parallel. With GHA self-hosted runners, each build runs on a separate EC2 instance.
The content build takes about 45 minutes on GHA using the same configuration as we had with Jenkins. I assume this is mostly due to each build having an instance all to itself, but there may be other differences as well. For example there's no Docker container by default on self-hosted GHA instances.
Currently we only have 2 self-hosted runners available for the content-build repo. That means only 2 builds can run in parallel. The third build waits until the first two are finished before it starts running. However Ops can adjust the number of self-hosted runners.
Ops is looking at a variety of approaches for how to set the size of the runner pool for each repo, potentially how to autoscale runners, or to allow each job to specify the number and EC2 instance type of its runners (see Slack discussion).
Running each build on its own instance, in addition to being faster, also means we have more memory available. Scaling becomes easier.
I tried turning off manual garbage collection since there's more memory available. This further reduced build time to about 41 minutes.

alexandec commented 3 years ago

Closing this ticket -- further efforts will continue in other tickets (see Recommendations section above).