department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Improve memory usage during the content build #13821

Closed bkjohnson closed 3 years ago

bkjohnson commented 4 years ago

Issue Description

The content from the Drupal CMS is getting larger, and we have noticed our content build process fail a few times due to running out of memory (Example). We have increased the allotted heap size from 4GB to 8GB, but this is not a sustainable solution.
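For reference, raising the Node heap ceiling is normally done with the V8 `--max-old-space-size` flag; the exact wiring in our build scripts or Jenkins job may differ, so this invocation is only illustrative of the stopgap described above:

```sh
# Stopgap only: run the content build with an 8 GB V8 heap instead of the default.
# The real build may pass this flag to node directly rather than via NODE_OPTIONS.
NODE_OPTIONS=--max-old-space-size=8192 yarn build:content --verbose --pull-drupal
```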

Goal

The goal is to improve the content build so that it requires less memory.

Acceptance Criteria

cvalarida commented 4 years ago

I just ran the content build with yarn build:content --verbose --pull-drupal on the latest master and it resulted in the following table.

| Step | Description | Time Elapsed | Memory Used This Step | Total Memory Used After Step |
|---|---|---|---|---|
| 0 | Preserving Webpack build output | 380ms | 2.9 MB | 38.52 MB |
| 1 | Create React pages | 1ms | 0.18 MB | 38.72 MB |
| 2 | Get Drupal content | 345783ms | 171.33 MB | 210.07 MB |
| 3 | Add Drupal Prefix | 5ms | 2.28 MB | 212.37 MB |
| 4 | Create Outreach Assets Data | 6ms | 0.51 MB | 212.89 MB |
| 5 | Create environment filter | 1ms | 0.1 MB | 213.01 MB |
| 6 | Add filenames for debugging | 0ms | 0.2 MB | 213.22 MB |
| 7 | Check collections | 1ms | 0.21 MB | 213.45 MB |
| 8 | Group collections | 529ms | 0.9 MB | 214.37 MB |
| 9 | Reset left rail navigation menu levels | 1ms | 0.33 MB | 214.71 MB |
| 10 | Add the date to filenames | 25ms | 6.6 MB | 221.33 MB |
| 11 | Add app assets | 176ms | -4.37 MB | 216.98 MB |
| 12 | Add content assets | 93ms | -7.31 MB | 209.68 MB |
| 13 | Plug the content into the templates | 129ms | 7.16 MB | 216.86 MB |
| 14 | Translate the markdown to html | 16ms | 1.16 MB | 218.03 MB |
| 15 | Add permalinks and change foo.md to foo/index.html | 435ms | 4.35 MB | 222.4 MB |
| 16 | Create header and footer | 6ms | 1.13 MB | 223.54 MB |
| 17 | Generate navigation | 230ms | 3.01 MB | 226.57 MB |
| 18 | Apply layouts | 13896ms | 92.47 MB | 319.06 MB |
| 19 | Rewrite VA domains for the buildtype | 1486ms | -88.42 MB | 230.66 MB |
| 20 | Rewrite Drupal pages | 0ms | 0.01 MB | 230.68 MB |
| 21 | Create Drupal debug page | 1ms | 0.53 MB | 231.23 MB |
| 22 | Download Drupal assets | 82905ms | -11.2 MB | 220.04 MB |
| 23 | Read application assets from disk | 430ms | -0.87 MB | 219.19 MB |
| 24 | Create sitemap | 57ms | 1.77 MB | 220.98 MB |
| 25 | Update robots.txt | 0ms | 0.01 MB | 221 MB |
| 26 | Check for CMS URLs | 978ms | 18.53 MB | 239.54 MB |
| 27 | Parse HTML files | 27446ms | 4016.12 MB | 4255.68 MB |
| 28 | Add nonce to script tags | 1262ms | 82.52 MB | 4338.22 MB |
| 29 | Process [data-entry-name] attributes into Webpack asset paths | 399ms | 1.46 MB | 4339.69 MB |
| 30 | Update external links | 315ms | 9.9 MB | 4349.61 MB |
| 31 | Add IDs to subheadings | 552ms | 12.04 MB | 4361.66 MB |
| 32 | Check for broken links | 3512ms | 352.35 MB | 4714.03 MB |
| 33 | Inject axe-core for accessibility | 4ms | 0.88 MB | 4714.92 MB |
| 34 | Save the changes from the modified DOM | 2876ms | -3651.54 MB | 1063.4 MB |

This doesn't reflect the peak memory usage, though—just the memory between steps.
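(For context, here is a minimal sketch of how a per-step time/memory table like this can be produced. The `step.run`/`step.description` shape is hypothetical, but `process.hrtime.bigint()` and `process.memoryUsage().heapUsed` are the standard Node APIs for this kind of instrumentation.)

```js
// Hypothetical wrapper around the build steps: records elapsed time and the
// change in V8 heap usage after each step completes (not peak usage).
function instrumentSteps(steps) {
  const rows = [];

  return async function runAll(files) {
    let previousHeap = process.memoryUsage().heapUsed;

    for (const [index, step] of steps.entries()) {
      const start = process.hrtime.bigint();
      await step.run(files); // each step mutates the shared files object

      const heapNow = process.memoryUsage().heapUsed;
      rows.push({
        Step: index,
        Description: step.description,
        'Time Elapsed': `${(process.hrtime.bigint() - start) / 1000000n}ms`,
        'Memory Used This Step': `${((heapNow - previousHeap) / 1024 / 1024).toFixed(2)}MB`,
        'Total Memory Used After Step': `${(heapNow / 1024 / 1024).toFixed(2)}MB`,
      });
      previousHeap = heapNow;
    }

    console.table(rows); // prints a table similar to the one above
  };
}
```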

cvalarida commented 4 years ago

Looks like the parsed HTML takes up the bulk of the memory. I wonder if there's a way to serialize / deserialize the parsed Cheerio data structure. If so, we could parse each HTML file once, write the serialized form to disk, and then in each step that requires parsed HTML, read it back from disk, perform the operations, re-serialize it, and write it back to disk. It would be slower, but we'd save on memory usage.
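A minimal sketch of what that disk-backed round trip could look like, assuming Cheerio's `load()`/`html()` round trip is an acceptable serialized form; the cache directory and helper names here are hypothetical, not part of the actual build:

```js
const fs = require('fs');
const path = require('path');
const cheerio = require('cheerio');

// Hypothetical on-disk cache for serialized documents.
const CACHE_DIR = path.join(__dirname, '.parsed-html-cache');

// Parse once, then persist the document as an HTML string instead of
// keeping the Cheerio object in memory.
function parseAndPersist(fileName, html) {
  const $ = cheerio.load(html);
  fs.mkdirSync(CACHE_DIR, { recursive: true });
  fs.writeFileSync(path.join(CACHE_DIR, encodeURIComponent(fileName)), $.html());
}

// Each DOM-touching step re-hydrates the document, operates on it, and re-serializes it.
function withParsedDom(fileName, operate) {
  const cachePath = path.join(CACHE_DIR, encodeURIComponent(fileName));
  const $ = cheerio.load(fs.readFileSync(cachePath, 'utf8'));
  operate($); // e.g. add nonces, rewrite links, add heading IDs
  fs.writeFileSync(cachePath, $.html());
}
```

The cost is the extra disk I/O and re-parsing on every DOM-touching step, which is the slowdown mentioned above.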

ncksllvn commented 4 years ago

As an alternative to @cvalarida's suggestion, we could parse each HTML file one at a time, do all of the required operations on that file, then tear down that virtual DOM before moving on to the next file. It would require some restructuring of the build steps between parseHtml and replaceContentsWithDom, but it would effectively eliminate the memory issue.

cvalarida commented 4 years ago

That's the way the pipeline was structured once upon a time, but back then the time cost outweighed the memory constraint. I actually changed it to parse the HTML in a single step, which saved about 30 seconds at the time; that's probably quite a lot more now.

That said, trading time for memory is definitely a possibility. :+1: It may just make content builds a couple of minutes slower, which I think will receive pushback and may require DEPO approval. I'm not sure.

ncksllvn commented 4 years ago

I think we might be able to get the best of both worlds by restructuring the plugins that require the virtual DOM of each page. It would require making them "sub-plugins" of parse_html instead of true Metalsmith plugins. Basically it would look like this:

for file in files
    parse_html(file)
    add_nonce(file)
    add_entry_name(file)
    save_modified_dom(file)

You would need to get creative with plugins that require the virtual DOM of more than one file at a time, because such a step couldn't execute on a single file in isolation. Hopefully there aren't any.

If there are plugins that require access to the whole files object (the first argument of a Metalsmith plugin), you would have to figure something out there too. This is the case for the broken link checker, I think.
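A rough sketch of that restructuring as a single Metalsmith plugin, assuming the per-file work can be expressed as functions that take the parsed document; the transforms below are toy examples, not the real nonce/entry-name/link/heading logic:

```js
const cheerio = require('cheerio');

// Illustrative per-file transforms; the real build would plug in its existing
// DOM-modifying steps here instead of these toy examples.
const domTransforms = [
  ($) => $('a[href^="http"]').attr('rel', 'noopener'),
  ($) => $('h2').each((i, el) => $(el).attr('id', $(el).attr('id') || `heading-${i}`)),
];

// One Metalsmith plugin (files, metalsmith, done) that parses, transforms, and
// re-serializes each HTML file before moving on, so only one virtual DOM is
// alive at a time.
function modifyDomPerFile(files, metalsmith, done) {
  for (const [fileName, file] of Object.entries(files)) {
    if (!fileName.endsWith('.html')) continue;

    const $ = cheerio.load(file.contents.toString());
    domTransforms.forEach((transform) => transform($, file, fileName));
    file.contents = Buffer.from($.html()); // the parsed DOM is released before the next file
  }
  done();
}
```

Whole-build checks like the broken link checker could either run against the serialized HTML strings or collect what they need inside this loop and report once at the end.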

ncksllvn commented 4 years ago

Personally, I think this is definitely worth prioritizing. We might see some benefits in Jenkins.

ncksllvn commented 3 years ago

I am addressing this issue, using the approach outlined above, in this PR: https://github.com/department-of-veterans-affairs/vets-website/pull/15601

meganhkelley commented 3 years ago

Duplicative of #20764