department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Improve memory usage during the content build #13821

Closed bkjohnson closed 3 years ago

bkjohnson commented 4 years ago

Issue Description

The content from the Drupal CMS is getting larger, and we have noticed our content build process fail a few times due to running out of memory (Example). We have increased the allotted heap size from 4GB to 8GB, but this is not a sustainable solution.
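For reference, raising the Node heap ceiling is normally done with the V8 `--max-old-space-size` flag; the exact wiring in our build scripts or Jenkins job may differ, so this invocation is only illustrative of the stopgap described above:

```sh
# Stopgap only: run the content build with an 8 GB V8 heap instead of the default.
# The real build may pass this flag to node directly rather than via NODE_OPTIONS.
NODE_OPTIONS=--max-old-space-size=8192 yarn build:content --verbose --pull-drupal
```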

Goal

The goal is to improve the content build so that it requires less memory.

Acceptance Criteria

cvalarida commented 4 years ago

I just ran the content build with yarn build:content --verbose --pull-drupal on the latest master and it resulted in the following table.

| Step | Description | Time Elapsed | Memory Used This Step | Total Memory Used After Step |
|---|---|---|---|---|
| 0 | Preserving Webpack build output | 380ms | 2.9 MB | 38.52 MB |
| 1 | Create React pages | 1ms | 0.18 MB | 38.72 MB |
| 2 | Get Drupal content | 345783ms | 171.33 MB | 210.07 MB |
| 3 | Add Drupal Prefix | 5ms | 2.28 MB | 212.37 MB |
| 4 | Create Outreach Assets Data | 6ms | 0.51 MB | 212.89 MB |
| 5 | Create environment filter | 1ms | 0.1 MB | 213.01 MB |
| 6 | Add filenames for debugging | 0ms | 0.2 MB | 213.22 MB |
| 7 | Check collections | 1ms | 0.21 MB | 213.45 MB |
| 8 | Group collections | 529ms | 0.9 MB | 214.37 MB |
| 9 | Reset left rail navigation menu levels | 1ms | 0.33 MB | 214.71 MB |
| 10 | Add the date to filenames | 25ms | 6.6 MB | 221.33 MB |
| 11 | Add app assets | 176ms | -4.37 MB | 216.98 MB |
| 12 | Add content assets | 93ms | -7.31 MB | 209.68 MB |
| 13 | Plug the content into the templates | 129ms | 7.16 MB | 216.86 MB |
| 14 | Translate the markdown to html | 16ms | 1.16 MB | 218.03 MB |
| 15 | Add permalinks and change foo.md to foo/index.html | 435ms | 4.35 MB | 222.4 MB |
| 16 | Create header and footer | 6ms | 1.13 MB | 223.54 MB |
| 17 | Generate navigation | 230ms | 3.01 MB | 226.57 MB |
| 18 | Apply layouts | 13896ms | 92.47 MB | 319.06 MB |
| 19 | Rewrite VA domains for the buildtype | 1486ms | -88.42 MB | 230.66 MB |
| 20 | Rewrite Drupal pages | 0ms | 0.01 MB | 230.68 MB |
| 21 | Create Drupal debug page | 1ms | 0.53 MB | 231.23 MB |
| 22 | Download Drupal assets | 82905ms | -11.2 MB | 220.04 MB |
| 23 | Read application assets from disk | 430ms | -0.87 MB | 219.19 MB |
| 24 | Create sitemap | 57ms | 1.77 MB | 220.98 MB |
| 25 | Update robots.txt | 0ms | 0.01 MB | 221 MB |
| 26 | Check for CMS URLs | 978ms | 18.53 MB | 239.54 MB |
| 27 | Parse HTML files | 27446ms | 4016.12 MB | 4255.68 MB |
| 28 | Add nonce to script tags | 1262ms | 82.52 MB | 4338.22 MB |
| 29 | Process [data-entry-name] attributes into Webpack asset paths | 399ms | 1.46 MB | 4339.69 MB |
| 30 | Update external links | 315ms | 9.9 MB | 4349.61 MB |
| 31 | Add IDs to subheadings | 552ms | 12.04 MB | 4361.66 MB |
| 32 | Check for broken links | 3512ms | 352.35 MB | 4714.03 MB |
| 33 | Inject axe-core for accessibility | 4ms | 0.88 MB | 4714.92 MB |
| 34 | Save the changes from the modified DOM | 2876ms | -3651.54 MB | 1063.4 MB |

This doesn't reflect the peak memory usage, though—just the memory between steps.
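(For context, here is a minimal sketch of how a per-step time/memory table like this can be produced. The `step.run`/`step.description` shape is hypothetical, but `process.hrtime.bigint()` and `process.memoryUsage().heapUsed` are the standard Node APIs for this kind of instrumentation.)

```js
// Hypothetical wrapper around the build steps: records elapsed time and the
// change in V8 heap usage after each step completes (not peak usage).
function instrumentSteps(steps) {
  const rows = [];

  return async function runAll(files) {
    let previousHeap = process.memoryUsage().heapUsed;

    for (const [index, step] of steps.entries()) {
      const start = process.hrtime.bigint();
      await step.run(files); // each step mutates the shared files object

      const heapNow = process.memoryUsage().heapUsed;
      rows.push({
        Step: index,
        Description: step.description,
        'Time Elapsed': `${(process.hrtime.bigint() - start) / 1000000n}ms`,
        'Memory Used This Step': `${((heapNow - previousHeap) / 1024 / 1024).toFixed(2)}MB`,
        'Total Memory Used After Step': `${(heapNow / 1024 / 1024).toFixed(2)}MB`,
      });
      previousHeap = heapNow;
    }

    console.table(rows); // prints a table similar to the one above
  };
}
```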

cvalarida commented 4 years ago

Looks like the parsed HTML takes up the bulk of the memory. I wonder if there's a way to serialize / deserialize the parsed Cheerio data structure. If so, we could parse each HTML file once, write the serialized form to disk, and then in each step that requires parsed HTML, read it back from disk, perform the operations, re-serialize it, and write it back to disk. It would be slower, but we'd save on memory usage.
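A minimal sketch of what that disk-backed round trip could look like, assuming Cheerio's `load()`/`html()` round trip is an acceptable serialized form; the cache directory and helper names here are hypothetical, not part of the actual build:

```js
const fs = require('fs');
const path = require('path');
const cheerio = require('cheerio');

// Hypothetical on-disk cache for serialized documents.
const CACHE_DIR = path.join(__dirname, '.parsed-html-cache');

// Parse once, then persist the document as an HTML string instead of
// keeping the Cheerio object in memory.
function parseAndPersist(fileName, html) {
  const $ = cheerio.load(html);
  fs.mkdirSync(CACHE_DIR, { recursive: true });
  fs.writeFileSync(path.join(CACHE_DIR, encodeURIComponent(fileName)), $.html());
}

// Each DOM-touching step re-hydrates the document, operates on it, and re-serializes it.
function withParsedDom(fileName, operate) {
  const cachePath = path.join(CACHE_DIR, encodeURIComponent(fileName));
  const $ = cheerio.load(fs.readFileSync(cachePath, 'utf8'));
  operate($); // e.g. add nonces, rewrite links, add heading IDs
  fs.writeFileSync(cachePath, $.html());
}
```

The cost is the extra disk I/O and re-parsing on every DOM-touching step, which is the slowdown mentioned above.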

ncksllvn commented 4 years ago

As an alternative to @cvalarida's suggestion, we could parse each HTML file one at a time, do all of the required operations on that file, then tear down that virtual DOM before moving on to the next file. It would require some restructuring of the build steps between parseHtml and replaceContentsWithDom, but it would effectively eliminate the memory issue.

cvalarida commented 4 years ago

That's the way the pipeline was structured once upon a time, but back then the time cost outweighed the memory constraint. I actually changed it to parse the HTML in a single step, which saved about 30 seconds at the time; that's probably quite a lot more now.

That said, trading time for memory is definitely a possibility. :+1: It may just make content builds a couple of minutes slower, which I think will receive pushback and may require DEPO approval. I'm not sure.

ncksllvn commented 4 years ago

I think we might be able to get the best of both worlds by restructuring the plugins that require the virtual DOM of each page. It would require making them "sub-plugins" of parse_html instead of true Metalsmith plugins. Basically it would look like this:

for file in files
    parse_html(file)
    add_nonce(file)
    add_entry_name(file)
    save_modified_dom(file)

You would need to get creative with plugins that require the virtual DOM of more than one file at a time, because such a step couldn't execute on a single file in isolation. Hopefully there aren't any.

If there are plugins that require access to the whole files object (the first argument of a Metalsmith plugin), you would have to figure something out there too. This is the case for the broken link checker, I think.
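A rough sketch of that restructuring as a single Metalsmith plugin, assuming the per-file work can be expressed as functions that take the parsed document; the transforms below are toy examples, not the real nonce/entry-name/link/heading logic:

```js
const cheerio = require('cheerio');

// Illustrative per-file transforms; the real build would plug in its existing
// DOM-modifying steps here instead of these toy examples.
const domTransforms = [
  ($) => $('a[href^="http"]').attr('rel', 'noopener'),
  ($) => $('h2').each((i, el) => $(el).attr('id', $(el).attr('id') || `heading-${i}`)),
];

// One Metalsmith plugin (files, metalsmith, done) that parses, transforms, and
// re-serializes each HTML file before moving on, so only one virtual DOM is
// alive at a time.
function modifyDomPerFile(files, metalsmith, done) {
  for (const [fileName, file] of Object.entries(files)) {
    if (!fileName.endsWith('.html')) continue;

    const $ = cheerio.load(file.contents.toString());
    domTransforms.forEach((transform) => transform($, file, fileName));
    file.contents = Buffer.from($.html()); // the parsed DOM is released before the next file
  }
  done();
}
```

Whole-build checks like the broken link checker could either run against the serialized HTML strings or collect what they need inside this loop and report once at the end.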

ncksllvn commented 4 years ago

Personally, I think this is definitely worth prioritizing. We might see some benefits in Jenkins.

ncksllvn commented 3 years ago

I am addressing this issue, using the approach outlined above, in this PR: https://github.com/department-of-veterans-affairs/vets-website/pull/15601

meganhkelley commented 3 years ago

Duplicative of #20764