Closed: bkjohnson closed this issue 3 years ago
I just ran the content build with `yarn build:content --verbose --pull-drupal` on the latest master, and it produced the following table.
Step | Description | Time Elapsed | Memory Used This Step | Total Memory Used After Step |
---|---|---|---|---|
0 | Preserving Webpack build output | 380ms | 2.9MB | 38.52MB
1 | Create React pages | 1ms | 0.18MB | 38.72MB
2 | Get Drupal content | 345783ms | 171.33MB | 210.07MB
3 | Add Drupal Prefix | 5ms | 2.28MB | 212.37MB
4 | Create Outreach Assets Data | 6ms | 0.51MB | 212.89MB
5 | Create environment filter | 1ms | 0.1MB | 213.01MB
6 | Add filenames for debugging | 0ms | 0.2MB | 213.22MB
7 | Check collections | 1ms | 0.21MB | 213.45MB
8 | Group collections | 529ms | 0.9MB | 214.37MB
9 | Reset left rail navigation menu levels | 1ms | 0.33MB | 214.71MB
10 | Add the date to filenames | 25ms | 6.6MB | 221.33MB
11 | Add app assets | 176ms | -4.37MB | 216.98MB
12 | Add content assets | 93ms | -7.31MB | 209.68MB
13 | Plug the content into the templates | 129ms | 7.16MB | 216.86MB
14 | Translate the markdown to html | 16ms | 1.16MB | 218.03MB
15 | Add permalinks and change foo.md to foo/index.html | 435ms | 4.35MB | 222.4MB
16 | Create header and footer | 6ms | 1.13MB | 223.54MB
17 | Generate navigation | 230ms | 3.01MB | 226.57MB
18 | Apply layouts | 13896ms | 92.47MB | 319.06MB
19 | Rewrite VA domains for the buildtype | 1486ms | -88.42MB | 230.66MB
20 | Rewrite Drupal pages | 0ms | 0.01MB | 230.68MB
21 | Create Drupal debug page | 1ms | 0.53MB | 231.23MB
22 | Download Drupal assets | 82905ms | -11.2MB | 220.04MB
23 | Read application assets from disk | 430ms | -0.87MB | 219.19MB
24 | Create sitemap | 57ms | 1.77MB | 220.98MB
25 | Update robots.txt | 0ms | 0.01MB | 221MB
26 | Check for CMS URLs | 978ms | 18.53MB | 239.54MB
27 | Parse HTML files | 27446ms | 4016.12MB | 4255.68MB
28 | Add nonce to script tags | 1262ms | 82.52MB | 4338.22MB
29 | Process [data-entry-name] attributes into Webpack asset paths | 399ms | 1.46MB | 4339.69MB
30 | Update external links | 315ms | 9.9MB | 4349.61MB
31 | Add IDs to subheadings | 552ms | 12.04MB | 4361.66MB
32 | Check for broken links | 3512ms | 352.35MB | 4714.03MB
33 | Inject axe-core for accessibility | 4ms | 0.88MB | 4714.92MB
34 | Save the changes from the modified DOM | 2876ms | -3651.54MB | 1063.4MB
This doesn't reflect the peak memory usage, though—just the memory between steps.
Looks like the parsed HTML takes up the bulk of the memory. I wonder if there's a way to serialize / deserialize the parsed Cheerio data structure. If so, we could parse each HTML file, write it to disk, and then in each step that requires parsed HTML, read the serialized data structure from disk, perform the operations, re-serialize it, and write it back to disk. It would be slower, but we'd save on memory usage.
As an alternative to @cvalarida's suggestion, we could parse each HTML file one at a time, perform all of the required operations on that file, then tear down that virtual DOM before moving on to the next file. It would require some restructuring of the build steps between `parseHtml` and `replaceContentsWithDom`, but it would effectively eliminate the memory issue.
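A minimal sketch of that restructuring, assuming the usual Metalsmith `files` object (path keys, `contents` Buffers). `processOneFileAtATime`, `domSteps`, `parseFn`, and `serializeFn` are hypothetical names; `domSteps` stands in for the real plugins (nonce injection, heading IDs, etc.) and `parseFn`/`serializeFn` for `cheerio.load` / `$.html()`.

```javascript
// Hypothetical sketch of the per-file alternative: parse one page, run every
// DOM-dependent step against it, serialize the result back into the file
// object, and let the tree be garbage-collected before the next file.
function processOneFileAtATime(files, domSteps, parseFn, serializeFn) {
  for (const name of Object.keys(files)) {
    if (!name.endsWith('.html')) continue; // non-HTML assets pass through untouched
    const dom = parseFn(files[name].contents.toString());
    for (const step of domSteps) {
      step(dom, name);
    }
    files[name].contents = Buffer.from(serializeFn(dom));
    // dom goes out of scope here, so at most one parsed tree is live at a time.
  }
}
```

The peak memory then scales with the largest single page rather than the sum of all pages.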
That's the way the pipeline was structured once upon a time, but the time cost outweighed the memory constraint back then. I actually changed it to parse the HTML in a single step to save what I think was about 30 seconds at the time. That's probably quite a lot more now.
That said, trading time for memory is definitely a possibility. :+1: It may just make content builds a couple of minutes slower, which I think will receive pushback and may require DEPO approval. I'm not sure.
I think we might be able to get the best of both worlds by restructuring the plugins that require the virtual DOM of each page. It would require making them "sub-plugins" of `parse_html` instead of true Metalsmith plugins. Basically it would be:

    for file in files
        parse_html(file)
        add_nonce(file)
        add_entry_name(file)
        save_modified_dom(file)
You would need to be creative about what to do when a plugin requires the virtual DOM of more than one file at a time, because that plugin/step wouldn't be able to execute on a single file in isolation. Hopefully there aren't any.
If there are plugins that require access to the whole `files` object that is the first argument of a Metalsmith plugin, you would have to figure out something to do there too. This is the case for the broken link checker, I think.
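One possible way around the whole-`files` problem for something like the broken link checker: during each file's single pass, extract just the hrefs into a lightweight index, then run the cross-file check against that index instead of against live DOMs. This is a sketch, not the build's actual checker; `collectLinks`, `findBrokenLinks`, `getHrefs`, and `builtPages` are all hypothetical names, and `getHrefs` stands in for something like `$('a[href]')`.

```javascript
// Hypothetical sketch: split a whole-site check into a cheap per-file
// collection step plus a final pass over the collected summary, so no
// parsed trees need to outlive their file's turn in the loop.
function collectLinks(name, dom, linkIndex, getHrefs) {
  linkIndex.set(name, getHrefs(dom));
}

// After every file has been processed: flag internal links that don't
// resolve to a page that was actually built.
function findBrokenLinks(linkIndex, builtPages) {
  const broken = [];
  for (const [page, hrefs] of linkIndex) {
    for (const href of hrefs) {
      if (href.startsWith('/') && !builtPages.has(href)) {
        broken.push({ page, href });
      }
    }
  }
  return broken;
}
```

The index holds only strings, so it stays small even across the full site.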
Personally, I think this is definitely worth prioritizing. We might see some benefits in Jenkins.
I am addressing this issue using the approach outlined above in this PR: https://github.com/department-of-veterans-affairs/vets-website/pull/15601
Duplicative of #20764
Issue Description
The content from the Drupal CMS is getting larger, and we have seen our content build fail several times due to running out of memory (Example). We have increased the allotted heap size from 4GB to 8GB, but this is not a sustainable solution.
Goal
The goal is to improve the content build so that it requires less memory.
Acceptance Criteria
Reduce `vets-website` content build memory usage to an "acceptable" MB-per-page level, preferably under 4GB total.