alexandec closed this issue 3 years ago
I have a working approach to streaming 1 GB of JSON into a single file. Now working on reading the data back in a similar fashion.
To put 1 GB of data into a single JSON file, I opted to write each page (`nodeQuery` entity) to the file one at a time. The other queries are smaller, so we can stringify each of them in one chunk.
```js
const fs = require('fs');

function saveCacheFile(cacheFilePath, pages) {
  const outputStream = fs.createWriteStream(cacheFilePath);
  outputStream.write('{\n "data": {\n');
  const queries = Object.keys(pages.data);
  queries.forEach((query, queryIndex) => {
    const props = pages.data[query];
    outputStream.write(`"${query}": `);
    if (query === 'nodeQuery') {
      // Write each entity separately to avoid the ~500 MB string length limit
      outputStream.write('{\n"entities": [\n');
      props.entities.forEach((entity, entityIndex) => {
        outputStream.write(JSON.stringify(entity, null, 2));
        if (entityIndex < props.entities.length - 1) outputStream.write(',');
        outputStream.write('\n');
      });
      outputStream.write(']\n}');
    } else {
      // These queries are smaller so we can just stringify them for simplicity
      outputStream.write(JSON.stringify(props, null, 2));
    }
    // Comma between top-level queries, but not after the last one
    if (queryIndex < queries.length - 1) outputStream.write(',');
    outputStream.write('\n');
  });
  outputStream.write('}\n}\n');
  outputStream.end();
}
```
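Since the output is assembled from manual string writes rather than a single `JSON.stringify` call, it's worth sanity-checking that the chunks concatenate into valid JSON. Here's a minimal in-memory sketch of the same chunk sequence with tiny sample data (the `routerQuery` name is just an illustrative stand-in for one of the smaller queries):

```js
// Build the same chunk sequence saveCacheFile streams to disk, but in memory,
// then confirm the concatenation parses as JSON. Small data only.
function buildChunks(pages) {
  const chunks = ['{\n "data": {\n'];
  const queries = Object.keys(pages.data);
  queries.forEach((query, queryIndex) => {
    chunks.push(`"${query}": `);
    if (query === 'nodeQuery') {
      chunks.push('{\n"entities": [\n');
      const { entities } = pages.data[query];
      entities.forEach((entity, i) => {
        chunks.push(JSON.stringify(entity, null, 2));
        if (i < entities.length - 1) chunks.push(',');
        chunks.push('\n');
      });
      chunks.push(']\n}');
    } else {
      chunks.push(JSON.stringify(pages.data[query], null, 2));
    }
    if (queryIndex < queries.length - 1) chunks.push(',');
    chunks.push('\n');
  });
  chunks.push('}\n}\n');
  return chunks;
}

const sample = {
  data: {
    nodeQuery: { entities: [{ entityId: '1' }, { entityId: '2' }] },
    routerQuery: { routes: ['/a', '/b'] },
  },
};
const parsed = JSON.parse(buildChunks(sample).join(''));
console.log(parsed.data.nodeQuery.entities.length); // 2
```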
When it comes to reading the 1 GB JSON file, I found that the JSONStream module handles it well:
```js
const fs = require('fs');
const JSONStream = require('JSONStream');

async function loadCacheFile(cacheFilePath) {
  return new Promise((resolve, reject) => {
    const stream = fs.createReadStream(cacheFilePath, { encoding: 'utf8' });
    // parse('*') emits each value under the root object; the first (and only)
    // one here is the "data" object, so we resolve as soon as it arrives
    const parser = JSONStream.parse('*').on('data', data => {
      resolve({ data });
    });
    parser.on('error', reject);
    stream.on('error', reject);
    stream.pipe(parser);
  });
}
```
Next step is to benchmark this and compare with writing each page to a separate file.
For comparison, here's the implementation to store each page in a separate JSON file:
```js
const path = require('path');
const fs = require('fs-extra'); // emptyDirSync/outputJsonSync come from fs-extra

function createJSONCacheFiles(cacheDir, drupalPages) {
  const pageCachePath = path.join(cacheDir, DRUPAL_PAGE_CACHE_DIR);
  const otherCachePath = path.join(cacheDir, DRUPAL_OTHER_CACHE_DIR);
  fs.emptyDirSync(cacheDir);
  // Store each nodeQuery entity in a separate file to avoid hitting the size limit
  drupalPages.data.nodeQuery.entities.forEach(page => {
    const fullPath = path.join(pageCachePath, page.entityId);
    fs.outputJsonSync(fullPath, page, { spaces: 2 });
  });
  // The other queries are smaller so we can store the entire query in one file
  Object.keys(drupalPages.data)
    .filter(queryName => queryName !== 'nodeQuery')
    .forEach(queryName => {
      const fullPath = path.join(otherCachePath, queryName);
      fs.outputJsonSync(fullPath, drupalPages.data[queryName], { spaces: 2 });
    });
}
```
And here's the code to load the individual JSON cache files into memory:
```js
const path = require('path');
const fs = require('fs-extra'); // readJsonSync comes from fs-extra

function loadJSONCacheFiles(cacheDir) {
  const pageCachePath = path.join(cacheDir, DRUPAL_PAGE_CACHE_DIR);
  const otherCachePath = path.join(cacheDir, DRUPAL_OTHER_CACHE_DIR);
  const result = { data: { nodeQuery: { entities: [] } } };
  // Load the page query cache, one file per page
  const files = fs.readdirSync(pageCachePath);
  files.forEach(file => {
    const contents = fs.readJsonSync(path.join(pageCachePath, file));
    result.data.nodeQuery.entities.push(contents);
  });
  // Load the other query caches, one file per query
  const otherFiles = fs.readdirSync(otherCachePath);
  otherFiles.forEach(file => {
    const contents = fs.readJsonSync(path.join(otherCachePath, file));
    result.data[file] = contents;
  });
  return result;
}
```
See implementation for both methods above.
Single file method — Read: 26 seconds, Write: 11 seconds
Multiple file method — Read: 9 seconds, Write: 14 seconds
Based on these local benchmarks, the single file JSON caching approach reduces write time by 3 seconds but increases read time by 17 seconds compared with the multiple file approach. The time difference between the two approaches is negligible in the context of the entire content build, which takes about 40 minutes with 48k nodes.
It's possible that as the cache grows we could eventually hit the filesystem's maximum file size. Most likely this is at least 2 TB, which is well beyond the amount of data we will ever have in the cache. I asked about this and will report back with any findings.
The single file method is what we currently use, so fewer code changes are required. Also, for operations like clearing the cache or checking cache state, a single file is simpler. For those reasons I favor the single file approach.
Both methods write and read files synchronously. There are potential speed improvements to be had with an async approach. Also, I haven't benchmarked other JSON streaming read approaches beyond the JSONStream module. It's possible a different approach would be faster.
However, since the total execution time is low for both methods, I recommend focusing optimization effort on other parts of the build process first. Notably the "Apply layouts" step takes about 16 minutes and the "Parse a virtual DOM from every .html file" step takes about 10 minutes.
Another useful optimization would be to add a `--drupal-no-local-cache` option to disable the local cache entirely (see discussion with Nick's suggestion). While the cache is useful for local testing, it's not needed during Jenkins builds. These builds only run once and can take care of everything in memory without writing to or reading from the filesystem cache. I created a ticket to add this option.
I encountered occasional tinyliquid timeouts at 48k nodes. These warrant further investigation because they cause the build to fail. We are already attempting to increase the timeout as follows:
```js
new liquid.Context({ timeout: 1200000 });
```
However I suspect this is a no-op and the default 2 minute timeout is still in place. 2 minutes may be insufficient when too many files are processed in parallel. We could try adding the following to our metalsmith-layouts plugin configuration:
```json
"engineOptions": {
  "timeout": 1200000
}
```
Then we could add debugging output to the tinyliquid module to verify that the larger 20 minute timeout is in place. Here's the relevant ticket.
@timwright12 any thoughts on these findings, or improvements I can make?
@alexandec (cc @meganhkelley )
Cache
I think the `--drupal-no-local-cache` option makes sense. I would just confirm that we're not using it in the pipeline anywhere (https://github.com/department-of-veterans-affairs/vets-website/blob/master/jenkins/common.groovy#L384) - I don't think we are.
Timeouts
For the `new liquid.Context({ timeout: 1200000 })`: that was added because the tinyliquid templates were taking too long. If `engineOptions` works, I'm fine with that as well.
Recommendations
I actually lean a bit toward the separate-cache-file-per-page approach because it feels like it would be a quicker path if we needed to implement incremental builds (checking each file and only updating the changed ones). If you don't think using a single file would prevent building incrementally, I'm fine with that. You've been the one with your head in the code.
Either way, let's document that decision in https://github.com/department-of-veterans-affairs/va.gov-team/tree/master/platform/engineering/frontend (just a doc like `decision-metalsmith-cache.md` - whatever you want to call it).
I think you could just use the content of your last three comments ^ as the content
@timwright12 thanks! I brought up that incremental build question at the Content Release Engineering Sync meeting last Tuesday. Their current thinking is to have Drupal determine what needs to be updated, pipe just that content into the build process, and replace just those static HTML pages on the site. The local JSON cache wouldn't be needed with that approach so they didn't think the caching method mattered.
The more I think about it, the local cache really only seems relevant for local development, to save time. For prod/staging/dev we'll either be generating the entire site or doing small incremental updates, but either way, Drupal will provide the latest data as needed for that task.
The documentation you mention sounds good, I'll add that.
PR documenting this decision: https://github.com/department-of-veterans-affairs/va.gov-team/pull/23787
PR is merged, CMS team is good with the decision. Moving on to implementation.
Details
Recommended in #20764
Currently we store the entire GraphQL cache in a single JSON file. File size is limited to 500 MB due to the string length limit in the V8 engine used by Node.js. As we scale we'll need to expand the cache beyond 500 MB. Two possible approaches are:
1. Keep a single JSON file, but stream it to and from disk in chunks to avoid the string length limit.
2. Split the cache into multiple smaller files (e.g. one file per page).
Investigate and test these two approaches (plus others if they become apparent), write up the results, and choose the best one.
Acceptance Criteria