alexandec closed this issue 3 years ago
I have a working approach to streaming 1 GB of JSON into a single file. Now working on reading the data back in a similar fashion.
To put 1 GB of data into a single JSON file, I opted to write each page (`nodeQuery` entity) to the file one at a time. The other queries are smaller, so we can stringify each of them in one chunk.
```js
const fs = require('fs');

function saveCacheFile(cacheFilePath, pages) {
  const outputStream = fs.createWriteStream(cacheFilePath);
  outputStream.write('{\n "data": {\n');
  const queries = Object.keys(pages.data);
  queries.forEach((query, queryIndex) => {
    const props = pages.data[query];
    outputStream.write(`"${query}": `);
    if (query === 'nodeQuery') {
      // Write each entity separately to avoid the ~500 MB string length limit
      outputStream.write('{\n"entities": [\n');
      props.entities.forEach((entity, entityIndex) => {
        outputStream.write(JSON.stringify(entity, null, 2));
        if (entityIndex < props.entities.length - 1) outputStream.write(',');
        outputStream.write('\n');
      });
      outputStream.write(']\n}');
    } else {
      // These queries are smaller so we can just stringify them for simplicity
      outputStream.write(JSON.stringify(props, null, 2));
    }
    // Comma between top-level queries, but not after the last one
    if (queryIndex < queries.length - 1) outputStream.write(',');
    outputStream.write('\n');
  });
  outputStream.write('}\n}\n');
  outputStream.end();
}
```
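Since the output is assembled from manual string writes rather than a single `JSON.stringify` call, it's worth sanity-checking that the chunks concatenate into valid JSON. Here's a minimal in-memory sketch of the same chunk sequence with tiny sample data (the `routerQuery` name is just an illustrative stand-in for one of the smaller queries):

```js
// Build the same chunk sequence saveCacheFile streams to disk, but in memory,
// then confirm the concatenation parses as JSON. Small data only.
function buildChunks(pages) {
  const chunks = ['{\n "data": {\n'];
  const queries = Object.keys(pages.data);
  queries.forEach((query, queryIndex) => {
    chunks.push(`"${query}": `);
    if (query === 'nodeQuery') {
      chunks.push('{\n"entities": [\n');
      const { entities } = pages.data[query];
      entities.forEach((entity, i) => {
        chunks.push(JSON.stringify(entity, null, 2));
        if (i < entities.length - 1) chunks.push(',');
        chunks.push('\n');
      });
      chunks.push(']\n}');
    } else {
      chunks.push(JSON.stringify(pages.data[query], null, 2));
    }
    if (queryIndex < queries.length - 1) chunks.push(',');
    chunks.push('\n');
  });
  chunks.push('}\n}\n');
  return chunks;
}

const sample = {
  data: {
    nodeQuery: { entities: [{ entityId: '1' }, { entityId: '2' }] },
    routerQuery: { routes: ['/a', '/b'] },
  },
};
const parsed = JSON.parse(buildChunks(sample).join(''));
console.log(parsed.data.nodeQuery.entities.length); // 2
```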
When it comes to reading the 1 GB JSON file, I found that the JSONStream module handles it well:
```js
const fs = require('fs');
const JSONStream = require('JSONStream');

async function loadCacheFile(cacheFilePath) {
  return new Promise((resolve, reject) => {
    const stream = fs.createReadStream(cacheFilePath, { encoding: 'utf8' });
    // parse('*') emits each value under the root object; the first (and only)
    // one here is the "data" object, so we resolve as soon as it arrives
    const parser = JSONStream.parse('*').on('data', data => {
      resolve({ data });
    });
    parser.on('error', reject);
    stream.on('error', reject);
    stream.pipe(parser);
  });
}
```
Next step is to benchmark this and compare with writing each page to a separate file.
For comparison, here's the implementation to store each page in a separate JSON file:
```js
const path = require('path');
const fs = require('fs-extra'); // emptyDirSync/outputJsonSync come from fs-extra

function createJSONCacheFiles(cacheDir, drupalPages) {
  const pageCachePath = path.join(cacheDir, DRUPAL_PAGE_CACHE_DIR);
  const otherCachePath = path.join(cacheDir, DRUPAL_OTHER_CACHE_DIR);
  fs.emptyDirSync(cacheDir);
  // Store each nodeQuery entity in a separate file to avoid hitting the size limit
  drupalPages.data.nodeQuery.entities.forEach(page => {
    const fullPath = path.join(pageCachePath, page.entityId);
    fs.outputJsonSync(fullPath, page, { spaces: 2 });
  });
  // The other queries are smaller so we can store the entire query in one file
  Object.keys(drupalPages.data)
    .filter(queryName => queryName !== 'nodeQuery')
    .forEach(queryName => {
      const fullPath = path.join(otherCachePath, queryName);
      fs.outputJsonSync(fullPath, drupalPages.data[queryName], { spaces: 2 });
    });
}
```
And here's the code to load the individual JSON cache files into memory:
```js
const path = require('path');
const fs = require('fs-extra'); // readJsonSync comes from fs-extra

function loadJSONCacheFiles(cacheDir) {
  const pageCachePath = path.join(cacheDir, DRUPAL_PAGE_CACHE_DIR);
  const otherCachePath = path.join(cacheDir, DRUPAL_OTHER_CACHE_DIR);
  const result = { data: { nodeQuery: { entities: [] } } };
  // Load the page query cache, one file per page
  const files = fs.readdirSync(pageCachePath);
  files.forEach(file => {
    const contents = fs.readJsonSync(path.join(pageCachePath, file));
    result.data.nodeQuery.entities.push(contents);
  });
  // Load the other query caches, one file per query
  const otherFiles = fs.readdirSync(otherCachePath);
  otherFiles.forEach(file => {
    const contents = fs.readJsonSync(path.join(otherCachePath, file));
    result.data[file] = contents;
  });
  return result;
}
```
See implementation for both methods above.
Single file method — Read: 26 seconds, Write: 11 seconds
Multiple file method — Read: 9 seconds, Write: 14 seconds
Based on these local benchmarks, the single file JSON caching approach reduces write time by 3 seconds but increases read time by 17 seconds compared with the multiple file approach. The time difference between the two approaches is negligible in the context of the entire content build, which takes about 40 minutes with 48k nodes.
It's possible that as the cache grows we could eventually hit the filesystem's maximum file size. Most likely this is at least 2 TB, which is well beyond the amount of data we will ever have in the cache. I asked about this and will report back with any findings.
The single file method is what we currently use, so fewer code changes are required. Also, for operations like clearing the cache or checking cache state, a single file is simpler. For those reasons I favor the single file approach.
Both methods write and read files synchronously. There are potential speed improvements to be had with an async approach. Also, I haven't benchmarked other JSON streaming read approaches beyond the JSONStream module. It's possible a different approach would be faster.
However, since the total execution time is low for both methods, I recommend focusing optimization effort on other parts of the build process first. Notably the "Apply layouts" step takes about 16 minutes and the "Parse a virtual DOM from every .html file" step takes about 10 minutes.
Another useful optimization would be to add a `--drupal-no-local-cache` option to disable the local cache entirely (see discussion with Nick's suggestion). While the cache is useful for local testing, it's not needed during Jenkins builds. These builds only run once and can take care of everything in memory without writing to or reading from the filesystem cache. I created a ticket to add this option.
I encountered occasional tinyliquid timeouts at 48k nodes. These warrant further investigation because they cause the build to fail. We are already attempting to increase the timeout as follows:
```js
new liquid.Context({ timeout: 1200000 });
```
However I suspect this is a no-op and the default 2 minute timeout is still in place. 2 minutes may be insufficient when too many files are processed in parallel. We could try adding the following to our metalsmith-layouts plugin configuration:
```json
"engineOptions": {
  "timeout": 1200000
}
```
Then we could add debugging output to the tinyliquid module to verify that the larger 20 minute timeout is in place. Here's the relevant ticket.
@timwright12 any thoughts on these findings, or improvements I can make?
@alexandec (cc @meganhkelley )
Cache
I think the `--drupal-no-local-cache` option makes sense. I would just confirm that we're not using it in the pipeline anywhere (https://github.com/department-of-veterans-affairs/vets-website/blob/master/jenkins/common.groovy#L384) - I don't think we are.
Timeouts
For the `new liquid.Context({ timeout: 1200000 })`: that was added because the tinyliquid templates were taking too long. If `engineOptions` works, I'm fine with that as well.
Recommendations
I actually lean a bit toward the separate-cache-file-per-page approach because it feels like it would be a quicker path if we needed to implement incremental builds (checking each file and only updating the changed ones). If you don't think using a single file would prevent building incrementally, I'm fine with that. You've been the one with your head in the code.
Either way, let's document that decision in https://github.com/department-of-veterans-affairs/va.gov-team/tree/master/platform/engineering/frontend (just a doc like `decision-metalsmith-cache.md` - whatever you want to call it).
I think you could just use the content of your last three comments ^ as the content
@timwright12 thanks! I brought up that incremental build question at the Content Release Engineering Sync meeting last Tuesday. Their current thinking is to have Drupal determine what needs to be updated, pipe just that content into the build process, and replace just those static HTML pages on the site. The local JSON cache wouldn't be needed with that approach so they didn't think the caching method mattered.
The more I think about it, the local cache really only seems relevant for local development, to save time. For prod/staging/dev we'll either be generating the entire site or doing small incremental updates, but either way, Drupal will provide the latest data as needed for that task.
The documentation you mention sounds good, I'll add that.
PR documenting this decision: https://github.com/department-of-veterans-affairs/va.gov-team/pull/23787
PR is merged, CMS team is good with the decision. Moving on to implementation.
Details
Recommended in #20764
Currently we store the entire GraphQL cache in a single JSON file. File size is limited to 500 MB due to the string length limit in the V8 engine used by Node.js. As we scale we'll need to expand the cache beyond 500 MB. Two possible approaches are:
1. Keep a single JSON file, but stream it to and from disk in chunks to avoid the string length limit.
2. Split the cache into multiple smaller files (e.g. one file per page).
Investigate and test these two approaches (plus others if they become apparent), write up the results, and choose the best one.
Acceptance Criteria