POC: CMS is able to collect cache invalidation information

timcosgrove commented 1 year ago

Description

Based on the findings of #14399, we should create a working POC that demonstrates collecting invalidated cache items when an entity or config is saved.

Potential solutions

In addition to solutions identified in #14399, there are existing contributed modules that can potentially provide this functionality. The Purge contributed module looks to be a very good candidate for this. Purge is supported by modules that adapt Purge's framework for specific technologies, e.g. Varnish or Cloudfront.

Purge URL Queuer is a solution for platforms that do not support cache tag-based invalidation. Note that this module warns against use with large sites.

Generic HTTP Purger is a generalized cache tag purging solution for systems that do not have a dedicated Purge module. Configured correctly, it can make an HTTP request

The Next-Drupal Revalidator plugin looks to be a ready-made solution for handling the HTTP requests to Next.js. These plugins could be leveraged or extended to make the actual HTTP request for cache invalidation.

Note: Next-Drupal is likely to have a solution for this problem. Follow up this conversation in #nextjs in the Drupal Slack: https://drupal.slack.com/archives/C01E36BMU72/p1690928284178899

Acceptance Criteria

[x] Examine Purge and its family of modules to see if it can provide cache invalidation information
- https://github.com/department-of-veterans-affairs/va.gov-cms/issues/14603#issuecomment-1678139555
[x] Research Purge URL Queuer to see if it is a suitable solution for purging via URL/path
- purge_queuer_url is not a suitable solution. See https://github.com/department-of-veterans-affairs/va.gov-cms/issues/14603#issuecomment-1678017521
[x] Assess D10 readiness of Purge, Purge URL Queuer, and Generic HTTP Purger
- https://github.com/department-of-veterans-affairs/va.gov-cms/issues/14603#issuecomment-1671926983
[x] Assess how Next-Drupal Revalidator can be leveraged to send the request
- https://github.com/department-of-veterans-affairs/va.gov-cms/issues/14603#issuecomment-1678110756
[x] Demonstrate that Purge can correctly identify and collect items to purge. This can be as simple as creating clear testing steps and then examining the Purge queue. This does not need to be done in the VA Drupal instance; a vanilla Drupal instance with contrib modules is fine
- https://github.com/department-of-veterans-affairs/va.gov-cms/issues/14603#issuecomment-1678197889

Team

Please check the team(s) that will do this work.

[ ] CMS Team
[ ] Public Websites
[ ] Facilities
[ ] User support
[x] Accelerated Publishing

timcosgrove commented 1 year ago

Daniel

schiavo commented 1 year ago

One possibility is to create our own custom purger. According to the Purge README, you add items to the queue as follows:

Queueing

Adding invalidations to the queue is the simplest use case and requires a queuer object so that the queue knows who is adding the given items.

$purgeInvalidationFactory = \Drupal::service('purge.invalidation.factory');
$purgeQueuers = \Drupal::service('purge.queuers');
$purgeQueue = \Drupal::service('purge.queue');

$queuer = $purgeQueuers->get('myqueuer');
$invalidations = [
  $purgeInvalidationFactory->get('tag', 'node:1'),
  $purgeInvalidationFactory->get('tag', 'node:2'),
  $purgeInvalidationFactory->get('path', 'contact'),
  $purgeInvalidationFactory->get('wildcardpath', 'news/*'),
];

$purgeQueue->add($queuer, $invalidations);

schiavo commented 1 year ago

Processing example:

Queue processing

Processing items from the queue is handled by processors, which users can add and configure according to their configuration. In essence, processors invoke the following code to retrieve a dynamically calculated chunk of items from the queue and feed those to the purgers service:

use Drupal\purge\Plugin\Purge\Purger\Exception\CapacityException;
use Drupal\purge\Plugin\Purge\Purger\Exception\DiagnosticsException;
use Drupal\purge\Plugin\Purge\Purger\Exception\LockException;
$purgePurgers = \Drupal::service('purge.purgers');
$purgeProcessors = \Drupal::service('purge.processors');
$purgeQueue = \Drupal::service('purge.queue');

$claims = $purgeQueue->claim();
$processor = $purgeProcessors->get('myprocessor');
try {
  $purgePurgers->invalidate($processor, $claims);
}
catch (DiagnosticsException $e) {
  // Diagnostic exceptions happen when the system cannot purge.
}
catch (CapacityException $e) {
  // Capacity exceptions happen when too much was purged during this request.
}
catch (LockException $e) {
  // Lock exceptions happen when another code path is currently processing.
}
finally {
  $purgeQueue->handleResults($claims);
}

schiavo commented 1 year ago

D10 Readiness

purge = Yes
purge_queuer_url = No, but there is a working patch in the issue queue
purge_purger_http = Yes

schiavo commented 1 year ago

[ ] Demonstrate that Purge can correctly identify and collect items to purge

URLs get sent to the queue using purge_queuer_url.

Need to confirm that referenced entities, and entities that reference the saved entity, get sent. Given that purge_queuer_url works with URLs and that entities often do not have URLs, it looks like storing entities in the queue might not be effective.

schiavo commented 1 year ago

When using purge_queuer_url, URLs get queued as expected.

But the list of URLs does not include values from reference fields or parent references. This may be acceptable depending on how the URLs get processed. Also, the above list is inaccurate, since the URL updated is /node/3.

schiavo commented 1 year ago

If the queue is properly populated, then there is a reasonable chance that the purge_purger_http purger will be able to send the correct data to Next.

schiavo commented 1 year ago

After clearing the Drupal cache and running a test of the queuer, got this result:

Node 1 is the entity updated
Node 6 is an entity in the entity reference field on node 1
Node 2 is a second node that references 6

In this scenario the queue contains the correct nodes to invalidate but also contains extra data and several invalid urls.

Follow up --> Find how the invalid urls get in the queue

The URLs are not invalid. "https://drupal9.ddev.site/bene-caecus-duis-genitus-luptatum-obruo-pala-vindico" is an alias for a taxonomy term.

Find how purge determines what data to include

schiavo commented 1 year ago

Another test.

/article/interdico-ludus-saepius-suscipit = the node that has been updated
node/2 and node/1 reference the node updated
node/6 is an alias of the node updated

These results make more sense.

schiavo commented 1 year ago

Concerns about purge_queuer_url

In order to populate the list of urls in the registry ...

You need to spider your site to be able to queue URLs or paths, for example run: 'wget -r -nd --delete-after -l100 --spider http://site/'.

The queuer then references the registry purge_queuer_url table to match and populate the url in the queue.

` // When there are still tags left, attempt to lookup URLs and queue them.
if (count($tags)) {
  if ($urls_and_paths = $this->registry->getUrls($tags)) {
    $invalidations = [];

    // Iterate the matches and add URL/Path invalidations correspondingly.
    foreach ($urls_and_paths as $url_or_path) {
      $invalidation_type = strpos($url_or_path, '://') ? 'url' : 'path';
      try {
        $invalidations[] = $this->purgeInvalidationFactory
          ->get($invalidation_type, $url_or_path);
      }`

How often does the reference table need to be populated? Why use this build process instead of referencing aliases? The table is not populated when new content is added.

schiavo commented 1 year ago

One more thing to think about. The current CMS is using advancedqueue. The motivation for using advancedqueue is its de-duping functionality.

Currently the build request populates the content release queue.

So another option for sending purge requests to Next would be to set up an AP queue in advancedqueue and write a custom module to send the purge request to Next.

timcosgrove commented 1 year ago

purge_queuer_url uses a 'registry' of URLs which maps paths to cache tags contained in the response for that URL.

This registry is hard to fill in our scenario, since it is effectively an event subscriber which adds items to the registry when a response for that item is sent. In other words, someone needs to visit the URL in question in order for it to be added; or else, the URL needs to be requested somehow. The module itself suggests spidering the site with wget or the like for this.

This is problematic for our needs because our site is effectively unspiderable without a sitemap. There is no central menu which leads to all pages via spidering. In particular, there is no way currently to access a list of all existing medical systems without authenticating to the CMS.

More problematically, the registry is very volatile. If a URL is requested and the response is not cacheable, the URL is removed from the registry: https://git.drupalcode.org/project/purge_queuer_url/-/blob/8.x-1.x/src/StackMiddleware/UrlRegistrar.php#L177

This means for example that if a CMS editor visited a page in the CMS, that URL would be removed from the registry and would become disqualified from being added to the URL queue, at least until that URL was spidered again via an anonymous request.

This alone disallows this module from use. We may use the basic underlying concept somehow - i.e., maintaining a lookup of URLs and the cache tags that apply to those URLs - but this module will not be the mechanism for it.

timcosgrove commented 1 year ago

For a cache tags-based solution, Drupal will need to provide cache tags in its response header. Under normal circumstances, these headers are not included in the response. Purge module provides a framework for including them, and a sample implementation in the purge_purger_http_tagsheader module.

Note that if cache tags are included in the response, they are expected to be stripped out by whatever is brokering the response to the end user. For example, if Varnish is the layer between the end user and Drupal, Drupal can provide the cache tags in the response that is passed to Varnish; it is expected that Varnish will take the information in the cache tags header, make whatever use of it it needs to, and then strip the headers out. Passing the headers to the end user is considered a potential leak of protected information.

timcosgrove commented 1 year ago

Providing cache tags in the response almost universally causes a Next.js failure currently when attempting to build pages with CMS data. Reason: adding cache tags to the response headers causes the response headers to grow beyond 8k, which is the current limit allowed by node.js by default.

This can be modified by passing a node option to any run of a next command, for example:

"NODE_OPTIONS='--max-http-header-size=30000' next dev"

This would ideally be built into all runs of any next command in our build, rather than needing to add the NODE_OPTIONS bit in front of every single command. Adding the node options to .env.local for example solves the problem, but ideally there'd be a better place to store the env variable.
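To help pick a value for --max-http-header-size, here is a rough sketch that estimates how many bytes a cache tag header would occupy. It assumes the tags are ASCII and space-separated in a single header; the function names, the header name used in the usage note, and the 8k default (taken from the figure quoted above) are illustrative assumptions, not part of any library.

```typescript
// Assumed default limit, taken from the 8k figure quoted in this thread.
const ASSUMED_DEFAULT_LIMIT_BYTES = 8 * 1024;

// Estimate the size of one "Name: value\r\n" header line.
// Assumes ASCII tags, so string length equals byte length.
function tagsHeaderBytes(headerName: string, tags: string[]): number {
  const value = tags.join(" ");
  return headerName.length + 2 + value.length + 2;
}

// True if the tag header alone stays within the limit. The limit applies to
// all headers combined, so a "true" here is necessary but not sufficient.
function fitsWithinLimit(
  headerName: string,
  tags: string[],
  limitBytes: number = ASSUMED_DEFAULT_LIMIT_BYTES
): boolean {
  return tagsHeaderBytes(headerName, tags) <= limitBytes;
}
```

For example, fitsWithinLimit("Purge-Cache-Tags", tags, 30000) would check a tag list against the larger limit used in the command above.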

timcosgrove commented 1 year ago

Use of drupalClient as the foundation for our queries to the CMS will be problematic if we are expecting cache tags to be included in the response header. Under the hood, drupalClient buries the actual request/response deep within its own mechanisms, and only returns data that is specified in the query and that gets returned in the JSON response to the query request. Every part of the response but the JSON body is discarded. This means that headers cannot be extracted from the response for our own purposes.

A potential alternative is to include cache tag information in the JSON:API response itself. If we pass that information to Next.js via the body, we can make good use of it. Access to this cache tag information should probably depend on authentication or some sort of shared secret, so that JSON:API is not leaking cache tag information.
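For comparison, if we bypassed drupalClient and called fetch directly, reading the tags from the header might look like the sketch below. The header name Purge-Cache-Tags and its space-separated format are assumptions to verify against an actual CMS response; parseCacheTags and fetchWithCacheTags are hypothetical helper names, not part of next-drupal.

```typescript
// Assumed header name; confirm against the CMS response before relying on it.
const ASSUMED_TAGS_HEADER = "Purge-Cache-Tags";

// Split a space-separated cache tag header value into a de-duplicated list.
function parseCacheTags(headerValue: string | null): string[] {
  if (!headerValue) return [];
  const tags = headerValue.split(/\s+/).filter((tag) => tag.length > 0);
  return Array.from(new Set(tags));
}

// Fetch a resource directly so that the response headers stay accessible.
async function fetchWithCacheTags(
  url: string,
  headerName: string = ASSUMED_TAGS_HEADER
): Promise<{ data: unknown; cacheTags: string[] }> {
  const response = await fetch(url);
  const data: unknown = await response.json();
  const cacheTags = parseCacheTags(response.headers.get(headerName));
  return { data, cacheTags };
}
```

The same parseCacheTags helper would work on a value carried in the JSON body instead, if we go that route.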

timcosgrove commented 1 year ago

Next-Drupal's ConfigurableRevalidatorBase.php provides a basic foundation on which we could build a purge processor within the Purge module framework.

These plugins are aware of the Next-Drupal site settings for each entity type that Next-Drupal manages, and so can broker a connection to the Next server that is providing a front-end for that particular entity to revalidate it. The plugins fire upon entity events; this event subscription is handled by next.module.

It should be noted that this is all the plugin base provides; it doesn't even provide a mechanism to make the request to the Next server in question for revalidation, though setting that up is trivial (see ->revalidate() in the example Path.php plugin that is provided).

We would use this plugin model if we wanted to specifically respond to entity events - CRUD - with revalidation. This is a legitimate way of dealing with revalidation but it is limited because it does not easily provide greater access to all paths that are affected by an item changing.

If we are interested in pursuing the cache tag invalidation mechanisms as the basis for our Next.js revalidation setup, we would likely not use this plugin system.

timcosgrove commented 1 year ago

Purge and its family of modules provide a mechanism by which cache tag invalidation information can be collected and acted upon. It consists of two kinds of functions.

Queuers are used to place items that need invalidation into a processing queue. These usually take the form of event subscribers which respond to system events; however, it is also possible to add items to the queue via drush command, via code, and other mechanisms.

Typically what is added to the queue are individual cache tags. The most straightforward example of this is the Core Tags queuer (https://git.drupalcode.org/project/purge/-/tree/8.x-3.x/modules/purge_queuer_coretags). This listens for CacheTagInvalidation events and acts on them, adding invalidated tags to the queue (note this is not directly an EventSubscriber).

Purgers process the queue. In almost all cases a Purger will take the information in the queue and use it to construct an HTTP request to the system that needs purging. These HTTP requests are specific to the system being purged. Varnish purge requests take one form, Cloudfront another, Next.js revalidation requests still another.

So, Purge is absolutely capable of collecting cache tag information and then sending that to systems which use that information to invalidate cached objects.

timcosgrove commented 1 year ago

The default behavior of Purge is to collect invalidated cache tags and queue them up to send to another system for purging. It does not attempt to do anything apart from sending invalidated tags.

Let's call the 'other system' "FE Cache" - something like Varnish or Next.js caching. Purge assumes a) that Drupal is sending cache tags with every response, and b) the FE Cache is doing something with those cache tags. Ideally, the set of cache tags is saved within the FE Cache and associated with the given request URI.

When Drupal invalidates a cache tag and Purge collects and sends that cache tag information to the FE Cache, it is the job of the FE Cache to use that information to deal with its own caching.
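As a sketch of what "dealing with its own caching" could mean on the Next.js side: given a list of invalidated tags, collect the affected paths and revalidate each one. Both callbacks are hypothetical stand-ins passed in by the caller: lookupPaths represents whatever tag-to-path association the FE Cache keeps, and revalidate represents the framework's own revalidation call.

```typescript
type PathLookup = (tag: string) => string[];
type Revalidate = (path: string) => Promise<void>;

// Collect the unique paths associated with any of the invalidated tags,
// preserving the order in which they are first seen.
function collectPaths(tags: string[], lookupPaths: PathLookup): string[] {
  const paths = new Set<string>();
  for (const tag of tags) {
    for (const path of lookupPaths(tag)) {
      paths.add(path);
    }
  }
  return Array.from(paths);
}

// Revalidate every affected path once and report which paths were handled.
async function handleInvalidatedTags(
  tags: string[],
  lookupPaths: PathLookup,
  revalidate: Revalidate
): Promise<string[]> {
  const paths = collectPaths(tags, lookupPaths);
  for (const path of paths) {
    await revalidate(path);
  }
  return paths;
}
```

De-duplicating paths before revalidating matters here because a single save typically invalidates many tags that point at the same page.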

timcosgrove commented 1 year ago

Given all the above, we have a very basic POC. Steps to implement on a ddev instance.

ddev composer require drupal/purge drupal/purge_purger_http
Enable: purge, purge_drush, purge_tokens, purge_ui, purge_processor_cron, purge_queuer_coretags, purge_purger_http
Visit https://va-gov-cms.ddev.site/admin/config/development/performance/purge. Add a purger with the following settings:
- Purge tags
- Leave request be
- Leave headers be
- For body, 'set body payload', and then add [invalidations:separated_pipe] to the body payload
Ensure the queue is empty. If it's not, empty it. Open 'Queue' at the bottom of Purge UI; click the arrow on 'database' in the center; 'empty'.
Edit some page. Make a simple change to it (title is enough), set it to publish, and save.
Refresh Purge UI. There should now be some number of items in the Queue. This can be seen on the right in 'queue size'. Confirm the contents of the queue by selecting 'Inspect' on the 'Database' queue.
ddev ssh to get into the container. Run the following command: nc -k -l 8080. This uses netcat to set up a very simple listener on that port that will echo anything it receives.
Visit cron and run cron. https://va-gov-cms.ddev.site/admin/config/system/cron

With this, you should see output from the netcat listener, something like the following:

BAN / HTTP/1.1
Host: localhost:8090
user-agent: purge_purger_http module for Drupal 8.
content-type: text/plain
Content-Length: 459

menu_link_content_list|menu_link_content_list:va-palo-alto-health-care|menu_link_content:2627|config:system.menu.va-palo-alto-health-care|paragraph_list|paragraph_list:wysiwyg|paragraph:15697|paragraph:75796|paragraph_list:downloadable_file|paragraph:122287|paragraph:103566|paragraph:68717|node_list|node_list:health_care_region_detail_page|node:8019|content_moderation_state_list|content_moderation_state:8085|danse_event_list|config:httppurgersettings_list

The above is an example from local implementation of this POC.
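On the receiving end, a body like the one above could be split back into individual tags. A minimal sketch, assuming the pipe-separated format shown in the sample output; parseInvalidationBody and entityTags are hypothetical helper names.

```typescript
// Split a pipe-separated request body into individual cache tags.
function parseInvalidationBody(body: string): string[] {
  return body
    .split("|")
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);
}

// Keep only tags of the form "<prefix>:<numeric id>", e.g. "node:8019",
// for the given prefix.
function entityTags(tags: string[], prefix: string): string[] {
  const pattern = new RegExp("^" + prefix + ":\\d+$");
  return tags.filter((tag) => pattern.test(tag));
}
```

For the sample body above, entityTags(parseInvalidationBody(body), "node") would pick out node:8019 and skip the node_list tags.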

timcosgrove commented 1 year ago

Summarizing everything above:

Drupal provides decent options if we rely on cache tags to identify items to invalidate. Purge itself is very easy to set up, and configuring it to send invalidation information to a remote cache is relatively easy. The mechanism of gathering up cache tags on its own is reliable.

Drupal does not provide a ready way to find URLs from cache tag invalidation. purge_queuer_url is not a good solution. In order for Drupal to actively keep track of URL to cache tag mapping, custom functionality would need to be created. This functionality does not need to live with Drupal itself.

drupalClient does not provide a good way to extract header information from responses it receives. This is unfortunate, because Drupal by default wants to send cache tag information in response headers. In order to make cache tag information visible on the Next.js side, we need to replicate drupalClient's functionality; modify drupalClient, possibly in coordination with Chapter Three, to allow header information to be populated into the response returned by drupalClient; or, create custom code on the Drupal side to expose cache tag information in the JSON:API response body, so that the information is readily accessible.

Next.js does not currently provide a good way to work with cache tags. Association of cache tags with routes/paths is only available with the App Router, which we are currently not using. Furthermore, the mechanism for assigning cache tags to a URL/route in Next.js happens at fetch time, i.e. request time. In the Drupal paradigm, cache tags are returned with the response; they are not available until the data is received.

Also, in line with drupalClient being a package that obfuscates access to its internals, any fetching done within drupalClient is currently hard to get at to change or manipulate.

timcosgrove commented 1 year ago

Possible solution

The idea of maintaining a mapping of URLs and cache tags, such as purge_queuer_url does, is not itself a bad idea. A potential path forward could be something like:

CMS is set up to deliver cache tags with its responses to Next.js queries
Next gathers the cache tag information, along with the URL being generated, and stores it. This store needs to be persistent; it should not live in memory of the running Next.js server process, in the event that the server needs to be restarted. This store could take any number of forms; simple NoSQL key-value store seems like it would be straightforward.
When cache tags are invalidated, Purge module can be leveraged to send Next.js a list of tags to invalidate. These tags can be used with the aforementioned key-value store to find URLs that are associated and that need to be invalidated.
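The store described in the steps above can be sketched as a tag-to-URL index. This version keeps everything in an in-memory Map purely for illustration; as noted above, a real implementation would need a persistent backing store. TagUrlIndex and its method names are hypothetical.

```typescript
// In-memory stand-in for the persistent key-value store described above.
class TagUrlIndex {
  private urlsByTag = new Map<string, Set<string>>();

  // Record that a generated URL depends on the given cache tags.
  record(url: string, tags: string[]): void {
    for (const tag of tags) {
      let urls = this.urlsByTag.get(tag);
      if (!urls) {
        urls = new Set<string>();
        this.urlsByTag.set(tag, urls);
      }
      urls.add(url);
    }
  }

  // Return the sorted, de-duplicated URLs affected by any invalidated tag.
  urlsForTags(tags: string[]): string[] {
    const found = new Set<string>();
    for (const tag of tags) {
      const urls = this.urlsByTag.get(tag);
      if (urls) {
        urls.forEach((url) => found.add(url));
      }
    }
    return Array.from(found).sort();
  }
}
```

One open question this sketch does not address is cleanup: when a page is regenerated with a different set of tags, stale tag-to-URL entries would need to be removed.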
