18F / 18f.gsa.gov

The 18F website
https://18f.gsa.gov

Migrate related posts #3856

Closed beechnut closed 2 months ago

beechnut commented 4 months ago

The 18F website recommends "related posts" based on posts with similar tags and by the same author.

The Jekyll site used a generator to add "related posts" to a page based on categories/tags/authors — basically, a post was more related if it shared tags, authors, and categories.

It's not a requirement that the generator code is translated exactly from the Jekyll site, but it should serve as a good starting point for an algorithm. I don't recommend trying to understand it as-written — rather, refactor it to understand it.

The scope of work for this ticket is to write a good-enough shortcode or filter that takes a post and an optional limit, and returns the most related posts by tag and author (we don't use categories). If several posts have equal relatedness scores, sort them newest-first.
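The ranking step described above could be sketched roughly as follows. This is only an illustration, not the site's code: `rankRelated` and `scoreFn` are hypothetical names, and posts are assumed to be plain objects with a `url` and a `date` (a `Date`).

```javascript
// Hypothetical helper: rank candidate posts by relatedness to `post`.
// `scoreFn(a, b)` is assumed to return a number (higher = more related).
function rankRelated(post, posts, scoreFn, limit = 3) {
  return posts
    .filter((other) => other.url !== post.url) // never relate a post to itself
    .map((other) => ({ post: other, score: scoreFn(post, other) }))
    .filter((entry) => entry.score > 0) // drop posts with nothing in common
    // Highest score first; equal scores fall back to newest-first.
    .sort((a, b) => b.score - a.score || b.post.date - a.post.date)
    .slice(0, limit)
    .map((entry) => entry.post);
}
```

Keeping the score function as a parameter means the weighting can change without touching the sort and limit logic.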

### Acceptance smoke tests
- [ ] The ["Senior executive"](https://18f.gsa.gov/2022/07/20/senior-executives-pt1/) series posts all have each other as related posts
- [ ] The ["engineering sandwich" post](https://18f.gsa.gov/2022/09/06/18f-tech-sandwich/) has all technical / open-source related posts.

cantsin commented 3 months ago

This is a fun one.

The biggest issue here is not the plugin code itself, which I found easy to understand. The issue here is architectural: figuring out where this code should reside.

I tried splitting out the calculation (related_scores, which needs to traverse all posts and all users separately) and storing the result somewhere as a global object, so that the shortcode becomes vastly simpler: essentially, it would pull from a cached, pre-calculated related_scores data structure to do its work. But now I'm reconsidering this approach. Maybe it is simpler after all to write an 11ty plugin outright, as that keeps all the related code in one place (you would need to pass the users and posts to the plugin in .eleventy.js). Either way, the code is fairly easy to translate to JavaScript; it's just a matter of figuring out which approach we want to take here.
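For reference, the pre-calculated structure mentioned above might look something like this. It is a sketch under assumptions: `buildRelatedScores` and `scoreFn` are hypothetical names, and the shape (a `Map` from post URL to a list of scored URLs) is one possible choice rather than what the Jekyll generator used.

```javascript
// Hypothetical precomputation: build url -> [{ url, score }] once per build,
// so a shortcode or filter only has to do a lookup.
function buildRelatedScores(posts, scoreFn) {
  const scores = new Map();
  for (const post of posts) {
    const row = [];
    for (const other of posts) {
      if (other.url === post.url) continue; // skip self-comparison
      const score = scoreFn(post, other);
      if (score > 0) row.push({ url: other.url, score });
    }
    scores.set(post.url, row);
  }
  return scores;
}
```

This compares every pair of posts, so the cost grows quadratically with the number of posts; that is the work the shortcode would avoid repeating on every page render.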

I did not have time to finish this work so I will unassign myself.

beechnut commented 3 months ago

To capture our notes from yesterday's call:

Basic setup

_includes/layouts/post.html L47 should read something like:

{% assign related_posts = page | findRelatedPosts %}

Then, findRelatedPosts should be a function that takes the page variable (which contains data from the current post), and using postsCollection aka collections.posts, finds the 3 related posts to be shown in the footer.
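One way this wiring could look is sketched below. It assumes `page` exposes at least `url`, that the posts collection is passed to the filter as an argument, and that collection items expose `url`, `date`, and `data.tags`. `relatedPostsPlugin` is a hypothetical name, and the shared-tag count is only a placeholder for the real relatedness score.

```javascript
// Hypothetical filter: `posts` is assumed to be collections.posts.
function findRelatedPosts(page, posts, limit = 3) {
  // Look the current post up in the collection to get at its tags.
  const current = posts.find((post) => post.url === page.url);
  if (!current) return [];
  const currentTags = new Set(current.data.tags || []);
  // Placeholder relatedness: number of tags shared with the current post.
  const sharedTags = (post) =>
    (post.data.tags || []).filter((tag) => currentTags.has(tag)).length;
  return posts
    .filter((post) => post.url !== page.url && sharedTags(post) > 0)
    .sort((a, b) => sharedTags(b) - sharedTags(a) || b.date - a.date)
    .slice(0, limit);
}

// Hypothetical plugin wrapper; in .eleventy.js it would be registered with
// eleventyConfig.addPlugin(relatedPostsPlugin).
function relatedPostsPlugin(eleventyConfig) {
  eleventyConfig.addFilter("findRelatedPosts", findRelatedPosts);
}
```

With the collection passed explicitly, the template line would become `{% assign related_posts = page | findRelatedPosts: collections.posts %}`.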

The algorithm

We don't need to exactly re-implement the Jekyll code — prefer simplicity to replicating that behavior.

My only input is that if posts share both authors and some tags, that weight should be substantially (more than 2x) greater than just sharing an author or just sharing some tags.
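A scoring function that follows this rule might look like the sketch below. Only the "more than 2x when both match" requirement comes from the thread; the multiplier of 3, the function names, and the `tags` / `authors` array fields are assumptions for illustration.

```javascript
// Hypothetical multiplier applied when posts share both tags and authors.
const BOTH_MULTIPLIER = 3;

// Count how many items of `a` also appear in `b`.
function countShared(a = [], b = []) {
  const lookup = new Set(b);
  return a.filter((item) => lookup.has(item)).length;
}

function relatednessScore(post, other) {
  const sharedTags = countShared(post.tags, other.tags);
  const sharedAuthors = countShared(post.authors, other.authors);
  const base = sharedTags + sharedAuthors;
  // Sharing both kinds of metadata is boosted beyond the plain sum.
  return sharedTags > 0 && sharedAuthors > 0 ? base * BOTH_MULTIPLIER : base;
}
```

With these weights, one shared tag or one shared author scores 1 each, while one shared tag plus one shared author scores 6.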

Caching

Once the basic algorithm is done, we want to consider caching the related posts. The only time we really need to recalculate related posts is when a post is added/removed, or when post tags have changed — recalculating related posts on every build will just be time added to the build.

A shortcut might be to re-cache any time anything in the posts collection (content/posts/*.md) is added/removed/modified. We can use the differ classes in lib/ to list changed files, and then check whether any of them matches the posts collection pattern (a detect, in Ruby parlance).
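The matching step could be as small as the sketch below. It does not use the differ classes in lib/, whose interface is not shown in this thread; `changedFiles` is assumed to be an array of repo-relative paths from whatever produces the diff, and `postsChanged` is a hypothetical name.

```javascript
// Matches content/posts/*.md: exactly one path segment with an .md extension.
const POSTS_PATTERN = /^content\/posts\/[^/]+\.md$/;

// Returns true if any changed file belongs to the posts collection.
function postsChanged(changedFiles) {
  return changedFiles.some((file) => POSTS_PATTERN.test(file));
}
```

Ruby's `detect` returns the first matching element, which corresponds to `Array.prototype.find` in JavaScript; `some` is used here because only a yes/no answer is needed to decide whether to rebuild the cache.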

The cached data can be stored in .cache/ as related-posts-{timestamp? hash?}.(json|csv), and we can add the cached related posts as a collection in config/collections.js.
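If the hash option is chosen over the timestamp, naming and writing the file might look like this sketch. The function names, the 12-character SHA-256 prefix, and the `url` / `tags` / `authors` fields are all assumptions.

```javascript
const crypto = require("crypto");
const fs = require("fs");
const path = require("path");

// Hash only the inputs that affect relatedness (url, tags, authors), so
// edits to post bodies alone produce the same file name.
function cacheFileName(posts) {
  const fingerprint = posts
    .map((p) => [p.url, [...(p.tags || [])].sort(), [...(p.authors || [])].sort()])
    .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));
  const hash = crypto
    .createHash("sha256")
    .update(JSON.stringify(fingerprint))
    .digest("hex")
    .slice(0, 12);
  return `related-posts-${hash}.json`;
}

// Writes `related` (the url -> related posts mapping) and returns the path.
function writeCache(cacheDir, posts, related) {
  fs.mkdirSync(cacheDir, { recursive: true });
  const file = path.join(cacheDir, cacheFileName(posts));
  fs.writeFileSync(file, JSON.stringify(related, null, 2));
  return file;
}
```

A content hash has the side benefit that a build can tell the cache is stale by checking whether the expected file name already exists.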

beechnut commented 2 months ago

Some more thoughts on caching:

.cache/related-posts-{timestamp}.json

{
  "/2022/07/20/senior-executives-pt1/": [
    { 
      "url": "/2022/08/25/senior-executives-pt5/",
      "title": "Senior executives part 5: Use stories as leading indicators",
      "excerpt": "Executives often rely on productivity metrics to measure success, but these measures can obscure whether the software is actually working for users. Stories are a better resource to build a strategy between a senior executive and a product team. This is part five in a series on how senior executive and tech teams can be better allies."
    },
    { "url": , "title": , "excerpt": },
    { "url": , "title": , "excerpt": },
  ],
}

The cache holds just the data needed for presentation: title, url, and excerpt. Each key is a post url, and its value is the essential linking data for that post's three related posts.
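Building that structure from the posts collection might look like the sketch below. It assumes Eleventy-style collection items (`url` on the item, front matter such as `title` and `excerpt` under `data`); `toCacheEntry`, `buildCache`, and the `relatedFor` callback are hypothetical names.

```javascript
// Keep only what the template needs to render a related-post link.
function toCacheEntry(post) {
  return {
    url: post.url,
    title: post.data.title,
    excerpt: post.data.excerpt,
  };
}

// `relatedFor(post)` is assumed to return related posts, most related first.
function buildCache(posts, relatedFor, limit = 3) {
  const cache = {};
  for (const post of posts) {
    cache[post.url] = relatedFor(post).slice(0, limit).map(toCacheEntry);
  }
  return cache;
}
```

Because the entries are plain strings, the result can go straight through `JSON.stringify` without pulling in template content or other build-time objects.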

Design goals: read the JSON file once, keep in memory during build — probably just in .eleventy.js to start.

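const fs = require("fs")

const latestCacheFile = TODO read the file
// Read and parse the cache once, when the config is loaded
const cache = JSON.parse(fs.readFileSync(latestCacheFile, "utf8"))

// Look up the precomputed related posts for the current page
const relatedPosts = (page) => cache[page.url]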
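The "read the file" step left as a TODO above could be filled in along these lines. This is a sketch under assumptions: `findLatestCacheFile` and `loadCache` are hypothetical names, and "latest" is judged by file modification time rather than by parsing the timestamp in the name, since the naming scheme is still undecided.

```javascript
const fs = require("fs");
const path = require("path");

// Returns the path of the most recently modified related-posts-*.json in
// `cacheDir`, or null if the directory or a matching file does not exist.
function findLatestCacheFile(cacheDir) {
  if (!fs.existsSync(cacheDir)) return null;
  const candidates = fs
    .readdirSync(cacheDir)
    .filter((name) => /^related-posts-.+\.json$/.test(name))
    .map((name) => path.join(cacheDir, name));
  if (candidates.length === 0) return null;
  return candidates.reduce((newest, file) =>
    fs.statSync(file).mtimeMs > fs.statSync(newest).mtimeMs ? file : newest
  );
}

// Falls back to an empty cache so lookups return undefined instead of throwing.
function loadCache(cacheDir) {
  const file = findLatestCacheFile(cacheDir);
  return file ? JSON.parse(fs.readFileSync(file, "utf8")) : {};
}
```

Called once near the top of .eleventy.js (for example `const cache = loadCache(".cache")`), this keeps the parsed data in memory for the rest of the build.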

Usage:

{% comment %}
Obviously in the site the HTML is different but
{% endcomment %}
{% assign related_posts = page | relatedPosts %}
{% for post in related_posts %}
  <a href="{{ post.url | url }}">{{ post.title }}</a>
  <p>{{ post.excerpt }}</p>
{% endfor %}

beechnut commented 2 months ago

We have a branch where things are generally working. To wrap this up, we need to:

I think we're another day or so from completion, but we also just got staffed on projects, so, TBD.