freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Review sitemap variables and performance #4382

Open mlissner opened 2 months ago

mlissner commented 2 months ago

We've engaged a company to help us rank better in generic searches, and one of their early findings is that our sitemaps need work.

A few suggestions they've made are:

They're not great at understanding our sitemaps, and frankly that's fair, since the sitemaps aren't loading. So I'm going to make some notes here about how they work (and don't).

mlissner commented 2 months ago

We have the following sitemaps:

| Sitemap | Description | changefreq | lastmod | priority | limit |
|---|---|---|---|---|---|
| /sitemap-oa.xml | The oral argument sitemap | monthly | obj.date_modified [1] | 0.4 | 50,000 |
| /sitemap-blocked-audio.xml | Contains oral argument audio files that had noindex set on them in the last 30 days. This exists to encourage Google to crawl these items (so Google can stop showing them). | daily | obj.date_modified | 0.6 | 50,000 |
| /sitemap-o.xml | The opinions sitemap | yearly | obj.date_modified | 0.5 | 50,000 |
| /sitemap-blocked-opinions.xml | Blocked opinions, like for OA | daily | obj.date_modified | 0.6 | 50,000 |
| /sitemap-r.xml | Federal dockets. Limited to items filed in last 30 days or with views greater than 10 [2] | weekly | obj.date_modified | scaled based on view count from 0.3 for unviewed to 0.65 for > 1,000 views | 50,000 |
| /sitemap-blocked-dockets.xml | Dockets that had noindex set on them in the last 30 days | daily | obj.date_modified | 0.6 | 50,000 |
| /sitemap-p.xml | For judges (aka People) | monthly | obj.date_modified | 0.5 | 50,000 |
| /sitemap-disclosures.xml | For judicial financial disclosures | yearly | obj.date_modified | 0.5 | none, apparently [3] |
| /sitemap-visualizations.xml | For visualizations. This really doesn't matter. | yearly | obj.date_modified | 0.4 | none! |
| /sitemap-simple.xml | For simple flat pages, like help pages | varies based on page, but mostly set to "yearly" | not set | varies from 0.1 to 0.7 | n/a, only a couple dozen pages |

1 This is the last time it was updated in our DB, but it doesn't necessarily represent the last relevant update time. This value is often updated when something silly happens to an item, like its view count being incremented or its title being tweaked.

2 The idea here is that if something is new it should show up. If it has more than ten views, it should show up. So this is a list of items that are new (and haven't gotten views yet) or things that have gotten at least ten views within 30 days.

3 This is surprising! I don't know what this would do, but it does seem to be paginated if you go to page=2, or whatever.
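
The docket rules from the table and note 2 can be sketched in Python. This is only an illustration, not the project's code: the function names, the `filed_days_ago` and `view_count` parameters, and the linear interpolation between 0.3 and 0.65 are all assumptions. The notes above only give the endpoints of the priority scale and the inclusion rule.

```python
def docket_priority(view_count: int) -> float:
    """Scale sitemap priority from 0.3 (unviewed) to 0.65 (1,000+ views).

    Linear interpolation is an assumption; only the endpoints are documented.
    """
    low, high, cap = 0.3, 0.65, 1000
    clamped = max(0, min(view_count, cap))  # keep the count within 0..cap
    return round(low + (high - low) * clamped / cap, 2)


def include_docket(filed_days_ago: int, view_count: int) -> bool:
    """Include dockets filed in the last 30 days or with more than 10 views."""
    return filed_days_ago <= 30 or view_count > 10
```

Under these assumptions, a brand-new docket with no views is listed at the lowest priority, and an old docket only stays in the sitemap once it passes the view threshold.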


Finally, /sitemap.xml is our sitemap index. In theory, it just links to all the others, using pagination where needed (e.g., /sitemap-disclosures.xml, /sitemap-disclosures.xml?page=2, etc.).
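
As a rough sketch of what the index has to produce for one section, assuming the URL pattern shown above and the 50,000-item limit from the table (the function name and parameters are hypothetical):

```python
import math


def sitemap_index_urls(section: str, item_count: int, limit: int = 50_000) -> list[str]:
    """Return the URLs a sitemap index would list for one section."""
    # Always emit at least one page, even for an empty section.
    pages = max(1, math.ceil(item_count / limit))
    base = f"/sitemap-{section}.xml"
    # Page 1 has no query string; later pages use ?page=N.
    return [base if n == 1 else f"{base}?page={n}" for n in range(1, pages + 1)]
```

The only input that is expensive to get is `item_count`. For example, a hypothetical count of 16,600,000 items gives exactly 332 pages, and one more item adds page 333.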

Generating this page has two challenges:

  1. This page has links to hundreds of pages of sitemaps. To know how to paginate, it needs to count all the objects it's linking to. That way it can determine that there's a page 332, but no page 333. These count queries can be very slow, but we might be able to do better here using Elasticsearch, which can provide approximate query counts quickly.
  2. This page has a lastmod value for every linked sitemap. To get these, it needs to generate all of the sub-sitemaps, and needs to check all the lastmod values of those sitemaps. That's impossible, so now we know why it's not working!
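
The cost described in point 2 can be seen in a small sketch. It assumes that each index entry's lastmod is the newest modification time among the items on that page; that definition, the function name, and the in-memory list are illustrative assumptions, not how the site is implemented.

```python
from datetime import datetime


def index_lastmods(items: list[datetime], limit: int) -> list[datetime]:
    """Return one lastmod per sitemap page: the newest timestamp on that page.

    `items` stands in for every object's date_modified, in sitemap order.
    """
    lastmods = []
    for start in range(0, len(items), limit):
        page = items[start:start + limit]  # every item on the page is read
        lastmods.append(max(page))
    return lastmods
```

Even in this toy form, producing the index touches every item in every section, which is the work the index request would have to do in the approach described above.
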

mlissner commented 2 months ago

There's an open issue from May of last year about our oral argument sitemap timing out: https://github.com/freelawproject/courtlistener/issues/2752. Seems to still be an issue, surprisingly.

I checked a few others: