GSA / digitalgov.gov

Digital.gov: Better websites. Better government.
https://digital.gov

Remove lastmod tag in Sitemap for reindexing #7801

Closed klin2020 closed 1 month ago

klin2020 commented 1 month ago

Summary

Searching DG currently returns outdated articles that bury more recent ones.

Search.gov search results are based on a ranking algorithm that looks at the <lastmod> tag in a website's sitemap or a page's date, whichever is more recent. Our sitemap currently sets the <lastmod> tag to the current date on every build, which leads the ranking algorithm to weigh every page on DG equally rather than by its proper publish date.
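To make the problem concrete, here is a minimal sketch of the "whichever is more recent" rule described above. It is not Search.gov's code; the function name, page paths, and dates are hypothetical and only illustrate why a build-date stamp in the sitemap flattens every page to the same date.

```python
from datetime import date


def effective_date(sitemap_lastmod, page_date):
    """Return the more recent of the two dates, ignoring missing values."""
    candidates = [d for d in (sitemap_lastmod, page_date) if d is not None]
    return max(candidates) if candidates else None


# Hypothetical pages: (path, publish date taken from the page metadata)
pages = [
    ("/2019/old-post/", date(2019, 3, 1)),
    ("/2024/new-post/", date(2024, 7, 1)),
]

# Assumption: the sitemap stamps every entry with the day of the build.
build_day = date(2024, 7, 22)

with_lastmod = {path: effective_date(build_day, published) for path, published in pages}
without_lastmod = {path: effective_date(None, published) for path, published in pages}

print(with_lastmod)     # both pages resolve to the build date
print(without_lastmod)  # each page resolves to its own publish date
```

With the build-date stamp in place, the old and new posts are indistinguishable by date; without it, the page metadata decides.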

Solution

Remove the <lastmod> tag in the DG sitemap build, so that when we re-index DG, the re-index uses the page metadata for each page's proper date.

Once the re-index occurs, we can add the <lastmod> tag back so that it reflects the page's publish date rather than the current date.

Screenshots

Current sitemap (including <lastmod>). Every entry shows the same date, which causes issues with the ranking algorithm.

Screenshot 2024-07-22 at 1 14 46 PM

Proposed change to sitemap (temporarily remove <lastmod> for Search.gov re-indexing)

Screenshot 2024-07-22 at 1 13 38 PM
github-actions[bot] commented 1 month ago

:mag: Preview in Federalist

RileySeaburg commented 1 month ago

@klin2020 Thanks for the detailed explanation. Some questions:

  1. Which meta fields will search.gov use when we re-index after we remove the lastmod field?

  2. We don't set a lastmod field in the markdown for each page. Is this something we should consider adding in the future if we want to improve our sitemap?

  3. Could we use the .Params.date field as the next best option for setting the lastmod field?


<lastmod>{{ safeHTML ( .Params.date.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}

I was wondering this as well. I was expecting the last modified date to default to the date published.

klin2020 commented 1 month ago

Hi @nick-mon1 @RileySeaburg

  1. The search.gov reindexing will look at the date meta tag that is in every page on DG. Refer to the "Freshness" section of this article.
  2. Since we very rarely update our websites, we don't necessarily need to add it in the future, but it is something we can consider. The lastmod field that the search indexing looks at is the one in the sitemap, so if we add it back into our sitemap, we should just set it from .Params.date.
  3. Yes, I agree. It may be best to add this back into our sitemap after the re-indexing, just to ensure that the re-indexing will only look at the meta tag property. Let me send you some further information on this.
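
For illustration only, here is a rough Python sketch of what "set lastmod from the publish date" means for the generated sitemap. The real change would live in the Hugo sitemap template; `build_sitemap` and the example URL are hypothetical names used just for this sketch.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, publish datetime) pairs and return it as a string."""
    ET.register_namespace("", SITEMAP_NS)  # emit the sitemap namespace as the default
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, published in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        # Use the page's publish date, not the build time, as <lastmod>.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = published.isoformat()
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical page published in March 2019, US Eastern standard time.
eastern = timezone(timedelta(hours=-5))
xml = build_sitemap([("https://example.gov/2019/old-post/", datetime(2019, 3, 1, 9, 0, tzinfo=eastern))])
print(xml)
```

The timestamp layout produced by `isoformat()` on a timezone-aware datetime matches the `2006-01-02T15:04:05-07:00` layout used in the template snippet above.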
RileySeaburg commented 1 month ago

@klin2020

To be clear, I'm not sure we need to remove <lastmod> to have the site reindexed.

We prefer documents that are fresh. Anything published or updated in the past 30 days is considered fresh. After that, we use a Gaussian decay function to demote documents, so that the older a document is, the more it is demoted. When documents are 5 years old or older, we consider them to be equally old and do not demote further. We use either the article:modified_time on an individual page, or that page’s <lastmod> date from the sitemap, whichever is more recent. If there is only an article:published_time for a given page, we use that date.
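
A rough sketch of how a freshness rule like the one quoted above could behave. Only the 30-day fresh window and the 5-year cap come from the quoted text. `SCALE_DAYS` and `DECAY` are assumptions chosen for illustration, and the formula is a generic Gaussian decay, not necessarily the one Search.gov runs.

```python
import math

FRESH_DAYS = 30          # from the quoted text: no demotion inside this window
MAX_AGE_DAYS = 5 * 365   # from the quoted text: older documents are treated alike
SCALE_DAYS = 365         # ASSUMPTION: days past the fresh window at which score == DECAY
DECAY = 0.5              # ASSUMPTION: score reached SCALE_DAYS past the fresh window


def freshness_score(age_days):
    """Gaussian decay on document age in days; 1.0 means no demotion."""
    age = min(max(age_days, 0), MAX_AGE_DAYS)  # clamp so 5+ year old pages score alike
    excess = max(0, age - FRESH_DAYS)          # fresh documents have no excess age
    sigma_sq = -SCALE_DAYS ** 2 / (2 * math.log(DECAY))
    return math.exp(-excess ** 2 / (2 * sigma_sq))


for days in (10, 30, 395, 5 * 365, 10 * 365):
    print(days, freshness_score(days))
```

Under these assumed parameters, a page stamped with today's date always scores as fresh, which is why a sitemap that stamps every page with the build date removes any date-based difference between pages.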

Unless I'm misunderstanding something, updating the <lastmod> tags to reflect the content publish date, and then requesting a reindex should fix this issue.

Please explain the proper procedure if I am incorrect.

If I am not, please update the <lastmod> tag.

klin2020 commented 1 month ago

@nick-mon1 @RileySeaburg Re-introduced lastmod tag with page date.

klin2020 commented 1 month ago

Removed lastmod tag for review again @nick-mon1 @RileySeaburg

RileySeaburg commented 1 month ago

I'm going to merge this so we can test the re-index today.

@mejiaj there will be another PR where the tag is added back in.