GSA / site-scanning

The central repository for the Site Scanning program
https://digital.gov/site-scanning

analyze last modified header scan #1040

Closed gbinal closed 1 week ago

gbinal commented 1 week ago

Following up on #961, parse the data from the last modified header scan to learn how useful that field would be.

gbinal commented 1 week ago

Data is here.

My analysis is as follows:

1) On the first question:

For comparison...

2) On the utility of this data for inferring content freshness

Another way of looking at this, though, is to compare sites that have both the last modified header and the open graph article_modified tag in place. The overlap is small (227 sites), and the dates are not easy to compare directly, but when you sort chronologically by the open graph tag, virtually all of the sites with older open graph article_modified dates (2022 and earlier) have very recent (2024) last modified header dates. Though the sample is small, it leads me to question how reliably a recent header date indicates that the content on a site was recently updated.
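A rough sketch of how that comparison could be scripted, in case it is useful for rerunning later. The row keys `last_modified_header` and `og_article_modified` are hypothetical placeholders, not the actual column names in the scan data, and the year cutoffs simply mirror the ones mentioned above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_header(value):
    """Parse an HTTP-date string (Last-Modified style); None if unparseable."""
    try:
        dt = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    # Treat values without a usable offset as UTC.
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def parse_iso(value):
    """Parse an ISO 8601 string (assumed format of the open graph tag)."""
    try:
        dt = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        return None
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def compare(rows, og_cutoff_year=2022, header_recent_year=2024):
    """Count sites with both dates, those with an old open graph date,
    and those whose header is nonetheless recent."""
    both = old_og = old_og_recent_header = 0
    for row in rows:
        header = parse_header(row.get("last_modified_header"))  # hypothetical key
        og = parse_iso(row.get("og_article_modified"))  # hypothetical key
        if header is None or og is None:
            continue
        both += 1
        if og.year <= og_cutoff_year:
            old_og += 1
            if header.year >= header_recent_year:
                old_og_recent_header += 1
    return both, old_og, old_og_recent_header
```

If `old_og_recent_header` is close to `old_og`, that matches the pattern described above: a recent header date on content whose own metadata says it is old.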

That said, a very old last modified header would seem to indicate quite reliably that content has not been recently updated. For example, if the header has a date of 2015, I don't see how the content on that site could be more recent than that.
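The asymmetry here (an old header is a strong signal, a recent header is not) could be expressed as a small rule. This is only a sketch under my own assumptions: the function name, the labels, and the one-year threshold are arbitrary illustrations, not anything the scanner currently does.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def freshness_signal(last_modified, now=None, stale_after_days=365):
    """Classify a Last-Modified header value.

    'likely stale'  : header is older than the threshold
    'inconclusive'  : header is recent, which does not prove fresh content
    'unknown'       : header is missing or unparseable
    """
    try:
        dt = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return "unknown"
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    if now - dt > timedelta(days=stale_after_days):  # threshold is an assumption
        return "likely stale"
    return "inconclusive"
```

A header dated 2015 would come back as `likely stale`, while a 2024 header would only ever come back as `inconclusive`.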

gbinal commented 1 week ago

done