HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0
622 stars 182 forks source link

Autogenerate timestamps #317

Closed tunetheweb closed 4 years ago

tunetheweb commented 5 years ago

As part of https://github.com/HTTPArchive/almanac.httparchive.org/pull/302 and https://github.com/HTTPArchive/almanac.httparchive.org/pull/308 we added Article Structured Data which depends on having a Published Date (required or it fails the Google Structured Data Testing Tool) and a Last Updated Date (recommended by the Google Structured Data Testing Tool). These have been implemented in the chapter markdown files as meta data. For example:

---
part_number: IV
chapter_number: 20
title: HTTP/2
description: HTTP/2 chapter of the 2019 Web Almanac covering adoption and impact of HTTP/2, HTTP/2 Push, HTTP/2 Issues, and HTTP/3
authors: [bazzadp]
reviewers: [bagder, rmarx, dotjs]
published: 2019-11-04T12:00:00+00:00:00
last_updated: 2019-11-04T12:00:00+00:00:00 
---

This allows for different timestamps for different publication dates and also for different timestamps for the translations which might be published at different dates - hence why I put it in the markdown.

At present these dates must be manually maintained which @rviscomi and I have been discussing.

I think that makes sense for the Published Date as this should be set once and forgotten about, but for the last_updated this is more error prone and people may forget to update them when changing the contents.

It should not be generated as part npm run generate as the contents my not have changed for every chapter (or even any chapter!) when it is run, and also generic formatting changes, which are not specific to this chapter (e.g. when base_chapter.html or chapter.html is updated) probably don't count as a "real edit".

Requirements:

The format currently has to be as per above example, however this will be converted as part of npm run generate as per https://github.com/HTTPArchive/almanac.httparchive.org/pull/345 so less of a concern (and we should use an easier format for authors!).

@mikegeyser any thoughts on this?

mikegeyser commented 5 years ago

How about adding a pre-commit hook that updates the last_updated in the markdown files? Another option (perhaps simpler) is to use git itself as a part of npm run generate, something like this:

>> git log -1 --pretty="format:%ci" content/en/2019/performance.md | cat
2019-11-05 22:33:38 -0800%

We can do something dodgy, like call out to the git cli from node?

tunetheweb commented 5 years ago

I'm happy either way.

I'm not that familiar with git, pre-commit hooks, the cli...etc. so think it would be good if you could take this on @mikegeyser ?

Good thing is that as part of https://github.com/HTTPArchive/almanac.httparchive.org/pull/345 we will no longer just dump this date into the files as is, and we use JavaScript to read and format the date, so that gives us more options of converting the date to the format we need from that format that git spits out.

Is that date specific to the repo? Or will the value (and format) change depending on the local timezone of whoever runs npm run generate? Hopefully it's repo-specific so it'll be the same for everyone.

mikegeyser commented 5 years ago

The git commit has timezone information as a part of it.

mikegeyser commented 5 years ago

Are we ok with the standard toISOString() timestamps? So 2019-11-04T18:46:53.000Z instead of 2019-11-04T18:46:53+00:00:00.

tunetheweb commented 5 years ago

Are we ok with the standard toISOString() timestamps? So 2019-11-04T18:46:53.000Z instead of 2019-11-04T18:46:53+00:00:00.

I think so. https://developers.google.com/search/docs/data-types/article#non-amp-sd has this to say:

The date and time the article was first published, in ISO 8601 format.

And the wikipedia link says both are fine.

For the ultimate test, I also tested it in the Structured Data Testing Tool (you can load a URL like my http2 chapter and then edit it) and it accepted both formats!

rviscomi commented 5 years ago

Reopening to add one feature request. I just ran npm run generate for https://github.com/HTTPArchive/almanac.httparchive.org/pull/364/files and was surprised to see unrelated chapters getting their last modified updated. Would it be possible to only update the timestamp on chapters that have been changed? A few reasons:

tunetheweb commented 5 years ago

Was just looking at this. Suspect it's in part from other branches and other forks. As we update the .md files with each generation, they become part of the commit. Need a better way of doing this.

tunetheweb commented 5 years ago

I think we should revert this commit for now.

Or at the very least comment out this line for now: https://github.com/HTTPArchive/almanac.httparchive.org/blob/a2f7576456e381045f336085286f61d89c234324/src/tools/generate/index.js#L1

mikegeyser commented 5 years ago

I'm so sorry, I could have sworn it was working correctly. :(

mikegeyser commented 5 years ago

Oh, I see where the flawed logic is now. :( Updating the file, even just to update the timestamp, will mean that becomes the latest commit - updating the timestamp. This approach could never work if we change source files, but could possibly work if we only use it to generate the timestamps in the html.

Again, I'm really sorry for the mistake.

tunetheweb commented 5 years ago

I really think the way to do this is with a Git Action (that I'm not familiar with).

On merge to master:

The CI/CD dream perfect for automating future edits/corrections.

But not one for launch.

tunetheweb commented 5 years ago

BTW if you change this line:

https://github.com/HTTPArchive/almanac.httparchive.org/blob/774e963a321029f675d6d6a1bd314e4c956e9577/src/tools/generate/generate_last_updated.js#L27

From this:

const command = `git log -1 --date=iso-strict-local ${path} | cat`;

To this:

const command = `git log -1 --date=iso-strict-local master ${path} | cat`;

Then at least it gets the master commit time rather than local branch commit time.

Three issues with this:

  1. It won't work from forks with a merge request - you must run npm run generate from this repo.
  2. Need to ensure your local master is up to date.
  3. You need to run npm run generate after all commits.

So not a great solution either but thought I'd throw it out there in case want to do this short term in next week or two.

tunetheweb commented 4 years ago

@ethomson as you've been helpful on that other issue, have you any thoughts on how best to tackle this one (it was mentioned in that other issue you responded on, but it has much bigger scope that this little one).

To summarise above, we basically want to automatically update a timestamp in a file when a pull request is accepted for that file. The timestamp should be set to the date that the pull request was merged (or perhaps when submitted if that's easier) - but it should only be changed on files changed in that PR, and btw not all files have that timestamp meta data. Let me know if that doesn't make sense and can provide more details with examples.

tunetheweb commented 4 years ago

Ignore this @ethomson. Thanks for you help on the other thread! will have a play.