Auto infer created & last modified dates of a dataset and its resources

olayway commented 1 month ago

Situation

Currently, the "Created" and "Updated" fields in a dataset's metadata come from the Data Package.

The same goes for the "Last modified" dates of resources.

Problem / Opportunity

It would be nice to infer these dates from the underlying GitHub repo, so that users don't need to update them manually after every change to their dataset.

Solution

Dataset "Created" date

For simple, standalone dataset sites (like all our core datasets) "Created" date of a dataset can be easily obtained from /repos/{owner}/{repo} GitHub API endpoint (created_at response field).
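A minimal sketch of that lookup, assuming an unauthenticated call to the public GitHub REST API is acceptable (rate limits apply). The helper names are illustrative and not taken from the datahub codebase:

```python
import json
from datetime import datetime
from urllib.request import Request, urlopen

GITHUB_API = "https://api.github.com"


def parse_created_at(repo_payload: dict) -> datetime:
    """Extract the creation time from a /repos/{owner}/{repo} response body."""
    # GitHub returns ISO 8601 timestamps in UTC, e.g. "2020-01-15T09:30:00Z".
    return datetime.fromisoformat(repo_payload["created_at"].replace("Z", "+00:00"))


def fetch_repo_created_at(owner: str, repo: str) -> datetime:
    """Fetch the repository metadata and return its creation time (network call)."""
    request = Request(
        f"{GITHUB_API}/repos/{owner}/{repo}",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(request) as response:
        return parse_created_at(json.load(response))
```

Parsing is kept separate from fetching so the date handling can be checked without hitting the network.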

Dataset "Updated" date

"Updated" date can be derived from the last commit date, which can be obtained from /repos/{owner}/{repo}/commits?per_page=1 (then get first commit from the returned array and use its date).

Resource "Last modified" date

This can also be obtained from /repos/{owner}/{repo}/commits?per_page=1 with an additional parameter, path=<resource-file-path>, which returns only the commits that changed the resource file.
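Since the dataset "Updated" date and the resource "Last modified" date use the same endpoint, one helper with an optional path could cover both. This is a sketch only: the function names are hypothetical, and using the committer date (rather than the author date) is an assumption, as the issue only says "its date".

```python
from datetime import datetime
from typing import Optional
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"


def commits_url(owner: str, repo: str, path: Optional[str] = None) -> str:
    """Build the URL that returns only the most recent commit.

    With `path` set, GitHub only returns commits that touched that file.
    """
    params = {"per_page": 1}
    if path:
        params["path"] = path
    return f"{GITHUB_API}/repos/{owner}/{repo}/commits?{urlencode(params)}"


def last_commit_date(commits: list) -> Optional[datetime]:
    """Return the date of the first (newest) commit in the response array."""
    if not commits:
        return None  # e.g. the path has never been committed
    # Assumption: committer date is the one we want to display.
    raw = commits[0]["commit"]["committer"]["date"]
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))
```

For example, `commits_url("owner", "repo")` targets the whole repository, while `commits_url("owner", "repo", "data.csv")` narrows the query to a single resource file.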

Complication

This is getting tricky for nested datasets though. Example:

README.md
/dataset-a
  README.md
  datapackage.json
  data.csv
/dataset-b
  README.md
  datapackage.json
  data.csv
...

In this case:

1. repository.created_at ≠ individual dataset's creation date
2. repo's last commit date ≠ individual dataset's last modification date

(No problem for resource last modification dates.)

ad 2.: This can be solved by not using the repo's last commit date and instead:

getting last modification dates of all resources (we're going to get them anyway for inferring resources "Last modified" dates)
getting last modification date of the README.md and the datapackage file,
using the latest date of all these.
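The three steps above could look roughly like this. It is a sketch under assumptions: every resource has a single repo-relative `path` string, and `last_modified` is a hypothetical callable that returns the last commit date for a file path (or None if unknown).

```python
from datetime import datetime
from typing import Callable, List, Optional


def tracked_paths(dataset_dir: str, datapackage: dict) -> List[str]:
    """List the files whose last commit dates decide the dataset "Updated" date."""
    folder = dataset_dir.strip("/")
    prefix = f"{folder}/" if folder else ""
    # Assumption: each resource declares one local, relative `path`.
    paths = [prefix + r["path"] for r in datapackage.get("resources", []) if "path" in r]
    paths += [prefix + "README.md", prefix + "datapackage.json"]
    return paths


def dataset_updated(
    paths: List[str],
    last_modified: Callable[[str], Optional[datetime]],
) -> Optional[datetime]:
    """Return the latest known modification date among the given files."""
    dates = [d for d in (last_modified(p) for p in paths) if d is not None]
    return max(dates) if dates else None
```

Because the per-resource dates are fetched anyway for the Files table, only the README.md and datapackage lookups are extra work here.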

Or, if we want to be super precise, we could traverse the whole dataset folder (using the gh tree which we already have at our disposal), get the last commit of each file (including any other markdown files, scripts etc.) and use the latest date of all.
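A sketch of this more precise variant, assuming the tree is shaped like the `tree` array returned by GitHub's Git Trees API (entries with `path` and `type`). The shape of the existing tree helper in the codebase is not shown in this issue, and `last_modified` is again a hypothetical per-file lookup.

```python
from datetime import datetime
from typing import Callable, List, Optional


def files_under(tree: List[dict], folder: str) -> List[str]:
    """Return the paths of all files (blobs) inside the given dataset folder."""
    prefix = folder.strip("/") + "/"
    return [
        entry["path"]
        for entry in tree
        if entry["type"] == "blob" and entry["path"].startswith(prefix)
    ]


def latest_change(
    tree: List[dict],
    folder: str,
    last_modified: Callable[[str], Optional[datetime]],
) -> Optional[datetime]:
    """Return the most recent last-commit date across every file in the folder."""
    dates = [last_modified(path) for path in files_under(tree, folder)]
    dates = [d for d in dates if d is not None]
    return max(dates) if dates else None
```

Note that this needs one commits lookup per file, so the number of API requests grows with the size of the dataset folder.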

ad 1.: Since creation date can't be easily retrieved from GH for individual files, we can't apply the same trick as above. But dataset creation date is only set once and never changes, so it's not a big deal to have to set it manually. Also, we could just use the repository creation date by default as nested datasets/dataset collections are probably going to be a minority. Or we could just infer it for our core datasets only.

olayway commented 1 month ago

As we discussed, in cases where we did review a given dataset but no changes were needed, we also want to show this in the UI so that people know that we are regularly checking the dataset for changes and that they get the latest data. In this case we can either:

use the existing (in most cases) datapackage.created field and update it on each review, regardless of whether any changes were actually made,
use a new field, e.g. datapackage.reviewed field and leave datapackage.created unchanged.

anuveyatsu commented 1 month ago

@olayway We already have:

datapackage.created -> display as created date on the datahub (if not provided then empty)
datapackage.modified -> display as updated date on the datahub (if not provided use latest commit date for any of the resources)
~~resource.created -> displayed on files section~~
resource.modified -> displayed on files section (if not provided use latest commit date for the resource)
datapackage.synced -> updates when the content of any resource changes (e.g. a new record is added), but it should be provided explicitly, so the data curator must update this info in the datapackage
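The fallback rules in this list can be sketched as a single resolution step. The function name and the `commit_dates` mapping (resource path to last commit date) are hypothetical. All dates are assumed to be ISO 8601 UTC strings in the same format, so that comparing them as strings matches chronological order.

```python
from typing import Dict, Optional


def resolve_dates(datapackage: dict, commit_dates: Dict[str, Optional[str]]) -> dict:
    """Apply "explicit value first, inferred commit date second" to each field."""
    resources = []
    for resource in datapackage.get("resources", []):
        inferred = commit_dates.get(resource.get("path"))
        # resource.modified: explicit value wins, else the resource's last commit date.
        resources.append({**resource, "modified": resource.get("modified") or inferred})

    known = [date for date in commit_dates.values() if date]
    return {
        # datapackage.created: stays empty if not provided.
        "created": datapackage.get("created"),
        # datapackage.modified: explicit value wins, else the latest resource commit date.
        "modified": datapackage.get("modified") or (max(known) if known else None),
        "resources": resources,
    }
```

datapackage.synced is left out on purpose, since it is meant to be set explicitly by the data curator rather than inferred.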

olayway commented 1 month ago

I've implemented the first version of it, but it's not ideal and so for now it's only enabled for our core datasets.

How it works atm:

last commit date is pulled directly from GitHub for each resource when building a dataset page
this last commit date is displayed as "Last modified" date for each resource in the Files table (only if not explicitly specified in resource.lastModified in datapackage)
the latest "Last modified" date of all the resources is also used as the "Updated" date of the whole dataset and displayed in the top metadata table (only if not explicitly specified in datapackage.updated)

Why it's not perfect:

atm we don't have an index of all the site's pages in a separate db table where we could store files creation/modification dates (and other metadata) and update them on every sync
this means that if a change has been made to e.g. the data.csv file but the site has not been synced, the dataset page will show the date of that latest change, while the file previewed, plotted and available for download will be outdated, as it comes from R2, not directly from GH
fetching commits on the fly to get the last modification date directly from GitHub requires an access token, and we can't use users' access tokens for that.

This is why currently I have only enabled this feature for our core datasets as we are sure they have auto-sync turned on and we can use our PAT for that. In the future, once we have files index table in our db, all the files from users sites will be stored along with their metadata there, updated on site syncs and readily available for display.

cc: @anuveyatsu

olayway commented 1 month ago

FIXED For now, given the current architecture constraints, enabled only for our core sites
