datopian / datahub

🌀 Rapidly build rich data portals using a modern frontend framework
https://www.portaljs.org
MIT License

Auto infer created & last modified dates of a dataset and its resources #1333

Closed by olayway 1 month ago

olayway commented 1 month ago

Situation

Currently, the "Created" and "Updated" fields in a dataset's metadata come from the Data Package.

The same goes for the "Last modified" dates of resources.

Problem / Opportunity

It would be nice to infer them for users from the underlying GitHub repo, so that they don't need to update them manually after every change to a dataset.

Solution

Dataset "Created" date

For simple, standalone dataset sites (like all our core datasets), the "Created" date of a dataset can be obtained from the GET /repos/{owner}/{repo} GitHub API endpoint (the created_at response field).
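A minimal sketch of that lookup, assuming the public api.github.com host and a runtime with a global `fetch`. The function names (`repoApiUrl`, `createdDateFromRepo`, `fetchDatasetCreated`) and the optional `token` parameter are hypothetical and not taken from the codebase.

```typescript
// Only the field we read from GET /repos/{owner}/{repo}.
interface RepoResponse {
  created_at: string; // ISO 8601 timestamp
}

// Build the REST URL for a repository (public api.github.com assumed).
function repoApiUrl(owner: string, repo: string): string {
  return `https://api.github.com/repos/${encodeURIComponent(owner)}/${encodeURIComponent(repo)}`;
}

// Pure helper: turn the response body into the dataset "Created" date.
function createdDateFromRepo(body: RepoResponse): Date {
  const date = new Date(body.created_at);
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Unexpected created_at value: ${body.created_at}`);
  }
  return date;
}

// Hypothetical wrapper; `token` is optional and only raises the rate limit.
async function fetchDatasetCreated(
  owner: string,
  repo: string,
  token?: string,
): Promise<Date> {
  const headers: Record<string, string> = { Accept: "application/vnd.github+json" };
  if (token) headers.Authorization = `Bearer ${token}`;
  const res = await fetch(repoApiUrl(owner, repo), { headers });
  if (!res.ok) throw new Error(`GitHub API responded with ${res.status}`);
  return createdDateFromRepo((await res.json()) as RepoResponse);
}
```

Keeping the parsing in a pure helper means it can be checked without network access.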

Dataset "Updated" date

The "Updated" date can be derived from the date of the last commit, which can be obtained from /repos/{owner}/{repo}/commits?per_page=1 (take the first commit in the returned array and use its date).

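The last-commit lookup could be sketched roughly as follows. This assumes the commits endpoint lists the newest commit first and that the committer date is the one to show (falling back to the author date); that choice, and the names `lastCommitDate` and `fetchDatasetUpdated`, are assumptions made for illustration.

```typescript
// Subset of a commit object returned by GET /repos/{owner}/{repo}/commits.
interface CommitItem {
  commit: {
    author: { date: string } | null;
    committer: { date: string } | null;
  };
}

// Pure helper: date of the first (newest) commit, or null for an empty list.
function lastCommitDate(commits: CommitItem[]): Date | null {
  const first = commits[0];
  const iso = first?.commit.committer?.date ?? first?.commit.author?.date;
  return iso ? new Date(iso) : null;
}

// Hypothetical wrapper around the endpoint; unauthenticated for brevity.
async function fetchDatasetUpdated(owner: string, repo: string): Promise<Date | null> {
  const url = `https://api.github.com/repos/${encodeURIComponent(owner)}/${encodeURIComponent(repo)}/commits?per_page=1`;
  const res = await fetch(url, { headers: { Accept: "application/vnd.github+json" } });
  if (!res.ok) throw new Error(`GitHub API responded with ${res.status}`);
  return lastCommitDate((await res.json()) as CommitItem[]);
}
```
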
Resource "Last modified" date

This can also be obtained from /repos/{owner}/{repo}/commits?per_page=1 with an additional parameter, path=<resource-file-path>, which restricts the results to commits that changed the resource file.
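As a rough illustration, the per-resource request only differs in its query string. The helper name `resourceCommitsUrl` and the example path are hypothetical; `URLSearchParams` takes care of encoding the slash in the path.

```typescript
// Build the commits URL limited to a single resource file.
function resourceCommitsUrl(owner: string, repo: string, resourcePath: string): string {
  const params = new URLSearchParams({ path: resourcePath, per_page: "1" });
  return `https://api.github.com/repos/${encodeURIComponent(owner)}/${encodeURIComponent(repo)}/commits?${params.toString()}`;
}

// Hypothetical usage: the first item of the response is the last commit
// that touched the file, so its date is the resource's "Last modified" date.
const exampleUrl = resourceCommitsUrl("some-owner", "some-repo", "dataset-a/data.csv");
```
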

Complication

This gets tricky for nested datasets, though. Example:

README.md
/dataset-a
  README.md
  datapackage.json
  data.csv
/dataset-b
  README.md
  datapackage.json
  data.csv
...

In this case:

  1. repository.created_at ≠ individual dataset's creation date
  2. repo's last commit date ≠ individual dataset's last modification date

(No problem for resource last modification dates.)

ad 2.: This can be solved by not using the repo's last commit date and instead:

Or, if we want to be super precise, we could traverse the whole dataset folder (using the gh tree, which we already have at our disposal), get the last commit of each file (including any other markdown files, scripts, etc.) and use the most recent date of all.
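The "super precise" variant might look something like the sketch below. It assumes the list of repo file paths is already available (for example from the tree mentioned above) and that a per-file lookup is passed in as a callback; `filesInFolder`, `mostRecent`, `datasetUpdated` and `lastCommitDateOf` are hypothetical names.

```typescript
// Keep only the files that live inside the dataset folder.
function filesInFolder(allPaths: string[], folder: string): string[] {
  const prefix = folder.replace(/^\/+|\/+$/g, "") + "/";
  return allPaths.filter((p) => p.startsWith(prefix));
}

// Pick the most recent of the given dates, ignoring files with no commits.
function mostRecent(dates: (Date | null)[]): Date | null {
  let latest: Date | null = null;
  for (const d of dates) {
    if (d && (!latest || d.getTime() > latest.getTime())) latest = d;
  }
  return latest;
}

// Look up the last commit date of every file in the folder and take the
// newest one. `lastCommitDateOf` is an injected, hypothetical lookup
// (e.g. a commits request with path=<file> and per_page=1).
async function datasetUpdated(
  allPaths: string[],
  folder: string,
  lastCommitDateOf: (path: string) => Promise<Date | null>,
): Promise<Date | null> {
  const dates = await Promise.all(filesInFolder(allPaths, folder).map(lastCommitDateOf));
  return mostRecent(dates);
}
```

Note that this issues one request per file, so large dataset folders would consume the API rate limit quickly.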

ad 1.: Since a creation date can't easily be retrieved from GitHub for individual files, we can't apply the same trick as above. But a dataset's creation date is set only once and never changes, so having to set it manually is not a big deal. Also, we could just use the repository creation date by default, as nested datasets/dataset collections are probably going to be a minority. Or we could infer it for our core datasets only.

olayway commented 1 month ago

As we discussed, in cases where we did review a given dataset but no changes were needed, we also want to show this in the UI so that people know that we are regularly checking the dataset for changes and that they get the latest data. In this case we can either:

anuveyatsu commented 1 month ago

@olayway We already have:

olayway commented 1 month ago

I've implemented the first version of it, but it's not ideal and so for now it's only enabled for our core datasets.

How it works atm:

Why it's not perfect:

This is why I have currently enabled this feature only for our core datasets: we are sure they have auto-sync turned on, and we can use our PAT for that. In the future, once we have a files index table in our db, all the files from users' sites will be stored there along with their metadata, updated on site syncs and readily available for display.

cc: @anuveyatsu

olayway commented 1 month ago

FIXED. For now, given the current architecture constraints, this is enabled only for our core sites.