olayway closed this 1 month ago
As we discussed, in cases where we reviewed a given dataset but no changes were needed, we also want to show this in the UI, so that people know we regularly check the dataset for changes and that they are getting the latest data. In this case we can either:
- use the datapackage.created field and update it on each review, regardless of whether any changes were actually made, or
- add a datapackage.reviewed field and leave datapackage.created unchanged.

@olayway We already have:
I've implemented the first version of it, but it's not ideal and so for now it's only enabled for our core datasets.
How it works atm:
- … (resource.lastModified in datapackage)
- … (datapackage.updated)

Why it's not perfect:
This is why I have currently enabled this feature only for our core datasets: we are sure they have auto-sync turned on, and we can use our PAT for that. In the future, once we have a files index table in our db, all files from users' sites will be stored there along with their metadata, updated on site syncs, and readily available for display.
cc: @anuveyatsu
FIXED: for now, given the current architecture constraints, this is enabled only for our core sites.
Situation
Currently "Created" and "Updated" fields in a dataset's metadata come from the Data Package:
Same goes for "Last modified" dates of resources:
Problem / Opportunity
It would be nice to infer these dates for users from the underlying GitHub repo, so that they don't need to update them manually after every change to a dataset.
Solution
Dataset "Created" date
For simple, standalone dataset sites (like all our core datasets), the "Created" date of a dataset can be easily obtained from the /repos/{owner}/{repo} GitHub API endpoint (created_at response field).

Dataset "Updated" date
The "Updated" date can be derived from the last commit date, which can be obtained from /repos/{owner}/{repo}/commits?per_page=1 (take the first commit from the returned array and use its date).

Resource "Last modified" date
This can also be obtained from /repos/{owner}/{repo}/commits?per_page=1 with an additional parameter, path=<resource-file-path>, which returns only commits that changed the resource file.

Complication
This is getting tricky for nested datasets though. Example:
In this case:
1. repository.created_at ≠ individual dataset's creation date

(No problem for resource last modification dates.)
ad 2.: This can be solved by not using the repository.created_at date and instead:

Or, if we want to be super precise, we could traverse the whole dataset folder (using gh tree, which we already have at our disposal), get the last commit of each file (including any other markdown files, scripts, etc.) and use the youngest date of all.
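The "super precise" variant could look roughly like the sketch below. It assumes the per-file last commit dates have already been collected (for example via the tree plus one commits lookup per file) into a plain mapping of repo-relative path to ISO 8601 timestamp. The function names and the shape of that mapping are hypothetical, not taken from the existing code.

```python
from datetime import datetime
from typing import Dict, Optional


def parse_github_date(value: str) -> datetime:
    """Parse a timestamp in the form GitHub returns, e.g. '2024-03-02T09:30:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def dataset_updated_date(
    last_commit_dates: Dict[str, str], dataset_dir: str
) -> Optional[datetime]:
    """Return the youngest last-commit date among files under dataset_dir.

    last_commit_dates maps repo-relative file paths to ISO 8601 strings
    (hypothetical input shape). Returns None if no file matches.
    """
    folder = dataset_dir.strip("/")
    # An empty folder means the dataset lives at the repo root: match everything.
    prefix = folder + "/" if folder else ""
    dates = [
        parse_github_date(date)
        for path, date in last_commit_dates.items()
        if path.startswith(prefix)
    ]
    return max(dates, default=None)
```

The trailing slash on the prefix keeps a folder such as datasets/a from also matching datasets/ab. The cost is one commits request per file, which is presumably why this is only the "super precise" option.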
ad 1.: Since the creation date can't be easily retrieved from GH for individual files, we can't apply the same trick as above. But a dataset's creation date is only set once and never changes, so it's not a big deal to have to set it manually. Also, we could just use the repository creation date by default, as nested datasets/dataset collections are probably going to be a minority. Or we could just infer it for our core datasets only.
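For reference, the three lookups described under Solution could be wired up roughly as follows. This is a minimal sketch, not the actual implementation: the helper names are hypothetical, the token is optional, and reading the date from commit.committer.date (rather than commit.author.date) is an assumption, since the text above only says to use the commit's date.

```python
import json
from typing import Any, Callable, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"


def fetch_json(url: str, token: Optional[str] = None) -> Any:
    """GET a GitHub API URL and decode the JSON response body."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    with urlopen(Request(url, headers=headers)) as response:
        return json.load(response)


def repo_url(owner: str, repo: str) -> str:
    return f"{API_ROOT}/repos/{owner}/{repo}"


def last_commit_url(owner: str, repo: str, path: Optional[str] = None) -> str:
    """Commits URL limited to one result, optionally filtered by file path."""
    params = {"per_page": 1}
    if path:
        params["path"] = path
    return f"{repo_url(owner, repo)}/commits?{urlencode(params)}"


def created_date(
    owner: str, repo: str, fetch: Callable[[str], Any] = fetch_json
) -> str:
    """Dataset "Created" date: the repository's created_at field."""
    return fetch(repo_url(owner, repo))["created_at"]


def last_commit_date(
    owner: str,
    repo: str,
    path: Optional[str] = None,
    fetch: Callable[[str], Any] = fetch_json,
) -> Optional[str]:
    """Dataset "Updated" date (no path) or resource "Last modified" date (with path)."""
    commits = fetch(last_commit_url(owner, repo, path))
    if not commits:
        return None  # no commit touched this path
    # Assumption: use the committer date of the newest commit.
    return commits[0]["commit"]["committer"]["date"]
```

The fetch argument is injectable so the URL building and response handling can be exercised without network access; in production it would default to fetch_json, possibly wrapped to pass the PAT.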