Closed: andylolz closed this issue 3 years ago.
(This is the history for https://github.com/codeforIATI/IATI-Stats-public/blob/gh-pages/current/aggregated-publisher/theglobalfund/most_recent_transaction_date.json).
The response looks like:

```json
{
  "data": {
    "repository": {
      "defaultBranchRef": {
        "target": {
          "history": {
            "nodes": [
              {
                "oid": "97f9cbd5a0ee5613fd154672fe8e45bae8cb15c2",
                "file": {
                  "object": {
                    "text": "\"2021-05-05\""
                  }
                },
                "committedDate": "2021-05-07T06:35:20Z"
              },
              {
                "oid": "3e079f76f8783ad0dcdb445063084c13d8d353bd",
                "file": {
                  "object": {
                    "text": "\"2021-05-04\""
                  }
                },
                "committedDate": "2021-05-06T06:07:49Z"
              },
              <snip>
            ]
          }
        }
      }
    }
  }
}
```
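For reference, a query of roughly this shape could produce that response. This is a sketch only: the field names are reconstructed from the response above (`pageInfo` is added for pagination and doesn't appear in it), so check it against the GitHub GraphQL schema before relying on it.

```python
def build_history_query(owner, repo, path, after=None):
    """Build a GraphQL query for the commit history of one file on the
    default branch, including the file's contents at each commit.
    Returns up to 100 commits per page; pass `after` (a cursor from
    pageInfo.endCursor) to fetch the next page."""
    after_arg = ', after: "%s"' % after if after else ""
    return """
query {
  repository(owner: "%s", name: "%s") {
    defaultBranchRef {
      target {
        ... on Commit {
          history(first: 100, path: "%s"%s) {
            pageInfo { hasNextPage endCursor }
            nodes {
              oid
              committedDate
              file(path: "%s") {
                object {
                  ... on Blob { text }
                }
              }
            }
          }
        }
      }
    }
  }
}
""" % (owner, repo, path, after_arg, path)
```

The query string would be POSTed to https://api.github.com/graphql with a bearer token; each request for a given file costs one unit against the rate limit.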
However, we would need to do this for every publisher (over 1,000 of them), and possibly up to 4 times each, since each request only returns 100 commits (and we need a year of data for the timeliness calculation). That works out to over 4,000 requests per run, against a rate limit of 5,000 per hour.
So I think we want to do this with a local copy of the git repo. We could still produce it from the git history of the IATI-Stats-public repository though. I could take a look at that next week.
Thanks so much for taking a look at this, @Bjwebb!
That’s a shame… But I don’t know, maybe it’s cleaner to do it locally after all. I’m going to have a little poke about now, just because I’m interested anyway and I don’t have a good understanding of the git database. I’ll post anything I discover here!
It looks like some combination of `git ls-tree <commit hash> <path>` (to get the blob hash) and `git cat-file blob <blob hash>` (to get the file contents at that commit) might do the trick? Something like that. It looks like it’s possible to use these with GitPython, so that might be a good way to go.
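A minimal sketch of that two-step lookup, using `subprocess` to call the same git plumbing directly (GitPython wraps the equivalent operations; the helper names here are illustrative):

```python
import subprocess

def git(repo_dir, *args):
    """Run a git command in `repo_dir` and return its stdout as text."""
    result = subprocess.run(
        ["git", *args], cwd=repo_dir,
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def file_at_commit(repo_dir, commit, path):
    """Return the contents of `path` as it was at `commit`.

    `git ls-tree <commit> <path>` prints one line of the form
    "<mode> blob <hash>    <path>"; field 3 is the blob hash, which
    `git cat-file blob <hash>` then dumps.
    """
    ls_tree = git(repo_dir, "ls-tree", commit, path)
    blob_hash = ls_tree.split()[2]
    return git(repo_dir, "cat-file", "blob", blob_hash)
```

Walking `git log --format=%H -- <path>` (or GitPython's `repo.iter_commits(paths=...)`) would give the commit hashes to feed into this.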
The best date to associate with each commit is probably `updated_at` in metadata.json.
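Assuming metadata.json is plain JSON with a top-level `updated_at` key, pulling that date out of the file text at a given commit is a one-liner (the sample value below is made up):

```python
import json

# Hypothetical contents of metadata.json at one commit; the real file
# lives in IATI-Stats-public, and this timestamp is invented.
metadata_text = '{"updated_at": "2021-05-07T06:30:00Z"}'

updated_at = json.loads(metadata_text)["updated_at"]
print(updated_at)  # 2021-05-07T06:30:00Z
```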
I wonder if it’s also worth attempting to retrospectively populate this repo with a download from http://dashboard.iatistandard.org/stats/. Do you think that seems doable and/or worth it?
I’ve had a go at this sort of approach in codeforIATI/dashboard#40.
A couple more things on this (sorry for all these updates!):

We could grab the historical `gitaggregate(-publisher)?-dated` stats from http://dashboard.iatistandard.org/stats/ and http://publishingstats.iatistandard.org/stats/, drop them in a new repo, and pull that in. Then we use the historic data until a certain point in time, and the git history for more recent stuff. That’s going to be much easier than retrospectively rewriting the git history of IATI-Stats.

@Bjwebb WDYT, does that make sense / seem reasonable?
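The cutover logic described above might look something like this (a sketch only; the function name and data shapes are hypothetical):

```python
from datetime import date

def merge_stats(historic, from_git, cutover):
    """Combine two {date: value} mappings: prefer the downloaded
    historic snapshots before `cutover`, and the git-derived values
    on or after it."""
    merged = {d: v for d, v in historic.items() if d < cutover}
    merged.update({d: v for d, v in from_git.items() if d >= cutover})
    return merged
```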
I’ve had a go at doing historical data stuff in https://github.com/codeforIATI/IATI-Stats-historical. I’ve also merged in `stats-blacklisted` stuff.
I’ve left all the gitaggregate stuff in the dashboard (now "analytics") repo for now. I guess that’s okay.
One problem so far is: generating timeliness stats seems really really slow! Would be great to try and speed that up.
I’ve also had a go at a dev version, which should appear here: https://analytics-dev.codeforiati.org
Looks good. I agree, having a historical repo like that makes sense. Putting this in the dashboard repo looks okay, especially since it's not written to disk (I think) so we don't have to worry about the files appearing in an odd place.
> One problem so far is: generating timeliness stats seems really really slow! Would be great to try and speed that up.
Aha, okay… The really really slow thing here is the git stuff I added! So I think we might need to keep a running total, and store it in the stats repo.
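A running total could be persisted alongside the hash of the last commit processed, so each run only folds in commits made since then (a sketch; the storage format and names are hypothetical):

```python
def update_running_total(stored, new_commits):
    """Fold newly-seen commits into a persisted aggregate.

    stored: {"last_oid": str | None, "totals": {key: number}}
    new_commits: [(oid, {key: number}), ...] in oldest-first order,
    containing only commits made after stored["last_oid"].
    """
    totals = dict(stored["totals"])
    last_oid = stored["last_oid"]
    for oid, stats in new_commits:
        for key, value in stats.items():
            totals[key] = totals.get(key, 0) + value
        last_oid = oid
    return {"last_oid": last_oid, "totals": totals}
```

The returned state would be written back into the stats repo so the next run can resume from `last_oid` instead of re-walking the whole history.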
This ticket should now be fixed by d3c60f16a1dddfe0f51d2f8736607ef6b65cdfe1.
I’ve pretty much put this back how it was originally! I modified gitaggregate and gitaggregate-publisher to run from the git history, but apart from that I think it works like the dashboard.
This does mean the IATI-Stats-public repo is much bigger… I don’t know if that is okay, really.
This is the last remaining bit of #1, so replaces that.
@Bjwebb is going to take a look at using the GitHub API for this.