codeforIATI / IATI-Stats

Python application for generating JSON stats files from IATI data
https://stats.codeforiati.org

Get timeliness stats working #12

Closed andylolz closed 3 years ago

andylolz commented 3 years ago

This is the last remaining bit of #1, so this issue replaces that one.

@Bjwebb is going to take a look at using the GitHub API for this.

Bjwebb commented 3 years ago

I've found a GraphQL query for this: https://docs.github.com/en/graphql/overview/explorer?query={repository(owner:%20%22codeforIATI%22,%20name:%20%22IATI-Stats-public%22)%20{defaultBranchRef%20{target%20{...%20on%20Commit%20{history(path:%20%22current/aggregated-publisher/theglobalfund/most_recent_transaction_date.json%22)%20{nodes%20{oid%20file(path:%20%22current/aggregated-publisher/theglobalfund/most_recent_transaction_date.json%22)%20{object%20{...%20on%20Blob%20{text}}}committedDate}}}}}}}

(This is the history for https://github.com/codeforIATI/IATI-Stats-public/blob/gh-pages/current/aggregated-publisher/theglobalfund/most_recent_transaction_date.json).

The response looks like:

{
  "data": {
    "repository": {
      "defaultBranchRef": {
        "target": {
          "history": {
            "nodes": [
              {
                "oid": "97f9cbd5a0ee5613fd154672fe8e45bae8cb15c2",
                "file": {
                  "object": {
                    "text": "\"2021-05-05\""
                  }
                },
                "committedDate": "2021-05-07T06:35:20Z"
              },
              {
                "oid": "3e079f76f8783ad0dcdb445063084c13d8d353bd",
                "file": {
                  "object": {
                    "text": "\"2021-05-04\""
                  }
                },
                "committedDate": "2021-05-06T06:07:49Z"
              },
              <snip>
            ]
          }
        }
      }
    }
  }
}
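For anyone wanting to try this, here is a minimal Python sketch of the query from the explorer link (URL-decoded and parameterised on the file path) plus a helper that flattens a response shaped like the one above into (commit date, value) pairs. The names QUERY_TEMPLATE, build_query and extract_history are made up for illustration, and sending the query to the API (which needs authentication) is left out.

```python
import json

# The query from the explorer link, URL-decoded, with the file path
# turned into a placeholder. Owner and repo name are as in the link.
QUERY_TEMPLATE = """
{
  repository(owner: "codeforIATI", name: "IATI-Stats-public") {
    defaultBranchRef {
      target {
        ... on Commit {
          history(path: "%(path)s") {
            nodes {
              oid
              file(path: "%(path)s") {
                object { ... on Blob { text } }
              }
              committedDate
            }
          }
        }
      }
    }
  }
}
"""


def build_query(path):
    """Fill the file path into the query (hypothetical helper).

    Assumes the path contains no double quotes or percent signs.
    """
    return QUERY_TEMPLATE % {"path": path}


def extract_history(response):
    """Return [(committedDate, parsed file contents), ...] from a response."""
    target = response["data"]["repository"]["defaultBranchRef"]["target"]
    pairs = []
    for node in target["history"]["nodes"]:
        file_entry = node.get("file")
        # Skip commits where the file or its blob is missing.
        if not file_entry or not file_entry.get("object"):
            continue
        # The blob text is itself JSON, e.g. "\"2021-05-05\"".
        pairs.append((node["committedDate"], json.loads(file_entry["object"]["text"])))
    return pairs
```

Each stats file holds a JSON value, so the blob text is decoded with json.loads rather than used raw.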

However, we would need to do this for every publisher (>1000), and possibly up to 4 times each, as it only returns 100 commits per request (and we need a year of data for the timeliness calculation). That is upwards of 4000 requests for this one stat alone, which leaves little headroom under the rate limit of 5000 per hour.

So I think we want to do this with a local copy of the git repo. We could still produce it from the git history of the IATI-Stats-public repository though. I could take a look at that next week.

andylolz commented 3 years ago

Thanks so much for taking a look at this, @Bjwebb!

That’s a shame… But I don’t know, maybe it’s cleaner to do it locally after all. I’m going to have a little poke about now, just because I’m interested anyway and I don’t have a good understanding of the git database. I’ll post anything I discover here!

andylolz commented 3 years ago

It looks like some combination of git ls-tree <commit hash> <path> (to get the blob hash) and git cat-file blob <blob hash> (to get the file contents at that commit) might do the trick? Something like that. It looks like it’s possible to use these with GitPython, so that might be a good way to go.
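A rough sketch of that two-step lookup, shelling out to git via subprocess rather than going through GitPython. The function names and the repo_dir argument are invented for illustration; it assumes git is on the PATH and that repo_dir is the root of a local clone.

```python
import subprocess


def parse_ls_tree_line(line):
    """Parse one line of `git ls-tree` output: '<mode> <type> <hash>\\t<path>'."""
    meta, path = line.split("\t", 1)
    mode, obj_type, obj_hash = meta.split(" ")
    return {"mode": mode, "type": obj_type, "hash": obj_hash, "path": path}


def run_git(repo_dir, *args):
    """Run a git command inside repo_dir and return its stdout as text."""
    result = subprocess.run(
        ["git", *args],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def file_at_commit(repo_dir, commit_hash, path):
    """Return the contents of `path` as of `commit_hash`, or None if absent."""
    # Step 1: git ls-tree <commit hash> <path> gives the blob hash.
    listing = run_git(repo_dir, "ls-tree", commit_hash, path)
    if not listing.strip():
        return None  # the file did not exist at that commit
    entry = parse_ls_tree_line(listing.splitlines()[0])
    # Step 2: git cat-file blob <blob hash> gives the file contents.
    return run_git(repo_dir, "cat-file", "blob", entry["hash"])
```

Called once per commit and per file this spawns two git processes each time, so it is probably only practical for small experiments.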

The best date to associate with each commit is probably updated_at in metadata.json.

I wonder if it’s also worth attempting to retrospectively populate this repo with a download from http://dashboard.iatistandard.org/stats/ . Do you think that seems doable and/or worth it?

andylolz commented 3 years ago

I’ve had a go at this sort of approach in codeforIATI/dashboard#40.

andylolz commented 3 years ago

A couple more things on this (sorry for all these updates!)

@Bjwebb WDYT, does that make sense / seem reasonable?

andylolz commented 3 years ago

I’ve had a go at doing historical data stuff in https://github.com/codeforIATI/IATI-Stats-historical . I’ve also merged in stats-blacklisted stuff.

I’ve left all the gitaggregate stuff in the dashboard (now "analytics") repo for now. I guess that’s okay.

One problem so far is: generating timeliness stats seems really really slow! Would be great to try and speed that up.

I’ve also had a go at a dev version, which should appear here: https://analytics-dev.codeforiati.org

Bjwebb commented 3 years ago

Looks good. I agree, having a historical repo like that makes sense. Putting this in the dashboard repo looks okay, especially since it's not written to disk (I think) so we don't have to worry about the files appearing in an odd place.

andylolz commented 3 years ago

> One problem so far is: generating timeliness stats seems really really slow! Would be great to try and speed that up.

Aha, okay… The really really slow thing here is the git stuff I added! So I think we might need to keep a running total, and store it in the stats repo.
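To make the running-total idea concrete, here is one way it could look: a small JSON file mapping each run date to that day's value, updated in place on every run so the git history never needs to be walked. This is only a sketch. The function name, file layout and date format are assumptions, not what the stats code actually does.

```python
import json
from pathlib import Path


def update_running_history(history_path, date, value):
    """Record `value` under `date` in a JSON history file and return the history.

    history_path: hypothetical location of the running-total file.
    date: a string key such as "2021-05-07" (format is an assumption).
    """
    path = Path(history_path)
    # Start from the existing history if there is one.
    history = json.loads(path.read_text()) if path.exists() else {}
    # Re-running on the same date overwrites that day's entry.
    history[date] = value
    path.write_text(json.dumps(history, sort_keys=True, indent=2))
    return history
```

Each run then costs one file read and one write per stat, instead of a lookup per historical commit.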

andylolz commented 3 years ago

This ticket should now be fixed by d3c60f16a1dddfe0f51d2f8736607ef6b65cdfe1.

andylolz commented 3 years ago

I’ve pretty much put this back how it was originally! I modified gitaggregate and gitaggregate-publisher to run from the git history, but apart from that I think it works like the dashboard.

This does mean the IATI-Stats-public repo is much bigger… I don’t know if that is okay, really.