codeforIATI / IATI-Stats

Python application for generating JSON stats files from IATI data
https://stats.codeforiati.org

Calculate share of publishers' spending that is traceable #21

Open markbrough opened 2 years ago

markbrough commented 2 years ago

Currently, it is difficult to see the share of publishers' spending that is traceable. For example, we know that lots of NGO funding from FCDO and the Netherlands is traceable, but it is unclear how much funding through other implementing partners (e.g. multilaterals) is traceable. It would be useful to start getting a rough idea of this.

To start with, we should capture a list of provider-org/@provider-activity-id for any activities that have incoming traceability, as Incoming Funds and Incoming Commitments (transaction-type/@code = 1 or 11). See #19
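
A minimal sketch of that first pass, assuming IATI 2.x numeric transaction type codes and using only the standard library. The function name, the sample XML and the `XM-EXAMPLE-*` identifiers are made up for illustration; this is not the code in IATI-Stats.

```python
import xml.etree.ElementTree as ET

# IATI 2.x codes: 1 = Incoming Funds, 11 = Incoming Commitment
INCOMING_CODES = {"1", "11"}


def incoming_provider_activity_ids(xml_text):
    """Return the set of provider-activity-id values on incoming transactions."""
    root = ET.fromstring(xml_text.strip())
    refs = set()
    for transaction in root.iter("transaction"):
        ttype = transaction.find("transaction-type")
        if ttype is None or ttype.get("code") not in INCOMING_CODES:
            continue  # not an incoming transaction
        provider = transaction.find("provider-org")
        if provider is not None and provider.get("provider-activity-id"):
            refs.add(provider.get("provider-activity-id"))
    return refs


# Hypothetical sample data, not taken from any real publisher
SAMPLE = """
<iati-activities version="2.03">
  <iati-activity>
    <iati-identifier>XM-EXAMPLE-NGO-1</iati-identifier>
    <transaction>
      <transaction-type code="1"/>
      <value>100</value>
      <provider-org provider-activity-id="XM-EXAMPLE-DONOR-A1"/>
    </transaction>
    <transaction>
      <transaction-type code="11"/>
      <value>200</value>
      <provider-org provider-activity-id="XM-EXAMPLE-DONOR-A2"/>
    </transaction>
    <transaction>
      <transaction-type code="3"/>
      <value>50</value>
      <provider-org provider-activity-id="XM-EXAMPLE-OTHER-B1"/>
    </transaction>
  </iati-activity>
</iati-activities>
"""

print(sorted(incoming_provider_activity_ids(SAMPLE)))
```

Only the two incoming transactions contribute identifiers; the disbursement (code 3) is skipped.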

We can then run through all publishers' data again, and see which publishers' activities have corresponding activities published elsewhere that report receiving funding from them.
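
The second pass could look roughly like this, assuming we already have each publisher's activities with a spend value and the set of referenced identifiers from the first pass. The data shapes, names and figures here are assumptions for the sake of the example.

```python
def traceable_share(activity_values, referenced_ids):
    """Share of each publisher's spend sitting on activities that others reference.

    activity_values: {publisher_id: {activity_id: spend}}
    referenced_ids:  set of activity ids seen as provider-activity-id elsewhere
    Returns {publisher_id: fraction between 0 and 1, or None if there is no spend}.
    """
    shares = {}
    for publisher, activities in activity_values.items():
        total = sum(activities.values())
        traceable = sum(
            value for activity_id, value in activities.items()
            if activity_id in referenced_ids
        )
        shares[publisher] = traceable / total if total else None
    return shares


# Hypothetical figures for illustration only
activity_values = {
    "donor-a": {"A1": 75.0, "A2": 25.0},
    "donor-b": {"B1": 10.0},
}
referenced_ids = {"A1"}
shares = traceable_share(activity_values, referenced_ids)
```

Here `donor-a` comes out at 0.75 and `donor-b` at 0.0.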

Bjwebb commented 2 years ago

Notes from our call last week:

We have a traceability calculation on the dashboard already, but that is how much of a recipient publisher's incoming funds have an activity id. In this issue we're interested in how many of a provider publisher's activities have a recipient publisher linking to them.

We might eventually want to filter this by humanitarian flag. This will likely be similar to how we currently handle hierarchy.

Bjwebb commented 2 years ago

I've made a commit and generated some stats on the dev site:

Here's the ratio between these as percentages: https://gist.github.com/Bjwebb/7abf31ad55f09b4c470d5ad4b78eff73

Some of the numbers will be too low, as I excluded references to a publisher's own activities from the traceable sum but left them in the total sum. I'm going to change the code to exclude those activities from the total sum as well.

Bjwebb commented 2 years ago

Here's a spreadsheet with those same percentages, and percentages of activities as well as spend: https://docs.google.com/spreadsheets/d/1iwHB46-3Eq8_OCQ0uJzpYxTNyILwV1vkt0XzeI6oMNc/edit#gid=0 This is sorted by total spend descending, which gives a useful overview of the big publishers I think.

I've excluded activities referenced by a publisher's own activities from the denominator. Unfortunately that excludes some activities that are also referenced by other publishers, so it's possible to get >100%. I'll try to exclude only those activities that are only referenced by that publisher.
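
To illustrate the >100% problem and the proposed fix, here is a rough sketch. It assumes we track, for each activity, the set of publishers that reference it; the function names and numbers are hypothetical and this is not the code on the dev branch.

```python
def naive_percentage(owner, activity_values, referrers_by_activity):
    """Drops every self-referenced activity from the denominator only."""
    total = 0.0
    traceable = 0.0
    for activity_id, value in activity_values.items():
        referrers = referrers_by_activity.get(activity_id, set())
        if owner not in referrers:
            total += value  # denominator skips anything the owner references
        if referrers - {owner}:
            traceable += value  # numerator keeps anything others reference
    return 100 * traceable / total if total else None


def fixed_percentage(owner, activity_values, referrers_by_activity):
    """Drops only activities referenced exclusively by their own publisher."""
    total = 0.0
    traceable = 0.0
    for activity_id, value in activity_values.items():
        referrers = referrers_by_activity.get(activity_id, set())
        if referrers and referrers <= {owner}:
            continue  # self-referenced only: out of both sums
        total += value
        if referrers - {owner}:
            traceable += value
    return 100 * traceable / total if total else None


# Hypothetical data: A1 is referenced by its owner and by another publisher
values = {"A1": 60.0, "A2": 30.0, "A3": 20.0}
referrers = {"A1": {"donor-a", "ngo-x"}, "A2": {"donor-a"}}

naive = naive_percentage("donor-a", values, referrers)
fixed = fixed_percentage("donor-a", values, referrers)
```

With this data the naive version gives 300.0, because A1 counts in the numerator but not the denominator, while the fixed version gives 75.0.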

My work so far is at https://github.com/codeforIATI/IATI-Stats/compare/main...dev, although I hope to rebase that before opening a PR.

markbrough commented 2 years ago

This is looking really great, @Bjwebb ! Thank you for all this work. It's very exciting that we are already starting to get some real numbers here.

So I think we are currently counting as "traceable" the full value of a publisher X's activities which are referenced as providing incoming funds (either as Incoming Commitments or Incoming Funds) to any other publisher's activities?

Or are we only counting the value that is stated as Incoming Commitments / Incoming Funds on the other publishers' activities?

stevieflow commented 2 years ago

I'd like to suggest a revision in terms of how we count whether an activity is traceable

So far, I think we are in a binary scenario

Someone looking at this methodology could (devil's advocate) just then say "Ok, let's make sure all our activities have an outgoing link". They could alter their data and see their count rise to 100%.

I think we have to mitigate for such things. The key for IATI is that activities are connected - and collaboration between publishers (by publishing links to each other's activities) is how that is achieved.

It can get a bit complicated by the fact that the current assumed and preferred model of traceability is that of links pointing upwards -- organisations include links to their immediate donor / partner in their data, so we can "trace" through.

This is opposed to the start of any chain including downward links - the argument being that the org at the start of any chain may not know where the other activities are yet, as they don't exist

If we imagine this as a set network of activities for analysis:

[Image: IATI-network]

In this "network":

This might need revision in terms of implicit value being expressed by colours / labels, as there can be perfectly legitimate reasons for all cases - but I think it would be really useful to disaggregate how we calculate the types of activities we are finding

markbrough commented 2 years ago

Thanks @stevieflow - I think this makes sense, and I also think there's a decent argument in favour of making traceability "downwards" as well as "upwards". Though I'm not sure if there would be counter-arguments around redundancy etc?

I'm also a bit nervous about implementing a methodology that's quite a departure from what is currently generally expected / implemented by most publishers... Is this something you think should be implemented now, perhaps as an additional calculation to the one that @bjwebb has been working on?

Bjwebb commented 2 years ago

@markbrough

> So I think we are currently counting as "traceable" the full value of a publisher X's activities which are referenced as providing incoming funds (either as Incoming Commitments or Incoming Funds) to any other publisher's activities?

Yes, that's right, the full value (of all commitments + disbursements) for the referenced activity.

@stevieflow

> but I think it would be really useful to disaggregate how we calculate the types of activities we are finding

My work so far is only looking at incoming links (ie. I only look at provider-org/@provider-activity-id).

Bjwebb commented 2 years ago

BTW, a couple of other notes about my code so far:

stevieflow commented 2 years ago

@markbrough re: traceability methodology - I think the above expresses all possible ways activities can be "linked" - I was keen to try and not make a distinction between the yellow and orange scenarios, as they are all valid and feasible. I'd be more concerned if - at this stage - we baked in some preference for upstream traceability, when the true nature of the data standard permits any route

@Bjwebb

> My work so far is only looking at incoming links (ie. I only look at provider-org/@provider-activity-id).

Thanks. It'd also be useful to count receiver-org - but as a separate count to provider.
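
A sketch of what keeping the two counts separate might look like, assuming the outgoing link lives in `receiver-org/@receiver-activity-id` and counting per transaction without filtering on transaction type. The function name and sample identifiers are made up.

```python
import xml.etree.ElementTree as ET


def count_activity_links(xml_text):
    """Count transactions carrying a provider link and a receiver link, separately."""
    root = ET.fromstring(xml_text.strip())
    counts = {"provider": 0, "receiver": 0}
    for transaction in root.iter("transaction"):
        provider = transaction.find("provider-org")
        receiver = transaction.find("receiver-org")
        if provider is not None and provider.get("provider-activity-id"):
            counts["provider"] += 1
        if receiver is not None and receiver.get("receiver-activity-id"):
            counts["receiver"] += 1
    return counts


# Hypothetical sample data
LINKS_SAMPLE = """
<iati-activities version="2.03">
  <iati-activity>
    <iati-identifier>XM-EXAMPLE-MID-1</iati-identifier>
    <transaction>
      <transaction-type code="1"/>
      <provider-org provider-activity-id="XM-EXAMPLE-DONOR-A1"/>
    </transaction>
    <transaction>
      <transaction-type code="3"/>
      <receiver-org receiver-activity-id="XM-EXAMPLE-NGO-1"/>
    </transaction>
    <transaction>
      <transaction-type code="3"/>
      <receiver-org receiver-activity-id="XM-EXAMPLE-NGO-2"/>
    </transaction>
    <transaction>
      <transaction-type code="3"/>
      <receiver-org ref="XM-EXAMPLE-NGO"/>
    </transaction>
  </iati-activity>
</iati-activities>
"""

link_counts = count_activity_links(LINKS_SAMPLE)
```

The last transaction names a receiver organisation but no activity id, so it is not counted as a link.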

> we only expect references to activities at the bottom of the hierarchy

I think that's true. Some publishers might have internal links between hierarchies - but agree we should just look to the lowest set for now

> We could just look at "current" activities

Yes, good point. If there's a chance, then segmenting between current (using the PWYF definition?) and non-current could be interesting, but will start to make the data complex

markbrough commented 2 years ago

Feedback on work so far:

Bjwebb commented 2 years ago

I've added 1.0x support https://github.com/codeforIATI/IATI-Stats/commit/cf1c33e0971b3ee9a5179bdaee8a67c5e8acc7ca

I've found why I was getting >100%. My code that tried to exclude own ref activities from the denominator was broken. I've removed it, as it was broken and not straightforward to fix: https://github.com/codeforIATI/IATI-Stats/commit/a86ab4eed8cc4f7dd8919cb18a70a500ac914157. It looks like it doesn't make a big difference to the numbers coming out for large publishers.

The google sheet is up to date with these changes https://docs.google.com/spreadsheets/d/1iwHB46-3Eq8_OCQ0uJzpYxTNyILwV1vkt0XzeI6oMNc/edit#gid=0

I've also started thinking about what some "end to end" tests might look like, in terms of feeding in multiple publishers of dummy data, and checking the eventual output is what we expect. There's a branch for this, but there's not very much there yet https://github.com/codeforIATI/IATI-Stats/compare/dev...end-to-end-traceability-tests
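
As a rough idea of the shape such a test could take: feed in dummy XML for two made-up publishers, run a simplified stand-in for the pipeline, and compare against hand-computed sums. Everything here is an assumption for illustration (2.x codes 2 and 3 for commitments and disbursements, the `run_pipeline` helper, the `xm-*` publisher ids); it is not what is on the branch.

```python
import xml.etree.ElementTree as ET

INCOMING = {"1", "11"}  # Incoming Funds, Incoming Commitment
SPEND = {"2", "3"}      # Commitment, Disbursement


def run_pipeline(publisher_xml):
    """publisher_xml: {publisher_id: xml}. Returns per-publisher total and traceable sums."""
    values = {}      # activity id -> (publisher, commitments + disbursements)
    references = {}  # activity id -> set of publishers referencing it
    for publisher, xml_text in publisher_xml.items():
        root = ET.fromstring(xml_text.strip())
        for activity in root.iter("iati-activity"):
            identifier = activity.findtext("iati-identifier", default="").strip()
            spend = 0.0
            for transaction in activity.iter("transaction"):
                ttype = transaction.find("transaction-type")
                code = ttype.get("code") if ttype is not None else None
                if code in SPEND:
                    spend += float(transaction.findtext("value", default="0"))
                elif code in INCOMING:
                    provider = transaction.find("provider-org")
                    ref = provider.get("provider-activity-id") if provider is not None else None
                    if ref:
                        references.setdefault(ref, set()).add(publisher)
            values[identifier] = (publisher, spend)
    result = {p: {"total": 0.0, "traceable": 0.0} for p in publisher_xml}
    for identifier, (publisher, spend) in values.items():
        result[publisher]["total"] += spend
        # Traceable only if some *other* publisher references the activity
        if references.get(identifier, set()) - {publisher}:
            result[publisher]["traceable"] += spend
    return result


DONOR_XML = """
<iati-activities version="2.03">
  <iati-activity>
    <iati-identifier>XM-DONOR-D1</iati-identifier>
    <transaction><transaction-type code="2"/><value>100</value></transaction>
    <transaction><transaction-type code="3"/><value>50</value></transaction>
  </iati-activity>
  <iati-activity>
    <iati-identifier>XM-DONOR-D2</iati-identifier>
    <transaction><transaction-type code="3"/><value>50</value></transaction>
  </iati-activity>
</iati-activities>
"""

NGO_XML = """
<iati-activities version="2.03">
  <iati-activity>
    <iati-identifier>XM-NGO-N1</iati-identifier>
    <transaction>
      <transaction-type code="1"/><value>50</value>
      <provider-org provider-activity-id="XM-DONOR-D1"/>
    </transaction>
    <transaction><transaction-type code="3"/><value>20</value></transaction>
  </iati-activity>
</iati-activities>
"""


def test_end_to_end_traceability():
    result = run_pipeline({"xm-donor": DONOR_XML, "xm-ngo": NGO_XML})
    assert result["xm-donor"] == {"total": 200.0, "traceable": 150.0}
    assert result["xm-ngo"] == {"total": 20.0, "traceable": 0.0}


test_end_to_end_traceability()
```

The donor's D1 activity is referenced by the NGO, so its full 150 counts as traceable, matching the "full value" approach described earlier in the thread.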

markbrough commented 2 years ago

Thank you, this all looks great, @Bjwebb !

markbrough commented 2 years ago

So one issue I have noticed is that the publisher-specific files appear to be empty for all publishers here: current/aggregated-publisher/fcdo/traceable_sum_commitments_and_disbursements_by_publisher_id.json

~That also appears to be the case for the existing traceability file: current/aggregated-publisher/fcdo/traceable_activities_by_publisher_id.json~ UPDATE: sorry, I mistook the new count of activities for something quite different in the publishing statistics.

Does it seem as though this would be complicated to adjust, so that the amount for each publisher (from the big list of amounts for each publisher) is stated in each publisher's file?
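
To make the request concrete, something along these lines is what I have in mind: take the big list of amounts keyed by publisher and write each publisher's own entry into that publisher's file. This is only a sketch; the directory layout is guessed from the path above, and the helper name, publisher ids and amounts are made up.

```python
import json
import tempfile
from pathlib import Path

STAT = "traceable_sum_commitments_and_disbursements_by_publisher_id"


def write_publisher_files(base_dir, stat_name, aggregated):
    """Write each publisher's own entry from `aggregated` into its own JSON file."""
    written = {}
    for publisher, amount in aggregated.items():
        path = Path(base_dir) / "aggregated-publisher" / publisher / f"{stat_name}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({publisher: amount}))
        written[publisher] = path
    return written


# Made-up publishers and amounts, written to a temporary directory
aggregated = {"xm-donor": 150.0, "xm-ngo": 0.0}
with tempfile.TemporaryDirectory() as tmp:
    written = write_publisher_files(tmp, STAT, aggregated)
    contents = {p: json.loads(path.read_text()) for p, path in written.items()}
```

Each file then holds a single-entry object such as `{"xm-donor": 150.0}` instead of being empty.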