codeforIATI / gov-id-finder-data

Data for https://gov-id-finder.codeforiati.org
MIT License

Duplicate IDs #1

Open stevieflow opened 9 months ago

stevieflow commented 9 months ago

Take a look at

https://github.com/codeforIATI/gov-id-finder-data/blob/main/data/KE.csv

There are several entries for the ID 111.

The cause is most probably the data source.
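A quick way to confirm which IDs repeat is a small script along these lines. This is only a sketch: the column names `code` and `name` and the sample rows are assumptions for illustration, not the actual layout of KE.csv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the column names "code" and "name" are assumed
# and may not match the real KE.csv header.
SAMPLE = """code,name
111,Ministry of Medical Services
111,Sigor
112,Example Ministry
"""


def find_duplicate_codes(csv_text, code_field="code", name_field="name"):
    """Return {code: [names]} for every code that appears on more than one row."""
    names_by_code = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        names_by_code[row[code_field]].append(row[name_field])
    return {code: names for code, names in names_by_code.items() if len(names) > 1}


duplicates = find_duplicate_codes(SAMPLE)
print(duplicates)
```

To run it against a real file, read the file's text and pass the actual column names as `code_field` and `name_field`.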

What do you think @markbrough ?

matmaxgeds commented 9 months ago

I think that although the Ministry of Medical Services and Sigor both have the code 111 in the extraction, that is because the extraction used the ADM2 level and missed the preceding grouping number(s). 111 is listed under 'central' government spending, so it could really be '1-111', whereas Sigor is a county or similar and falls under the Community Development Fund, so its full code should be e.g. '2-111'. I am guessing it might be the same for the third use of 111. So the issue lies in how the data was extracted from the data source: it would be incredibly rare for a CoA to have multiple entities with the same code (the government accounts system would break immediately). I suspect the solution is that ADM2 codes cannot be extracted or used without the preceding ADM1 code, whether that is a number, a letter, or a word.
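To illustrate the point about qualifying each code with its parent grouping, here is a minimal sketch. The parent values `1` and `2` simply follow the illustrative '1-111' and '2-111' above; they are assumptions, not values taken from the BOOST source, and `full_code` and `duplicated` are hypothetical helpers.

```python
from collections import Counter

# Hypothetical rows: (parent_code, code, name). The parent codes are
# illustrative only and are not taken from the real data source.
ROWS = [
    ("1", "111", "Ministry of Medical Services"),
    ("2", "111", "Sigor"),
]


def full_code(parent_code, code, sep="-"):
    """Join the parent grouping and the entity code into one identifier."""
    return f"{parent_code}{sep}{code}"


def duplicated(codes):
    """Return the sorted list of codes that occur more than once."""
    return sorted(code for code, count in Counter(codes).items() if count > 1)


bare = [code for _, code, _ in ROWS]
qualified = [full_code(parent, code) for parent, code, _ in ROWS]

print(duplicated(bare))       # the bare codes collide
print(duplicated(qualified))  # the qualified codes do not
```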

Also to add, 111 is the Ministry of Medical Services in 2006, but in 2013 it looks like 111 is now the Ministry of Lands, Urban Housing etc. That could cause significant miscoding, depending on the date of the data being matched to the codes. When this was added I lobbied for these CoAs to be added with a date built in to avoid such mistakes, and it still looks useful to me. In that case, this data file would be labelled KE-2013 in the CfI list; if someone cared, KE-2016 could then be added, and users could make an informed choice (though one that would require more code to handle for automated users).
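As a rough idea of the extra code an automated user would need for dated codelists, here is a sketch. The `CODELISTS` structure and the `lookup` function are hypothetical, and the two entries only restate the 2006 and 2013 meanings of 111 mentioned above.

```python
# Hypothetical versioned codelists, keyed by the year each version applies from.
CODELISTS = {
    2006: {"111": "Ministry of Medical Services"},
    2013: {"111": "Ministry of Lands, Urban Housing etc."},
}


def lookup(code, year, codelists=CODELISTS):
    """Resolve a code using the latest codelist version dated on or before `year`."""
    applicable = [version for version in codelists if version <= year]
    if not applicable:
        raise LookupError(f"no codelist version covers {year}")
    return codelists[max(applicable)].get(code)


print(lookup("111", 2010))
print(lookup("111", 2014))
```

The choice of "latest version on or before the data's year" is one possible rule; a user could equally pick a version explicitly.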

stevieflow commented 8 months ago

Thanks @matmaxgeds

@markbrough any further thoughts?

markbrough commented 6 months ago

Sorry, just came across this! It looks like it's an issue with the way the data was extracted from the source (BOOST). I am working through tidying up some of these data sources. @matmaxgeds I will note any significant changes since the original data capture (in February 2022) to see how much of an issue it is that CoAs sometimes change.

markbrough commented 6 months ago

So, after looking into this quite a bit, it appears that the top-level ministries (or votes) change occasionally but the sub-votes don't change so often. This codelist needed quite a bit of work as there were indeed a number of local counties and municipalities which were captured here, but shouldn't be. It appears that the BOOST dataset used as the source for this codelist has been updated since it was originally published, so I am not sure where those codes would have come from, as they don't appear in the new dataset.

It does look like there has been a break in the COA coding between 2012/13 and 2013/14: https://www.treasury.go.ke/wp-content/uploads/2021/05/FY-2012-13-Development-Budget.pdf https://www.treasury.go.ke/wp-content/uploads/2021/05/FY-2013-14-Development-Budget.pdf

e.g.

The codes have been pretty consistent since then. @stevieflow I think we should discuss the need to review the methodology for COA codes as part of any discussions on organisation identifiers at the IATI MA?

markbrough commented 5 months ago

This is now up to date: https://gov-id-finder.codeforiati.org/countries/KE

I just did a very big update of lots of codelists. There are some codelists which have been recoded (e.g. ZA, ZW, CI) but I think most haven't really changed @matmaxgeds -- but you can take a look at the commit history to see for yourself. Note that some of the renumbering of codes is because the original transcription was incorrect (e.g. missing leading zeroes, or using a source that just happened to show organisations sequentially). It is a lot of work to maintain one codelist for each country, so I don't think it is realistic to maintain multiple versions for each year.
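On the leading-zero point, one common way such errors creep in is a tool converting codes to numbers. A small sketch of the effect, using a made-up code and an assumed `code` column:

```python
import csv
import io

# Hypothetical row; "0111" and the column names are made up for illustration.
SAMPLE = "code,name\n0111,Example Vote\n"

row = next(csv.DictReader(io.StringIO(SAMPLE)))

as_text = row["code"]              # csv keeps every field as a string
as_number = str(int(row["code"]))  # a numeric round trip drops the leading zero

print(as_text, as_number)
```

Keeping codes as strings end to end avoids this particular problem.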