datasets / publicbodies

A database of public bodies such as government departments, ministries etc.
http://publicbodies.org
MIT License

Organisation identifiers (for discussion) #41

Open markbrough opened 11 years ago

markbrough commented 11 years ago

This is an idea that I've been thinking about for a while. I discussed it with @rgrp a couple of weeks ago and wanted to share it with the list to see what everyone thinks.

The short version: could public bodies be used to generate usable organisation identifiers?

Background

The IATI Standard is an XML-based format for sharing detailed information about aid projects. Fundamentally, the model shows resource flows from one organisation to another, with various classifications in between and many financial transactions as part of each project. For example:

activity (DFID -> World Health Organisation)
  - transaction (GBP 500 disbursed on 2013-05-01)
  - transaction (GBP 500 disbursed on 2013-07-05)

For the private sector and NGOs, the methodology for uniquely identifying organisations is:

Jurisdiction-National registration body-Number e.g. for Oxfam GB, registered at the Charity Commission, with reg number 202918: GB-CHC-202918

For governments, the following methodology is used: Jurisdiction-OECD/DAC Agency code e.g. for the UK's Department for International Development: GB-1

For multilaterals, we use the following methodology: OECD/DAC Channel code e.g. for the World Bank's International Development Association (IDA): 44002
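As a rough illustration of the three patterns above, here is a minimal Python sketch that assembles the example identifiers. The function names are made up for this example and are not part of the IATI Standard.

```python
def ngo_identifier(jurisdiction: str, registry: str, number: str) -> str:
    """Jurisdiction-National registration body-Number."""
    return "-".join([jurisdiction, registry, number])


def government_identifier(jurisdiction: str, agency_code: str) -> str:
    """Jurisdiction-OECD/DAC agency code."""
    return f"{jurisdiction}-{agency_code}"


def multilateral_identifier(channel_code: str) -> str:
    """Multilaterals use the bare OECD/DAC channel code."""
    return channel_code


print(ngo_identifier("GB", "CHC", "202918"))  # Oxfam GB
print(government_identifier("GB", "1"))       # DFID
print(multilateral_identifier("44002"))       # IDA
```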

Problems

Agency codes

Many organisations publishing IATI data will therefore struggle to provide unique organisation identifiers for many of the public sector / international organisations that they are working with.

Rationale

Fuzzy reconciliation / text matching of organisations, with an API that assigns an existing identifier where available, and creates a new one where it's not available

1) Organisations (initially, preferably those with a large amount of data) throw four key pieces of data at the API:

2) the API responds with one of the following (possibly using HTTP status codes?):
a) Organisation found => use code BW-1
b) Organisation not found => created code BW-21

it also stores the data about the last recorded transaction, so that other people know that that organisation may have existed on that date.
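To make the idea more concrete, here is a minimal Python sketch of the found / created flow. Everything in it is an assumption for illustration: the `Registry` class, the 0.85 similarity threshold, the use of `difflib` for fuzzy matching, and the per-country counter for new codes. Dates are ISO strings so that they compare correctly as text.

```python
from difflib import SequenceMatcher


class Registry:
    """Hypothetical in-memory store: identifier -> name, plus last-seen dates."""

    def __init__(self):
        self.names = {}      # identifier -> organisation name
        self.last_seen = {}  # identifier -> date of last recorded transaction

    @staticmethod
    def _similarity(a, b):
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    def reconcile(self, country, name, transaction_date, threshold=0.85):
        prefix = country + "-"
        candidates = [
            (self._similarity(name, known), code)
            for code, known in self.names.items()
            if code.startswith(prefix)
        ]
        best_score, best_code = max(candidates, default=(0.0, None))
        if best_code is not None and best_score >= threshold:
            status, code = "found", best_code
        else:
            # No close match: mint the next numeric code for this country.
            used = [int(c[len(prefix):]) for c in self.names if c.startswith(prefix)]
            code = f"{prefix}{max(used, default=0) + 1}"
            self.names[code] = name
            status = "created"
        # Remember the most recent transaction date seen for this code.
        previous = self.last_seen.get(code)
        if previous is None or transaction_date > previous:
            self.last_seen[code] = transaction_date
        return status, code
```

If this sat behind an HTTP API, "found" and "created" could plausibly map to 200 and 201 responses, but that is only one option.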

Another source could be Charts of Accounts, existing lists (like those that exist on PB already), budget documents, and structured spending data, e.g. from OpenSpending.

Dealing with duplicates

This will probably lead to some duplicates being created. There could be some manual reconciliation for this. Organisations could have a primary identifier and several secondary identifiers that were used by duplicate organisations.
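A minimal sketch of the primary / secondary idea, assuming the manual reconciliation simply records which secondary identifier points to which primary one. The identifiers and the mapping below are hypothetical.

```python
# Hypothetical result of manual reconciliation: secondary -> primary identifier.
secondary_to_primary = {
    "BW-21": "BW-1",
    "BW-34": "BW-1",
}


def primary_identifier(identifier):
    """Return the primary identifier, or the input if it is not a known duplicate."""
    return secondary_to_primary.get(identifier, identifier)
```

Data that already uses a secondary identifier stays valid; consumers just look up the primary one when comparing records.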

Dealing with changing organisations

Organisations can be created / deleted / merged in the real world. This should probably lead to:
a) created - a new identifier gets created;
b) merged - a new identifier gets created for the new organisation; and (manually) the old organisations are linked / related to the new organisation;
c) deleted - the identifier continues to exist, because old (and possibly future) data will still refer to it. However, it should be (manually) marked as no longer existing, pointing to a successor organisation if one exists (with some flag to explain whether it's a wholly .

Questions

1) Does this sound sensible? Is it a good idea? Is there a better alternative?
2) Will the fuzzy matching be accurate enough to be useful? Is it likely to assign organisations an incorrect code?
3) How should the identifiers be identified as being created by Public Bodies - just a prefix like PB-?

OECD-DAC codelists:

marians commented 11 years ago

Wouldn't it be great to think in URLs here? E.g., Oxfam could then be http://publicbodies.org/GB-CHC-202918 or even better http://publicbodies.org/GB/CHC/202918.

practicalparticipation commented 11 years ago

@marians In the IATI ORG ID standard we've designed it to work for legacy systems where URLs are not valid values of a database field, and to avoid being tied to any particular domain for resolving identifiers, but so that the pattern of identifiers can be very easily converted into URLs and resolved by any number of services.

See for example: http://opencirce.org/org/

So URL-compatible, string-based IDs have been the guiding principle.
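As a sketch of how a string identifier could be turned into a URL by any resolving service, here is a small Python helper. The function name and the `path_style` option are assumptions for illustration; the base URL is passed in precisely so that the identifier is not tied to one domain.

```python
from urllib.parse import quote


def identifier_to_url(identifier, base_url, path_style=False):
    """Build a resolvable URL from a string identifier.

    With path_style=True the identifier is split into at most three
    hyphen-separated parts, so hyphens inside a registration number survive.
    """
    parts = identifier.split("-", 2) if path_style else [identifier]
    path = "/".join(quote(part, safe="") for part in parts)
    return base_url.rstrip("/") + "/" + path


print(identifier_to_url("GB-CHC-202918", "http://publicbodies.org"))
print(identifier_to_url("GB-CHC-202918", "http://publicbodies.org", path_style=True))
```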

marians commented 11 years ago

@practicalparticipation Thanks for the comment & info!

practicalparticipation commented 11 years ago

@markbrough Easy questions first:

3) How should the identifiers be identified as being created by Public Bodies - just a prefix like PB-?

In the IATI ORG ID standard the namespace should really be:

MISC-PB-{ID}

At the moment the registry of namespaces for IATI is just a spreadsheet - but there's the goal of making this a shared list and getting some better services for managing it in future, including services that can help resolve namespace prefixes into URLs for getting more information on any named entity.

Now the trickier ones:

_(1) Does this sound sensible? Is it a good idea? Is there a better alternative?_

Obviously the best case is when official lists of bodies actually exist, but we know this is often not the case. Presumably the mapping element of this would mean that if an official list did become available, it could be mapped to these 'incubator ids'. If the service also provided a 'canonical ID' API that, when called with an ID, checked whether a better one had come along or whether the requested ID had been merged with another, and returned a canonical ID, we would get to a far better place in terms of users being able to find when they are dealing with data about the same organisation.
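A minimal sketch of what such a 'canonical ID' lookup might do, assuming the service keeps a simple "replaced by" mapping covering both merges and upgrades to official codes. The function name and the identifiers are hypothetical.

```python
def canonical_id(identifier, replaced_by):
    """Follow 'replaced by' links until an identifier with no replacement is reached."""
    seen = {identifier}
    while identifier in replaced_by:
        identifier = replaced_by[identifier]
        if identifier in seen:
            # Guard against bad data such as A -> B -> A.
            raise ValueError(f"cycle in replacement chain at {identifier}")
        seen.add(identifier)
    return identifier


# Hypothetical data: an incubator ID merged into another,
# which was later mapped to a code from an official list.
replaced_by = {
    "MISC-PB-7": "MISC-PB-3",
    "MISC-PB-3": "XX-GOV-12",
}
print(canonical_id("MISC-PB-7", replaced_by))
```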

I doubt we'll get many original govt publishers of data using these IDs, but the potential for them to harmonise how re-users of the data represent the information they have is interesting.

The risk of false positives and bad matches getting into data and leading to wrong conclusions 'downstream' is fairly big with this - so thinking about provenance or 'certainty' information that an API might return could be important.

_(2) Will the fuzzy matching be accurate enough to be useful? Is it likely to assign organisations an incorrect code?_

I think this is going to be a challenge and a risk. When we get down to names of schools, health services etc. then real problems of name clashes are likely to occur. At the level of departments the problem is less likely.

It might be useful to think about other data that could be used in fuzzy matching, like 'city of head office', 'website address' etc., to help firm up matches.
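One way this could look, as a rough Python sketch: combine per-field similarities into a single certainty score that an API could return alongside the match. The field names, the weights, and the use of `difflib` are all assumptions for illustration, not a tested matching method.

```python
from difflib import SequenceMatcher

# Assumed weights; real values would need tuning against known matches.
WEIGHTS = {"name": 0.6, "city": 0.2, "website": 0.2}


def match_certainty(query, candidate):
    """Weighted average of string similarity over the fields both records have."""
    total = 0.0
    weight_used = 0.0
    for field, weight in WEIGHTS.items():
        a, b = query.get(field), candidate.get(field)
        if not a or not b:
            continue  # skip fields missing from either record
        total += weight * SequenceMatcher(None, a.lower(), b.lower()).ratio()
        weight_used += weight
    return total / weight_used if weight_used else 0.0
```

Two schools with the same name in different cities would score lower than an exact match on both fields, which is the kind of name clash the extra fields are meant to catch.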

bill-anderson commented 11 years ago

For IATI this issue is fast heading away from being a problem towards becoming a road crash. So my current twopence is:

We've spent the last year or so searching for a methodology that has both pragmatic logic and political traction. There's nothing substantial out there and a depressing lack of interest from a range of bodies who, one would think, need this as badly as IATI does.

We've always said that it is not IATI's business to curate such a methodology: it needs a wider home. But we're reaching a point where we've got to do something.

The way we are going to solve this problem is by thrashing as many ideas around as possible - so this is an excellent thread. Good stuff @markbrough !

I'm not convinced that a system based on the machine interpretation of spelling, however sophisticated the algorithm is, is going to be efficient enough.

Here's another imperfect idea...

https://docs.google.com/spreadsheet/ccc?key=0AnWngmdQt3stdGNDVDB5SlZrWVNkd0w4a1FWX0xTY2c#gid=0

I've scraped the CIA Heads of States list and built a (tidied) list of names of current departments and added a code which is a mixture of the name and a counter (which allows new names to be added manually in at least some kind of logical order).

In IATI the Rwandan Ministry of Finance and Economic Planning would become something like

MISC-PB-RW-FI18

The problem with this coding is that the code is language-specific. Not a good idea for a global list.

With this approach the list is centrally curated and manual intervention would be required to create a new code. Is this a good or bad thing? While names of government departments may be maintained with relative ease, government agencies are a whole different ball game.

jpmckinney commented 11 years ago

The Sunlight Foundation proposes using a UUID (and possibly scoping the UUID to a country) and then using a reconciliation/ID resolution service to avoid duplicates: https://github.com/opencivicdata/opencivicdata/wiki/Entity-ID-Resolution-Service

rufuspollock commented 11 years ago

Note connection here with #23 and discussion around keys ...

augusto-herrmann commented 5 years ago

Considering that government structure changes quite frequently in most countries, I think this project should have some instructions or guidelines on how to handle the merge, split, and transformation of public bodies.

We could take as an example the policy paper from OpenCorporates on How OpenCorporates should handle company number problems. There should be some identifiable parallels between how they deal with company data and how we deal with public organizations data.

markbrough commented 5 years ago

Hey @augusto-herrmann thanks for bringing this thread back to life, I had forgotten about it :)

A couple of years ago @practicalparticipation was commissioned by IATI to write a discussion paper on this which is worth taking a look at. It explores a number of different approaches. There is a bunch of discussion on that paper here. My own view now is that we should be using (existing) government Charts of Accounts as the primary source for these codelists (rather than the approach I had set out above).

I know that this approach would be imperfect, but my argument is that it is at least a solid start to dealing with this problem. I haven't really seen anything to dissuade me from this argument over the last couple of years.

markbrough commented 2 years ago

An update on this issue: we now have codes for 50 countries, based on country budgets or charts of accounts, extracted and published here: https://gov-id-finder.codeforiati.org/

The source repository is here: https://github.com/codeforIATI/gov-id-finder-data

According to the methodology detailed on the site, the organisation identifier for Ministry of Health and Social Welfare - Liberia is LR-COA-310