@drkane and @michaelwood will arrange a time to talk about the data store's modular structure and how to hook into that
additional data currently created here: https://github.com/ThreeSixtyGiving/datastore/blob/master/datastore/additional_data/grant/__init__.py
(note to self: this needs reorganising a bit and moving out of `__init__`)
used at loading: https://github.com/ThreeSixtyGiving/datastore/blob/master/datastore/db/management/commands/load_datagetter_data.py#L55
stored in: https://github.com/ThreeSixtyGiving/datastore/blob/master/datastore/db/models.py#L185
@drkane I guess this is sort of the same question as with the postcode stuff, in that it probably makes sense to have a local cache. If we added a local cache of findthatcharity results that we wiped every month(?), would it cope OK with that?
Thinking we have a simple model in additional_data such as [ charity number (int) | info (json) ]; when we don't have any data in there, we go to https://findthatcharity.uk/charity/%s.json % charity_number
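Roughly something like this sketch (the model name, fields and import path are just placeholders, not real datastore code):

```python
# Hedged sketch: look up cached findthatcharity data, falling back to the API on a miss.
# "CharityCache" and its import path are illustrative, not the datastore's real model.
import requests

from additional_data.models import CharityCache  # hypothetical model: number (int), info (json)


def get_charity_info(charity_number):
    cached = CharityCache.objects.filter(number=charity_number).first()
    if cached:
        return cached.info
    resp = requests.get("https://findthatcharity.uk/charity/%s.json" % charity_number)
    resp.raise_for_status()
    info = resp.json()
    CharityCache.objects.create(number=charity_number, info=info)
    return info
```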
@michaelwood I think so. There are a few different options for this; it would be good to get your and @BibianaC's views on what would work best.
I've got an expanded findthatcharity (currently at dev.findthatcharity.uk) which will have a standardised Organisation record based on the schema.org/360Giving organisation format. These records are gathered from a variety of sources and use the org id as an identifier. What I'm not sure of is the best way of getting them into the datastore. We could either:
Not sure of the best way to proceed.
Hmm yeah, 200,000 API calls wouldn't be much data (~200 MiB?), but it could take a long time, especially if we were gentle on the server with 4 requests a second or something (at that rate it's roughly 14 hours).
So a single download of the data dump might be the simplest and quickest option, if that is already possible.
Otherwise: querying another database/table after a scrape has run would be the next best. Or, if it is trivial to add a scrapy pipeline that inserts data via our Django models, it would be neat to have it managed in the same way as the postcode models.
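For illustration only, a scrapy pipeline writing via Django models might look roughly like this (the model name, import path and item fields are placeholders, and it assumes DJANGO_SETTINGS_MODULE is set and django.setup() has run before the crawl):

```python
# Hedged sketch: a scrapy item pipeline writing items straight into a Django model,
# so the records would be managed the same way as the postcode models.
# "OrgRecord" and its import path are illustrative, not part of the datastore.
class DjangoOrgPipeline:
    def process_item(self, item, spider):
        from db.models import OrgRecord  # hypothetical import path

        OrgRecord.objects.update_or_create(
            org_id=item["id"],              # assumes the scraped item carries an "id" field
            defaults={"data": dict(item)},  # keep the rest of the record as JSON
        )
        return item
```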
Yep, data dump is pretty easy. The only thing is working out how best to do it automatically - I could probably do it every time findthatcharity updates its data.
As a gzipped CSV file it's 77MB. The column headings are:
- `id` - org id format
- `name`
- `charityNumber`
- `companyNumber`
- `postalCode`
- `url`
- `latestIncome`
- `latestIncomeDate`
- `dateRegistered`
- `dateRemoved`
- `active` - true|false
- `dateModified`
- `orgIDs` - array of alternative org ids
- `organisationType` - array of organisation types
- `source`
- `canonical_orgid`
Adding the `canonical_orgid` is currently what takes up most of the time (because it means looking up every row from a postgres db in an elasticsearch instance), so removing that field would make producing it quicker.
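For reference, reading the dump once it has been downloaded could be as simple as something like this (the filename is a placeholder; the keys match the column headings above):

```python
# Hedged sketch: stream rows out of the gzipped CSV dump.
import csv
import gzip


def iter_dump_rows(path="findthatcharity_dump.csv.gz"):  # assumed local filename
    """Yield one dict per organisation; keys match the column headings above."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            yield row


# Example: pick out the fields we might key the cache on.
for row in iter_dump_rows():
    print(row["id"], row["charityNumber"], row["active"])
    break
```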
The data dump sounds good.
I have envisaged this model for the datastore:
Effectively just the keys that we will do the lookup on from the grant data (i.e. charity number), with the rest of the data as a JSON field. I'm keen that we see this data as a cache rather than part of the datastore's "official" schema.
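Roughly along these lines (model and field names are placeholders, not the actual schema; assumes Django on Postgres):

```python
# Hedged sketch: lookup keys plus an opaque JSON payload, treated as a cache.
from django.db import models


class OrgInfoCache(models.Model):  # illustrative name only
    org_id = models.CharField(max_length=200, db_index=True)
    charity_number = models.CharField(max_length=50, blank=True, db_index=True)
    # models.JSONField needs Django 3.1+; older versions use django.contrib.postgres.fields.JSONField
    data = models.JSONField()  # the rest of the findthatcharity record, stored as-is
    last_updated = models.DateTimeField(auto_now=True)
```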
@drkane do you have a link for a findthatcharity data dump? Thanks
@michaelwood - yep. They can currently be found at URLs in the following format: https://ftc.ftc.dkane.net/orgid/source/{sourceid}.csv, e.g.:
There's a list of the sources here: https://github.com/drkane/find-that-charity-scrapers/#spiders
@michaelwood Is it worth a call to check how things are going? I'm aware I haven't been much practical help - if you were able to explain the structure of the import process framework, I could probably help to write some of the implementations.
The links above will now download all organisations, not just active organisations. You can recreate the old behaviour by adding `?exclude_inactive` to the URL.
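So building a URL is just something like this (the source id is an example; real ones are in the spiders list linked above):

```python
# Hedged sketch: build a per-source dump URL, optionally excluding inactive organisations.
def source_csv_url(source_id, exclude_inactive=False):
    url = "https://ftc.ftc.dkane.net/orgid/source/%s.csv" % source_id
    return url + "?exclude_inactive" if exclude_inactive else url


print(source_csv_url("ccew", exclude_inactive=True))  # "ccew" is an assumed example source id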
@drkane Great thanks.
I've written some logic around importing the multiple org-ids from the CSVs into an Array type, which is a little bit creative; not sure if there is a better way. (Ignore my comment "For some reason the data also uses single quotes" as that's obviously to avoid breaking the CSV field.)
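Roughly what I mean, as a sketch (the example value is made up, and there may well be a neater way):

```python
# Hedged sketch: turn an orgIDs CSV cell like "['GB-CHC-123456', 'GB-COH-01234567']"
# into a Python list. The inner single quotes keep the double-quoted CSV field intact.
import ast


def parse_org_ids(cell):
    if not cell:
        return []
    try:
        value = ast.literal_eval(cell)  # safely evaluates the quoted list literal
    except (ValueError, SyntaxError):
        return [cell]  # fall back to treating the cell as one org-id
    return list(value) if isinstance(value, (list, tuple)) else [value]
```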
Other than that it all seems to be working. I will do some tidy up and submit a pull request for review.
Additional data for recipient organisation information, to support other services that use the datastore.
Data sources: org-id and possibly https://github.com/drkane/find-that-charity-scrapers/