ThreeSixtyGiving / datastore

A Data Store application for 360Giving

Add recipient organisation Additional Data #28

Closed: michaelwood closed this issue 4 years ago

michaelwood commented 4 years ago

Additional data about recipient organisations, to support other services that use the datastore.

additional_data
- recipient organisation
  - postcode for those without one (which can then be used for geographical data)
  - the latest income of the organisation
  - the age of the organisation
  - the type/legal form of the organisation
  - a "canonical" ID for the organisation

Data sources: org-id and possibly https://github.com/drkane/find-that-charity-scrapers/
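A rough sketch of what that additional_data block could look like on a grant record; all the field names here are illustrative assumptions, not a settled schema:

```python
# Illustrative only: a possible shape for additional_data on a recipient
# organisation. Field names are assumptions, not the final schema.
additional_data = {
    "recipientOrganization": {
        "postcode": "EC1V 9HX",               # filled in where the grant lacks one
        "latest_income": 250000,              # latest reported income
        "org_age_years": 12,                  # age of the organisation
        "org_type": "Registered Charity",     # type/legal form
        "canonical_org_id": "GB-CHC-123456",  # "canonical" org-id
    }
}
```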

michaelwood commented 4 years ago

Additional data on recipient organisations

robredpath commented 4 years ago

@drkane and @michaelwood will arrange a time to talk about the data store's modular structure and how to hook into that

michaelwood commented 4 years ago

additional data currently created here: https://github.com/ThreeSixtyGiving/datastore/blob/master/datastore/additional_data/grant/__init__.py

(note to self: this needs reorganising a bit and moving out of __init__)

used at loading: https://github.com/ThreeSixtyGiving/datastore/blob/master/datastore/db/management/commands/load_datagetter_data.py#L55

stored in: https://github.com/ThreeSixtyGiving/datastore/blob/master/datastore/db/models.py#L185
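As a very rough sketch of that flow (the function and model names here are assumptions, not the datastore's actual API; the real code lives in the files linked above), the loader decorates each grant with additional data before saving it:

```python
# Names are assumed for illustration only.
def load_grant(grant_json, generate_additional_data, Grant):
    """Attach additional data to a grant and store it."""
    grant_json["additional_data"] = generate_additional_data(grant_json)
    return Grant.objects.create(data=grant_json)
```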

michaelwood commented 4 years ago

@drkane I guess this is sort of the same question as the postcode stuff, in that it probably makes sense to have a local cache. If we added a local cache of findthatcharity results that we wiped every month(?), would it cope OK with that?

I'm thinking we have a simple model in additional_data such as [ charity number (int) | info (json) ]; when we don't have any data in there, we go to https://findthatcharity.uk/charity/%s.json % charity_number (see the sketch below).
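A minimal sketch of that idea, assuming Django 3.1+ (for models.JSONField) and the requests library; the model and helper names are made up for illustration:

```python
import requests
from django.db import models

class CharityCache(models.Model):  # hypothetical model name
    charity_number = models.IntegerField(unique=True)
    info = models.JSONField()

def charity_info(charity_number):
    """Return cached info for a charity, fetching from findthatcharity on a miss."""
    try:
        return CharityCache.objects.get(charity_number=charity_number).info
    except CharityCache.DoesNotExist:
        resp = requests.get(
            "https://findthatcharity.uk/charity/%s.json" % charity_number
        )
        resp.raise_for_status()
        info = resp.json()
        CharityCache.objects.create(charity_number=charity_number, info=info)
        return info
```

The monthly wipe would then just be a scheduled CharityCache.objects.all().delete().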

drkane commented 4 years ago

@michaelwood I think so. There are a few different options for this; it would be good to get your and @BibianaC's views on what would work best.

I've got an expanded findthatcharity (currently at dev.findthatcharity.uk) which will have a standardised Organisation record which is based on the schema.org/360Giving organisation format. These records are gathered from a variety of sources and use the org id as an identifier. What I'm not sure of is the best way of getting them into the datastore. We could either:

Not sure of the best way to proceed.

michaelwood commented 4 years ago

Hmm, yeah, 200,000 API calls wouldn't be much data (~200MiB?), but it could take a long time, especially if we were gentle on the server at 4 requests a second or something (200,000 requests at 4/s is roughly 14 hours).

So a single download of the data dump might be the simplest and quickest option, if that is already possible.

Otherwise:

Querying another database/tables after having done a scrape would be the next best. Or, if it's trivial to add a Scrapy pipeline that inserts data via our Django models, it would be neat to have it managed in the same way as the postcode models (see the sketch below).
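For the Scrapy option, the hook point would be an item pipeline. A hedged sketch, assuming a hypothetical OrgInfoCache model on the datastore side and that Django is configured before the spider runs:

```python
import django
django.setup()  # assumes DJANGO_SETTINGS_MODULE points at the datastore settings

from additional_data.models import OrgInfoCache  # hypothetical model

class DatastorePipeline:
    """Scrapy item pipeline that writes scraped org records via Django models."""

    def process_item(self, item, spider):
        OrgInfoCache.objects.update_or_create(
            org_id=item["org_id"],  # field name assumed from the scraper output
            defaults={"data": dict(item)},
        )
        return item
```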

drkane commented 4 years ago

Yep, data dump is pretty easy. The only thing is working out how best to do it automatically - I could probably do it every time findthatcharity updates its data.

As a gzipped CSV file it's 77MB. The column headings are:

Adding the canonical_orgid is currently what takes up most of the time (because it means looking up every row from a Postgres DB in an Elasticsearch instance), so removing that field would make producing the dump quicker.

michaelwood commented 4 years ago

The data dump sounds good.

I have envisaged this model on the datastore:

https://github.com/ThreeSixtyGiving/datastore/pull/32/commits/e8e46321b09ece63bce06028ef7f2b674e37c4cc#diff-592aa937943d3ac113983eb645744966

Effectively it just has the keys we'll use to look up from the grant data (i.e. charity number), with the rest of the data in a JSON field. I'm keen that we see this data as a cache rather than part of the datastore's "official" schema.

michaelwood commented 4 years ago

@drkane do you have a link for a find that charity data dump? Thanks

drkane commented 4 years ago

@michaelwood - yep. They can currently be found at URLs in the following format: https://ftc.ftc.dkane.net/orgid/source/{sourceid}.csv, e.g.:

There's a list of the sources here: https://github.com/drkane/find-that-charity-scrapers/#spiders
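For reference, a small sketch of pulling one source's dump into Python (standard-library csv/io plus requests; the source id is a placeholder):

```python
import csv
import io
import requests

def fetch_source(sourceid):
    """Download one source's CSV dump and parse it into a list of dicts."""
    url = "https://ftc.ftc.dkane.net/orgid/source/%s.csv" % sourceid
    resp = requests.get(url)
    resp.raise_for_status()
    return list(csv.DictReader(io.StringIO(resp.text)))
```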

drkane commented 4 years ago

@michaelwood Is it worth a call to check how things are going? I'm aware I haven't been much practical help - if you were able to explain the structure of the import process framework, I could probably help write some of the implementations.

drkane commented 4 years ago

The links above will now download all organisations, not just active organisations. You can recreate the old behaviour by adding ?exclude_inactive to the URL.
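So, hypothetically (the source id here is a placeholder taken from the spiders list linked above):

```python
# Recreate the old active-only behaviour by appending ?exclude_inactive:
sourceid = "ccew"  # placeholder source id
url = "https://ftc.ftc.dkane.net/orgid/source/%s.csv?exclude_inactive" % sourceid
```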

michaelwood commented 4 years ago

@drkane Great thanks.

I've written some logic around importing the multiple org-ids into an Array type from the CSVs, which is a little bit creative; not sure if there is a better way (roughly the idea sketched below). (Ignore my comment "For some reason the data also uses single quotes" as that's obviously to avoid breaking the CSV field.)

https://github.com/ThreeSixtyGiving/datastore/commit/d904c35e880bca3e171f2af4d747dd135860df60#diff-0a8399dbdb611d46fb71ebe2057e7b2eR37
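Something like this, assuming the cell holds a Python-style list literal with single-quoted strings (which ast.literal_eval copes with directly); the helper name is made up:

```python
import ast

def parse_org_ids(cell):
    """Turn a CSV cell like "['GB-CHC-123', 'GB-COH-456']" into a Python list.

    The single quotes inside the cell avoid clashing with the CSV's own
    double-quote escaping, which is why they appear in the data.
    """
    return ast.literal_eval(cell) if cell else []
```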

Other than that it all seems to be working. I will do some tidying up and submit a pull request for review.