@drkane and @michaelwood will arrange a time to talk about the data store's modular structure and how to hook into that
additional data currently created here: https://github.com/ThreeSixtyGiving/datastore/blob/master/datastore/additional_data/grant/__init__.py
(note to self: this needs reorganising a bit and moving out of `__init__`)
used at loading: https://github.com/ThreeSixtyGiving/datastore/blob/master/datastore/db/management/commands/load_datagetter_data.py#L55
stored in: https://github.com/ThreeSixtyGiving/datastore/blob/master/datastore/db/models.py#L185
@drkane I guess this is sort of the same question as with the postcode stuff, in that it probably makes sense to have a local cache. If we added a local cache of findthatcharity results that we wiped every month(?), would it cope OK with that?
Thinking we have a simple model in additional_data such as [ charity number (int) | info (json) ]; when we don't have any data in there, we go to https://findthatcharity.uk/charity/%s.json % charity_number
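Roughly something like this sketch (the model name, fields and import path are just placeholders, not real datastore code):

```python
# Hedged sketch: look up cached findthatcharity data, falling back to the API on a miss.
# "CharityCache" and its import path are illustrative, not the datastore's real model.
import requests

from additional_data.models import CharityCache  # hypothetical model: number (int), info (json)


def get_charity_info(charity_number):
    cached = CharityCache.objects.filter(number=charity_number).first()
    if cached:
        return cached.info
    resp = requests.get("https://findthatcharity.uk/charity/%s.json" % charity_number)
    resp.raise_for_status()
    info = resp.json()
    CharityCache.objects.create(number=charity_number, info=info)
    return info
```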
@michaelwood I think so. There are a few different options for this; it would be good to get your and @BibianaC's views on what would work best.
I've got an expanded findthatcharity (currently at dev.findthatcharity.uk) which will have a standardised Organisation record based on the schema.org/360Giving organisation format. These records are gathered from a variety of sources and use the org id as an identifier. What I'm not sure of is the best way of getting them into the datastore. We could either:
Not sure of the best way to proceed.
Hmm yeah, 200,000 API calls wouldn't be much data (~200 MiB?), but it could take a long time, especially if we were gentle on the server with 4 requests a second or something (at that rate it's roughly 14 hours).
So a single download of the data dump might be the simplest and quickest option, if that is already possible.
Otherwise: querying another database/table after a scrape has run would be the next best. Or, if it is trivial to add a scrapy pipeline that inserts data via our Django models, it would be neat to have it managed in the same way as the postcode models.
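For illustration only, a scrapy pipeline writing via Django models might look roughly like this (the model name, import path and item fields are placeholders, and it assumes DJANGO_SETTINGS_MODULE is set and django.setup() has run before the crawl):

```python
# Hedged sketch: a scrapy item pipeline writing items straight into a Django model,
# so the records would be managed the same way as the postcode models.
# "OrgRecord" and its import path are illustrative, not part of the datastore.
class DjangoOrgPipeline:
    def process_item(self, item, spider):
        from db.models import OrgRecord  # hypothetical import path

        OrgRecord.objects.update_or_create(
            org_id=item["id"],              # assumes the scraped item carries an "id" field
            defaults={"data": dict(item)},  # keep the rest of the record as JSON
        )
        return item
```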
Yep, data dump is pretty easy. The only thing is working out how best to do it automatically - I could probably do it every time findthatcharity updates its data.
As a gzipped CSV file it's 77MB. The column headings are:
- `id` - org id format
- `name`
- `charityNumber`
- `companyNumber`
- `postalCode`
- `url`
- `latestIncome`
- `latestIncomeDate`
- `dateRegistered`
- `dateRemoved`
- `active` - true|false
- `dateModified`
- `orgIDs` - array of alternative org ids
- `organisationType` - array of organisation types
- `source`
- `canonical_orgid`
Adding the `canonical_orgid` is currently what takes up most of the time (because it means looking up every row from a postgres db in an elasticsearch instance), so removing that field would make producing it quicker.
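For reference, reading the dump once it has been downloaded could be as simple as something like this (the filename is a placeholder; the keys match the column headings above):

```python
# Hedged sketch: stream rows out of the gzipped CSV dump.
import csv
import gzip


def iter_dump_rows(path="findthatcharity_dump.csv.gz"):  # assumed local filename
    """Yield one dict per organisation; keys match the column headings above."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            yield row


# Example: pick out the fields we might key the cache on.
for row in iter_dump_rows():
    print(row["id"], row["charityNumber"], row["active"])
    break
```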
The data dump sounds good.
I have envisaged this model for the datastore:
Effectively just the keys that we will do the lookup on from the grant data (i.e. charity number), with the rest of the data as a JSON field. I'm keen that we see this data as a cache rather than part of the datastore's "official" schema.
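Roughly along these lines (model and field names are placeholders, not the actual schema; assumes Django on Postgres):

```python
# Hedged sketch: lookup keys plus an opaque JSON payload, treated as a cache.
from django.db import models


class OrgInfoCache(models.Model):  # illustrative name only
    org_id = models.CharField(max_length=200, db_index=True)
    charity_number = models.CharField(max_length=50, blank=True, db_index=True)
    # models.JSONField needs Django 3.1+; older versions use django.contrib.postgres.fields.JSONField
    data = models.JSONField()  # the rest of the findthatcharity record, stored as-is
    last_updated = models.DateTimeField(auto_now=True)
```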
@drkane do you have a link for a findthatcharity data dump? Thanks
@michaelwood - yep. They can currently be found at URLs in the following format: https://ftc.ftc.dkane.net/orgid/source/{sourceid}.csv, e.g.:
There's a list of the sources here: https://github.com/drkane/find-that-charity-scrapers/#spiders
@michaelwood Is it worth a call to check how things are going? I'm aware I haven't been much practical help - if you were able to explain the structure of the import process framework, I could probably help to write some of the implementations.
The links above will now download all organisations, not just active organisations. You can recreate the old behaviour by adding `?exclude_inactive` to the URL.
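So building a URL is just something like this (the source id is an example; real ones are in the spiders list linked above):

```python
# Hedged sketch: build a per-source dump URL, optionally excluding inactive organisations.
def source_csv_url(source_id, exclude_inactive=False):
    url = "https://ftc.ftc.dkane.net/orgid/source/%s.csv" % source_id
    return url + "?exclude_inactive" if exclude_inactive else url


print(source_csv_url("ccew", exclude_inactive=True))  # "ccew" is an assumed example source id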
@drkane Great thanks.
I've written some logic around importing the multiple org-ids from the CSVs into an Array type, which is a little bit creative; not sure if there is a better way. (Ignore my comment "For some reason the data also uses single quotes" as that's obviously to avoid breaking the CSV field.)
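Roughly what I mean, as a sketch (the example value is made up, and there may well be a neater way):

```python
# Hedged sketch: turn an orgIDs CSV cell like "['GB-CHC-123456', 'GB-COH-01234567']"
# into a Python list. The inner single quotes keep the double-quoted CSV field intact.
import ast


def parse_org_ids(cell):
    if not cell:
        return []
    try:
        value = ast.literal_eval(cell)  # safely evaluates the quoted list literal
    except (ValueError, SyntaxError):
        return [cell]  # fall back to treating the cell as one org-id
    return list(value) if isinstance(value, (list, tuple)) else [value]
```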
Other than that it all seems to be working. I will do some tidy up and submit a pull request for review.
Additional data for recipient organisation information, to support other services that use the datastore.
Data sources: org-id and possibly https://github.com/drkane/find-that-charity-scrapers/