[CWM] Write a script(s) to import datasets

lin-d-hop commented 1 month ago

Description

Data is incoming for the CWM. This spreadsheet lists the data sets that have been shared, and some that are expected. We need to import this data.

The initial data imports will be:

[x] NCBA Members
[x] DotCoop registrants (.coop & marque)
[x] ICA
[x] Co-ops UK Open Data
[x] USDA
[x] NCG
[x] US Federally Insured Credit Unions
[x] Farm Credit Administrators
[x] IFFCO membership data
[x] Cooperatives Mutuals Canada Federation
[x] US Worker Co-op Federation

All of these are linked from the spreadsheet. Other data sets are expected to be incoming while this issue is in play, but I think it would be prudent to scope the work at these datasets and extend the scope if resources allow. I want to make sure we have capacity for bug fixing once we get to start doing more thorough end to end testing.

Acceptance Criteria

The above data sets have been imported and are visible on our megamap.

ColmDC commented 1 month ago

Added 2 sets to the list in the description. This is the final list of data sets for Delhi, as 25th Oct was the cut-off date.

ColmDC commented 1 month ago

This ticket doesn't mention merging. Shall we treat that as a second ticket?

wu-lee commented 1 month ago

Step 1. I think the entry point to our systems is via the data factory (DF): every data set needs something to normalise it and geocode the addresses, which is what the DF does. The results get published on the web as standard.csv files (whether or not that's a good name), plus other bits (like a vocabs.json file, and lots of TTL and RDF files). That could be part of this issue? But it is not described explicitly yet.
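To make Step 1 concrete, here is a minimal sketch of a normalise-and-geocode pass. It is not the data factory's actual code: the output column names, the `field_map` argument and the `geocode` callback are all hypothetical, and the real standard.csv schema is not spelled out in this thread.

```python
import csv
import io
from typing import Callable, Dict, Optional, Tuple

# Hypothetical output columns; the real standard.csv schema may differ.
STANDARD_FIELDS = ["Identifier", "Name", "Address", "Latitude", "Longitude"]

# A geocoder takes an address and returns (lat, lng), or None if not found.
Geocoder = Callable[[str], Optional[Tuple[float, float]]]


def normalise(raw_csv: str, field_map: Dict[str, str], geocode: Geocoder) -> str:
    """Rename source columns to the standard ones and add coordinates."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=STANDARD_FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Copy each mapped source column, trimming stray whitespace.
        std = {dst: (row.get(src) or "").strip() for src, dst in field_map.items()}
        # Only geocode rows that actually have an address.
        coords = geocode(std["Address"]) if std.get("Address") else None
        std["Latitude"], std["Longitude"] = coords if coords else ("", "")
        writer.writerow(std)
    return out.getvalue()
```

Passing the geocoder in as a function keeps the sketch independent of whichever geocoding service is actually used, and lets it be stubbed in tests.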

Then there's a "Step 2" which is the merging. You're right, this might be a separate thing, so might warrant another issue. The output of this is probably a table of organisations (maybe a CSV, akin to the standard.csv above), plus some category indexes used by that table (maybe a vocabs.json file). This will get big. We may want to think about keeping track of both the inputs and the output via something version-controlly, but perhaps not conventional version control, as that's very focussed on text rather than data.
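As a rough illustration of what the Step 2 merge could look like, here is a sketch that folds several data sets into one table of organisations. The matching rule (same name and address, ignoring case and spacing) is purely an assumption for the example, as are the field names `Name`, `Address` and `Sources`; the real merge criteria are not decided in this thread.

```python
def match_key(record):
    """Assumed matching rule: same name and address, ignoring case and spacing."""
    name = " ".join(record.get("Name", "").lower().split())
    address = " ".join(record.get("Address", "").lower().split())
    return (name, address)


def merge_datasets(datasets):
    """Merge records from several data sets into one list of organisations.

    datasets maps a data set id to a list of record dicts. Each merged
    organisation keeps the first non-empty value seen for every field and
    a "Sources" list naming the data sets it appeared in.
    """
    merged = {}
    for dataset_id, records in datasets.items():
        for record in records:
            entry = merged.setdefault(match_key(record), {"Sources": []})
            if dataset_id not in entry["Sources"]:
                entry["Sources"].append(dataset_id)
            for field, value in record.items():
                # Fill gaps only; never overwrite a value already present.
                if value and not entry.get(field):
                    entry[field] = value
    return list(merged.values())
```

Recording the contributing data sets on each merged row is one cheap way to keep the output traceable back to its inputs, which seems relevant to the version-tracking concern above.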

Then there's the script which takes that raw data and munges it into whatever mykomap wants on the back end. That is Step 3, and it's what this issue seems to concern.

Note, this step 3 script does not necessarily have to wait for the previous two steps, as it would presumably need to be able to work on the output of the demo-merge-map script, which exists already. (That script probably isn't suitable for "step 2" here as it isn't scalable to more than a small number of data sets.)
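For the sake of discussion, a Step 3 script might look something like the sketch below. The output shape is entirely hypothetical (a JSON object with a `categories` index and an `items` list that refers to it by position); what the mykomap back end actually expects is not described here.

```python
import csv
import io
import json


def build_backend_payload(standard_csv: str, category_field: str) -> str:
    """Turn a CSV of organisations into an indexed JSON payload (assumed shape)."""
    rows = list(csv.DictReader(io.StringIO(standard_csv)))

    # Distinct category values, in order of first appearance.
    categories = []
    for row in rows:
        value = row[category_field]
        if value not in categories:
            categories.append(value)

    # Replace each row's category text with its position in the index.
    items = []
    for row in rows:
        item = {k: v for k, v in row.items() if k != category_field}
        item[category_field] = categories.index(row[category_field])
        items.append(item)

    return json.dumps({"categories": categories, "items": items}, indent=2)
```

Because it only needs a CSV as input, a script like this could be tried against the demo-merge-map output before Steps 1 and 2 are finished.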

ColmDC commented 2 weeks ago

The data sets ticked above as of All Hands 07/11 were the data sets in the North America Merge Map, with the ICA and DotCoop data filtered to only include North American + Mexican co-ops.

lin-d-hop commented 2 weeks ago

This issue has been replaced by issues #45, #46, #47, #48 and #49. Closing :)

wu-lee commented 1 week ago

Just for visibility / the record, related PRs:

#50
#51
#53

DigitalCommons / mykomap-monolith

[CWM] Write a script(s) to import datasets #30
