Closed lin-d-hop closed 2 weeks ago
Added 2 sets to the list in the description. This is the final list of data sets for Delhi, as 25th Oct was the cut-off date.
This ticket doesn't mention merging. Shall we treat that as a second ticket?
Step 1. I think the entry point to our systems is via the data factory: every data set needs something to normalise it and geocode the addresses, which is what the DF does. The results get published on the web as standard.csv files (whether or not that's a good name), plus other bits (like a vocabs.json file, and lots of TTL and RDF files). That could be part of this issue? But it is not described explicitly yet.
Then there's a "Step 2" which is the merging. You're right, this might be a separate thing, so might warrant another issue. The output of this is probably a table of organisations (maybe a CSV, akin to the standard.csv above), plus some category indexes used by that table (maybe a vocabs.json file). This will get big. We may want to think about keeping track of both the inputs and the output via something version-controlly, but perhaps not conventional version control, as that's very focussed on text rather than data.
Then there's the script which takes that raw data and munges it into whatever mykomap wants on the back end. Which is Step 3 and what this issue seems to concern.
Note, this step 3 script does not necessarily have to wait for the previous two steps, as it would presumably need to be able to work on the output of the demo-merge-map script, which exists already. (That script probably isn't suitable for "step 2" here as it isn't scalable to more than a small number of data sets.)
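To make "Step 2" a bit more concrete, here is a minimal sketch of merging several normalised CSVs into one table while recording a content hash per input, as one lightweight way of keeping track of what went into a merge. It is only an illustration under assumptions: the function name `merge_datasets`, the `dataset` column and the example column names are hypothetical, and it does not reflect the real standard.csv schema or the demo-merge-map script.

```python
import csv
import hashlib
import io


def merge_datasets(datasets):
    """Merge CSV texts keyed by a (hypothetical) dataset id.

    Returns (rows, manifest): rows is a list of dicts, each tagged with the
    dataset it came from; manifest maps dataset id -> SHA-256 of the input.
    """
    merged = []
    manifest = {}
    # Sort the ids so the output order is stable between runs.
    for dataset_id in sorted(datasets):
        text = datasets[dataset_id]
        # The hash identifies exactly which version of the input was merged.
        manifest[dataset_id] = hashlib.sha256(text.encode("utf-8")).hexdigest()
        for row in csv.DictReader(io.StringIO(text)):
            merged.append({"dataset": dataset_id, **row})
    return merged, manifest


# Example with made-up data; real inputs would be the published CSV files.
example = {
    "b": "Identifier,Name\n1,Beta Co-op\n",
    "a": "Identifier,Name\n1,Alpha Co-op\n2,Acme Co-op\n",
}
rows, manifest = merge_datasets(example)
```

This does no de-duplication of organisations across data sets, which a real merge would presumably need; it only shows the shape of the inputs and output.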
The data sets ticked above as of the All Hands on 07/11 were the data sets in the North America Merge Map, with the ICA and DotCoop data filtered to include only North American and Mexican co-ops.
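For the record, the kind of filtering described above could look roughly like the sketch below. The column name `Country ID`, the function name `filter_by_country` and the exact set of country codes are assumptions for illustration, not what the actual merge map used.

```python
# Assumed ISO 3166-1 alpha-2 codes for the countries kept in the filtered data.
NORTH_AMERICA = {"CA", "MX", "US"}


def filter_by_country(rows, allowed=NORTH_AMERICA, column="Country ID"):
    """Keep only rows whose country code (hypothetical column) is allowed."""
    kept = []
    for row in rows:
        # Treat a missing or empty value as "no country" and drop the row.
        code = (row.get(column) or "").strip().upper()
        if code in allowed:
            kept.append(row)
    return kept


# Made-up rows, only to show the call.
sample = [
    {"Name": "Alpha Co-op", "Country ID": "US"},
    {"Name": "Beta Co-op", "Country ID": "gb"},
    {"Name": "Gamma Co-op", "Country ID": " mx "},
    {"Name": "Delta Co-op"},
]
filtered = filter_by_country(sample)
```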
This issue has been replaced by issues #45, #46, #47, #48 and #49. Closing :)
Just for visibility / the record, related PRs:
Description
Data is incoming for the CWM. This spreadsheet lists the data sets that have been shared, and some that are expected. We need to import this data.
The initial data imports will be:
All of these are linked from the spreadsheet. Other data sets are expected to be incoming while this issue is in play, but I think it would be prudent to scope the work at these datasets and extend the scope if resources allow. I want to make sure we have capacity for bug fixing once we get to start doing more thorough end to end testing.
Acceptance Criteria