CivicTechAtlanta / show-me-the-food

Show food sources in Atlanta (and beyond!)
MIT License

figure out how to incorporate/normalize data sources #6

Open switzersc opened 10 years ago

switzersc commented 10 years ago

We also need to finalize how we'd like to handle the different data sources. Options:

  1. Add almost all data to one Locations table, adding columns (or serialized data in one JSON column in the Postgres DB) as needed, and merging/deduplicating records as CSVs are added. PROS: quick and easy way to normalize data, aggregate as much data as possible, and make it as easy as possible for data scientists/visualizers to use. CONS: requires developers to maintain and add CSVs, and may get unsustainable and messy with a lot of very different data dumped into one table.
  2. Keep most data in CSVs and store only super basic info in a Locations table (addresses, coordinates), plus which sources each location appears in. The CSVs could then be queried and the data kept as unadulterated as possible, while still being related to common records. PROS: more scalable, and perhaps keeps the data more 'pristine'. CONS: would require more advanced querying, would require developers for ongoing support, and CSVs kind of suck.
  3. Store each CSV as its own table, with rows related to a common Locations table (similar to 2, but storing the data in the database instead of as CSV files). PROS: same as above, plus the added benefits of using a database, like performance/querying goodness. CONS: still requires developers to add/maintain data sources.
  4. Store each CSV row as a JSON-data row in a larger sources table. This would more easily allow users to add CSV data to the database through a web interface without losing data or having to normalize anything.
  5. Use Elasticsearch as a data store and search engine, because that'd be baller and might be the most flexible solution since we don't quite know how we want to use the data yet.
  6. Move to a NoSQL database and pretend we're cool.

Ideas? Thoughts?
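To make option 4 concrete, here's a minimal sketch of storing raw CSV rows as JSON in a single sources table. This is only an illustration, not a committed design: SQLite stands in for the Postgres database mentioned above, and all table, column, and source names are made up.

```python
# Hypothetical sketch of option 4: each CSV row is stored as-is as JSON
# in one "sources" table, tagged with the source it came from.
import csv
import io
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE sources (
           id INTEGER PRIMARY KEY,
           source_name TEXT NOT NULL,
           row_json TEXT NOT NULL  -- would be a jsonb column in Postgres
       )"""
)

def load_csv(conn, source_name, csv_text):
    """Insert every CSV row without normalizing columns."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        conn.execute(
            "INSERT INTO sources (source_name, row_json) VALUES (?, ?)",
            (source_name, json.dumps(row)),
        )

# Illustrative data only.
farmers_markets = "name,address\nGrant Park Market,600 Cherokee Ave SE\n"
load_csv(conn, "farmers_markets", farmers_markets)

rows = [json.loads(r) for (r,) in
        conn.execute("SELECT row_json FROM sources WHERE source_name = ?",
                     ("farmers_markets",))]
print(rows[0]["name"])  # Grant Park Market
```

The upside is that adding a new CSV never requires a schema migration; the downside is that any cross-source querying has to happen on the JSON, which is where Postgres's jsonb operators (or Elasticsearch, per option 5) would earn their keep.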

webmaven commented 10 years ago

I'm working on a similar problem in my spare time, trying to consolidate data on plant varieties (particularly heirloom food crops). Each source needs to be scrubbed and deduplicated, and then during aggregation not all sources have the same fields available, so the canonical data set has the union of fields from the sources, along with pointers to the provenance of the data (which I haven't worked out yet).
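The aggregation step described above could be sketched roughly like this: the merged record carries the union of fields from all sources, and each field keeps a pointer to the source that supplied it. This is only a guess at one shape for the provenance idea; the function and source names are invented for illustration, and the "later source wins" tiebreak is an assumption.

```python
# Hedged sketch: merge records from several sources into one canonical
# record (union of fields) while tracking per-field provenance.

def merge_with_provenance(records):
    """records: list of (source_name, fields_dict) pairs.
    Later sources win ties; every field remembers its source."""
    merged, provenance = {}, {}
    for source_name, fields in records:
        for key, value in fields.items():
            merged[key] = value
            provenance[key] = source_name
    return merged, provenance

# Illustrative data only.
records = [
    ("seed_savers", {"variety": "Cherokee Purple", "days_to_maturity": 80}),
    ("usda_grin", {"variety": "Cherokee Purple", "origin": "Tennessee"}),
]
merged, provenance = merge_with_provenance(records)
print(merged["origin"], provenance["origin"])  # Tennessee usda_grin
```

A real version would also need conflict detection (two sources disagreeing on a field) rather than a silent last-writer-wins, but the per-field provenance map is the part that answers "where did this value come from?".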

I'll be very interested in seeing how you solve your similarly shaped problems.

urbildpunkt commented 9 years ago

I am not sure if this is at all useful or relevant, but I wanted to leave this here just in case: Miso Dataset.