dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

data importer tool #3

Open max-mapper opened 10 years ago

max-mapper commented 10 years ago

consider this: https://github.com/maxogden/dat-oakland-land-use

look at the import script in package.json. it essentially runs these commands

wget -N http://data.openoakland.org/sites/default/files/Oakland_Parcels_06-01-13.zip
unzip -o Oakland_Parcels_06-01-13.zip

and then these as a pipe chain

csv-join http://data.openoakland.org/sites/default/files/ParcelUseCodes2013_0.csv 'Use Code' Oakland_Parcels_06-01-13.csv 'Use code' |
bcsv |
trim-object-stream |
dat import --json --primary "Assessor's Parcel Number (APN) sort format"

it would be pretty cool if we had something conceptually along the lines of gulp or grunt, but way more minimal. basically, take the transformations code in dat and make it a standalone module for hooking up data flow/pipe chains using modules from npm
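A minimal sketch of what hooking up such a pipe chain could look like with nothing but Node's built-in stream module. This is not dat's transformations code: `trimObject`, `mapStream` and `chain` are hypothetical names made up for illustration, and `trimObject` only approximates what a step like trim-object-stream is assumed to do (trim whitespace from row keys and values).

```javascript
const { Readable, Transform, Writable, pipeline } = require('stream');

// Assumed behaviour of a trimming step: strip whitespace from every key
// and from every string value of a row object.
function trimObject(row) {
  const out = {};
  for (const [key, value] of Object.entries(row)) {
    out[key.trim()] = typeof value === 'string' ? value.trim() : value;
  }
  return out;
}

// Wrap a plain row -> row function as an object-mode Transform stream.
function mapStream(fn) {
  return new Transform({
    objectMode: true,
    transform(row, _encoding, callback) {
      callback(null, fn(row));
    }
  });
}

// Connect source -> transforms -> sink. pipeline() forwards errors from
// any stage to `done` and tears the streams down afterwards.
function chain(rows, fns, onRow, done) {
  const sink = new Writable({
    objectMode: true,
    write(row, _encoding, callback) {
      onRow(row);
      callback();
    }
  });
  pipeline(Readable.from(rows), ...fns.map(mapStream), sink, done);
}
```

For example, `chain([{ ' Use Code ': ' 1100 ' }], [trimObject], console.log, (err) => { if (err) throw err; })` passes a single row through one transform. Real steps published on npm would slot in wherever `mapStream(fn)` appears.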

we could call it pipechain or something, and you could write a json file listing the steps for it to run, similar to dat transformations but covering the use case of getting data into dat in the first place
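One way such a json file and its runner might look, as a rough sketch only. The config shape (`before` for sequential setup commands, `pipe` for the stdout-to-stdin chain) and the function names `toCommandLine` and `run` are assumptions invented here, not an existing pipechain or dat format. The commands themselves are the ones from the Oakland example above.

```javascript
const { spawn } = require('child_process');

// Hypothetical config: what a pipechain json file could contain.
const exampleConfig = {
  before: [
    'wget -N http://data.openoakland.org/sites/default/files/Oakland_Parcels_06-01-13.zip',
    'unzip -o Oakland_Parcels_06-01-13.zip'
  ],
  pipe: [
    "csv-join http://data.openoakland.org/sites/default/files/ParcelUseCodes2013_0.csv 'Use Code' Oakland_Parcels_06-01-13.csv 'Use code'",
    'bcsv',
    'trim-object-stream',
    'dat import --json --primary "Assessor\'s Parcel Number (APN) sort format"'
  ]
};

// Turn the config into one shell command line: setup steps are joined
// with && so a failure stops the run, pipe steps are joined with |.
function toCommandLine(config) {
  const before = config.before || [];
  const pipe = config.pipe || [];
  const steps = before.slice();
  if (pipe.length > 0) steps.push(pipe.join(' | '));
  return steps.join(' && ');
}

// Hand the command line to the system shell, sharing this process's stdio.
function run(config) {
  return spawn(toCommandLine(config), { shell: true, stdio: 'inherit' });
}
```

Delegating to the shell keeps the sketch close to plain npm run: the runner adds no stream handling of its own, and each step is just a command that reads stdin and writes stdout.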

cc @mafintosh

max-mapper commented 10 years ago

a few more thoughts:

on a spectrum with grunt on one end, gulp in the middle and npm run on the other end, I think we need something with a unified 'marketing' effort along the lines of gulp and grunt that is actually just npm run. the problem with npm run is that it's a feature lost in the sea of npm features: it doesn't have its own readme, logo, name or community

RichardLitt commented 9 years ago

Huh. I did a study in 2011 of Kepler and Taverna workflow systems that found that basically 38% of the workflows used in bioinformatics were shims - essentially, data converters. I bring this up because there's already an extensive scientific literature on what ideal streaming data conversion might look like. I can look around for some papers if you want any, although it's not this field and would probably be pretty technical. You might want to ask @bmpvieira, seems up his alley.

Building a gulp-like system for dat would be pretty fantastic, I think. I only bring this up because it might be useful to look at best practices or suggestions before attempting a minimal system. Might not be, also. But I think a shimming tool like this would be awesome - and it would be very useful outside of the dat framework, as a whole. I know I would like to use it for linguistics data work.

doowb commented 9 years ago

I just created vinyl-dat, which provides a src and dest method for dat databases in gulp workflows. I know it's not as minimal as just using npm run, but it might be useful for people who already use gulp and just want to add dat.