dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

Modules for munging #18

Open allthesignals opened 9 years ago

allthesignals commented 9 years ago

(In reference to: https://twitter.com/maxogden/status/552158433051291648)

I work for a government agency, and one of the datasets we regularly use for analysis is HUD's Comprehensive Housing Affordability Strategy (CHAS) data (http://www.huduser.org/portal/datasets/cp.html). This dataset is more granular than ACS data, which makes it very useful for researching affordable housing issues.

One of the problems, of course, is that the data is cross-tabulated at the deepest possible levels across more than 20 tables, making it time-consuming to simplify. Worse, it's split into dozens of CSVs that are useless without a very large Excel spreadsheet containing the metadata. I've written some R code to simplify these tabulations and to provide other calculations, like margins of error and percentages. The metadata is good enough that I can filter the data programmatically against it (a lot like subqueries in SQL).
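To make the metadata-as-subquery idea concrete, here is a minimal sketch in TypeScript. The field names (`column_id`, `tenure`, and so on) are invented for illustration; the real CHAS data dictionary uses its own codes.

```ts
// Sketch: use a metadata table to select the data columns that match a
// criterion, then keep only those columns from each data record.
// All field names here are hypothetical, not the real CHAS codes.

interface MetaRow {
  column_id: string; // e.g. "T1_est3"
  universe: string;  // e.g. "Renter-occupied units"
  tenure: string;    // e.g. "Renter"
}

// The "subquery": select the column ids whose metadata matches a predicate.
function columnsWhere(meta: MetaRow[], pred: (m: MetaRow) => boolean): Set<string> {
  return new Set(meta.filter(pred).map((m) => m.column_id));
}

// Project each data record down to the selected columns.
function project(rows: Record<string, string>[], cols: Set<string>): Record<string, string>[] {
  return rows.map((row) =>
    Object.fromEntries(Object.entries(row).filter(([k]) => cols.has(k)))
  );
}

// Usage: keep only renter-tenure estimate columns.
// const renterCols = columnsWhere(meta, (m) => m.tenure === "Renter");
// const slimmed = project(dataRows, renterCols);
```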

I imagine something like Dat, especially alongside an ecosystem built to address these issues (open-source data munging), would help significantly with this process if the relevant pipeline modules were available.

Since I've done most of this work in R, I imagined creating a module with R as a dependency, but I'm not sure whether 1) this is the right approach, or 2) Dat is really intended for this.
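One way such a module could work, assuming an R dependency is acceptable, is to shell out to Rscript and treat the R script as a stream transform. This is only a sketch, not a real Dat module API; `munge.R` is a hypothetical script that reads CSV on stdin and writes transformed CSV on stdout.

```ts
// Sketch only: wrap an existing R script as a pipeable child process so it
// can sit in a Node-style pipeline (e.g. between a CSV source and an import
// step). Assumes Rscript is on the PATH; "munge.R" is hypothetical.
import { spawn } from "child_process";
import { createReadStream } from "fs";

function rTransform(script: string) {
  const child = spawn("Rscript", [script]); // stdio defaults to pipes
  child.stderr.pipe(process.stderr);        // surface R errors
  return child;
}

// Usage: pipe a raw CSV through the R script and print the result.
const r = rTransform("munge.R");
createReadStream("raw.csv").pipe(r.stdin);
r.stdout.pipe(process.stdout);
```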

Separately, another use case in my work is updating the dozens of "regional indicators" we use to gauge how communities are progressing against our regional plan (I work at an urban planning agency). Most of this work is baked into Excel spreadsheets and reflects only a single year. We want to update the data behind those indicators to track progress over time, but doing that by hand would require a lot of work that is probably error-prone.

Alternatively, I imagine using Dat to create pipelines that munge and reshape the source data behind these indicators. These pipelines would abstract away parameters like year and geography while enshrining the methodologies behind the indicators, making them reusable. Is this an appropriate use case? Are there examples of modules written for this purpose? Or am I completely misunderstanding what Dat is intended for?
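For what I mean by "abstracting away year and geography", something like the following sketch: the indicator methodology lives in one transform, and the parameters are just arguments. The record fields (`year`, `geoid`) and the assumption of an upstream parser emitting one object per row are both hypothetical.

```ts
// Sketch of a reusable, parameterized pipeline step: the indicator
// methodology is fixed in code, while year and geography are parameters.
// Assumes an upstream parser emits one parsed record (object) per chunk;
// the field names "year" and "geoid" are invented for illustration.
import { Transform } from "stream";

interface IndicatorParams {
  year: number;
  geoids: Set<string>; // e.g. county FIPS codes
}

function indicatorFilter({ year, geoids }: IndicatorParams): Transform {
  return new Transform({
    objectMode: true,
    transform(row: any, _enc, done) {
      // Keep only rows for the requested year and geographies.
      if (row.year === year && geoids.has(row.geoid)) this.push(row);
      done();
    },
  });
}

// Usage: the same module runs for 2013 or 2014 without touching the methodology.
// source.pipe(indicatorFilter({ year: 2014, geoids: new Set(["25025"]) })).pipe(sink);
```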