BetaNYC / citygram-services-nyc

Web content transformation proxies for open data APIs
https://citygram-services-nyc.herokuapp.com/
Extract, transform, load #7

Open volkanunsal opened 9 years ago

volkanunsal commented 9 years ago

The purpose of this issue is twofold: first, to start a discussion on how to transform existing datasets into the format that Citygram expects, and second, to recommend a few tools that are out there, including one that we are planning to incorporate into the Data Portal.

Bad Data Problems

The most likely problem we are going to face in taking in and transforming new datasets is that they will not be uniform and structured. They certainly won't match the format Citygram expects. The general solution in such cases is to write a script that cleans up the dataset and makes it digestible by our system.
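As a rough illustration of such a cleanup script, here is a minimal Python sketch that turns raw records into a GeoJSON feature collection. The input field names (`unique_key`, `complaint_type`, `latitude`, `longitude`) and the output shape (an `id`, a `properties.title`, and a point `geometry`) are assumptions made for illustration, not a confirmed description of what Citygram requires.

```python
def to_feature(record):
    """Convert one raw record (a dict) into a GeoJSON Feature, or None if unusable."""
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (KeyError, TypeError, ValueError):
        return None  # missing or malformed coordinates

    key = str(record.get("unique_key") or "").strip()
    if not key:
        return None  # no stable identifier, so the record is skipped

    # Hypothetical choice: fall back to a placeholder title when none is given.
    title = str(record.get("complaint_type") or "Untitled").strip()
    return {
        "type": "Feature",
        "id": key,
        "properties": {"title": title},
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
    }


def to_feature_collection(records):
    """Transform an iterable of raw records, dropping the ones that cannot be cleaned."""
    features = [f for f in map(to_feature, records) if f is not None]
    return {"type": "FeatureCollection", "features": features}
```

Records that cannot be cleaned are dropped silently here to keep the sketch short; a real script would probably want to log or count them.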

The second problem is maintaining the transformation scripts when the original data source changes.

The third problem is how to transform streaming data sources that deliver data in near real time.
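One common way to approach the streaming case is to process records one at a time instead of loading a whole file. The sketch below assumes the stream arrives as newline-delimited JSON, which is only an assumption for illustration; the function name `stream_records` is hypothetical.

```python
import json
import sys


def stream_records(lines):
    """Yield one parsed record per line, skipping blanks and lines that are not JSON objects."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skip it rather than stop the stream
        if isinstance(record, dict):
            yield record


if __name__ == "__main__":
    # Records are handled as they arrive on standard input.
    for record in stream_records(sys.stdin):
        print(json.dumps(record))
```

Because `stream_records` is a generator, memory use stays flat no matter how long the stream runs, and any per-record transformation can be applied inside the loop.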

Solutions

PGLoader is a command-line utility for importing data from various sources into a Postgres database using a SQL-like transformation script. It can deal with streaming data sources and does not choke when it encounters bad data; instead, it saves the bad rows to a file and gives you a discreet warning. PGLoader is production ready right now.
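The idea of setting bad rows aside instead of aborting can be illustrated independently of PGLoader. The following Python sketch shows the general pattern only; it is not PGLoader's implementation or command syntax, and the function name `load_with_rejects` and the required column names are hypothetical.

```python
import csv
import io


def load_with_rejects(source, reject_file, required=("id", "latitude", "longitude")):
    """Read CSV rows from `source`; return the good rows and log bad ones to `reject_file`."""
    reader = csv.DictReader(source)
    rejects = csv.writer(reject_file)
    good = []
    for row in reader:
        # A row is rejected when any required column is absent or blank.
        missing = [name for name in required if not (row.get(name) or "").strip()]
        if missing:
            # reader.line_num is the number of source lines read so far.
            rejects.writerow([reader.line_num, "missing: " + ",".join(missing)])
            continue
        good.append(row)
    return good


if __name__ == "__main__":
    source = io.StringIO("id,latitude,longitude\n1,40.7,-74.0\n2,,-73.9\n")
    reject_log = io.StringIO()
    rows = load_with_rejects(source, reject_log)
    print(len(rows), "rows loaded")
    print("rejected:", reject_log.getvalue().strip())
```

Keeping the rejects in a separate file means a single malformed row does not block the rest of the import, and the rejected rows can be inspected and fixed later.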

Dat is a project led by former Code for America fellow Max Ogden. It is an ambitious software project that is on track to become Git for Big Data. Even though it's still young, there are some live pilot projects right now. At BetaNYC, we are planning to integrate it into our data portal soonish. However, Dat might still be rough around the edges, and it will take some exploratory work to retrofit it to existing workflows. In the long run, this tool would be a great fit for streamlining the data intake process of Citygram.

fma2 commented 9 years ago

Thank you @volkanunsal! I'm looking into both of them

fma2 commented 9 years ago

Neither link was working.

Reposting-- http://pgloader.io/ http://dat-data.com/

volkanunsal commented 9 years ago

I fail at internetz. :smiley: