The question in this issue is twofold: first, to start a discussion on how to transform existing datasets into the format that Citygram expects, and second, to recommend a few tools that are out there, including one that we are planning to incorporate into the Data Portal.
Bad Data Problems
The most likely problem we are going to face in ingesting and transforming new datasets is that they will not be uniform and structured. They certainly won't fit the expectations of Citygram. The general solution in such cases is to write a script that cleans up the dataset and makes it digestible by our system.
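As a sketch of what such a cleanup script might look like, here is a minimal example in Python. The column names and date format are hypothetical, not taken from an actual Citygram dataset; the point is the pattern of normalizing each row and setting aside records that are too broken to salvage.

```python
import csv
from datetime import datetime

def clean_row(row):
    """Normalize one raw record into the shape our system expects.

    Returns None for rows that are missing fields or hold unparseable values.
    """
    try:
        return {
            # Strip stray whitespace from free-text fields.
            "title": row["Complaint Type"].strip(),
            # Normalize dates to ISO 8601.
            "created_at": datetime.strptime(
                row["Created Date"], "%m/%d/%Y").date().isoformat(),
            # Coerce coordinates to floats; empty strings become None.
            "latitude": float(row["Latitude"]) if row["Latitude"] else None,
            "longitude": float(row["Longitude"]) if row["Longitude"] else None,
        }
    except (KeyError, ValueError):
        return None

def clean_file(path):
    """Read a CSV file and return only the rows that survive cleanup."""
    with open(path, newline="") as f:
        rows = (clean_row(r) for r in csv.DictReader(f))
        return [r for r in rows if r is not None]
```

In practice the rejected rows would also be logged somewhere for inspection, so we can tell whether the script or the source data needs fixing.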
The second problem is maintaining the transformation scripts when the original data source changes.

The third problem is how to apply transformations to streaming sources that provide near-realtime data.
Solutions
PGLoader is a command-line utility for importing data from various sources into a Postgres database using a SQL-like transformation script. It can deal with streaming data sources and does not choke when it encounters bad data; instead it saves the rejected rows into a file and reports a warning. PGLoader is production-ready right now.
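For a rough sense of what these transformation scripts look like, here is a sketch of a pgloader load command for a hypothetical CSV of service requests. The file, database, and field names are made up; consult the pgloader documentation for the exact clause syntax.

```
LOAD CSV
     FROM 'service_requests.csv'
          HAVING FIELDS (id, created_at, complaint_type, latitude, longitude)
     INTO postgresql:///citygram?service_requests
     WITH skip header = 1,
          fields optionally enclosed by '"',
          fields terminated by ','
```

Rows that fail to load are written to a reject file rather than aborting the whole import, which matches the "bad data" behavior described above.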
Dat is a project led by former Code for America fellow Max Ogden. It is an ambitious software project that is on track to become Git for Big Data. Even though it's still young, there are some live pilot projects right now. At BetaNYC, we are planning to integrate it into our data portal soonish. However, Dat may still be rough around the edges, and it will take some exploratory work to retrofit it to existing workflows. In the long run, this tool would be a great fit for streamlining the data intake process of Citygram.