Chicago / metalicious

An open source data dictionary which can be deployed to track the metadata of one or more databases.

ETL scripts #30

Closed jpvelez closed 9 years ago

jpvelez commented 10 years ago

ETL scripts are programs that take data from a database/spreadsheet/data source, maybe transform it a bit, and upload it to a data portal for public consumption.

The City of Chicago has hundreds of these running all the time. That's how the data portal stays up to date: for the most part, people aren't manually transferring data.

Since this is City code, shouldn't it be open source? (Possible security issues here, but just spitballing.) You could imagine a repo that would have all the ETL scripts, and a little JSON file tying each ETL script to its data source on metalicious and its dataset on the portal.

This repo would help with ETL management. But there's more to it: if the ETL scripts were then linked to from metalicious, the data dictionary would provide complete transparency: here's what databases we have, here's where we make them public, and here's the code that does that. I imagine this would be most useful for other cities looking to start open data programs.
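To make the "little JSON file" idea concrete, here is a minimal sketch of what such a mapping might look like and how it could be read. Nothing here comes from the City's actual code: the key names (`script`, `metalicious_source`, `portal_dataset`), the file path, and the IDs are all hypothetical placeholders.

```python
import json

# Hypothetical manifest: every path and ID below is a made-up placeholder,
# not a real City of Chicago script, database, or portal dataset.
MANIFEST_JSON = """
[
  {
    "script": "etl/example_dataset.py",
    "metalicious_source": "example-database",
    "portal_dataset": "abcd-1234"
  }
]
"""

# Assumed schema: each entry ties one ETL script to one source and one dataset.
REQUIRED_KEYS = {"script", "metalicious_source", "portal_dataset"}


def load_manifest(text):
    """Parse the manifest and check that every entry has the assumed keys."""
    entries = json.loads(text)
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"manifest entry missing keys: {sorted(missing)}")
    return entries


def index_by_source(entries):
    """Group entries by data source, so a data dictionary page could list its ETL scripts."""
    index = {}
    for entry in entries:
        index.setdefault(entry["metalicious_source"], []).append(entry)
    return index


entries = load_manifest(MANIFEST_JSON)
by_source = index_by_source(entries)
```

With an index like `by_source`, a data dictionary entry for a database could link out to both the script that publishes it and the resulting portal dataset.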

tomschenkjr commented 10 years ago

Per open source: no. Those scripts contain connectivity information (server names) and, sometimes, API keys and login info.

But I think that's a useful suggestion for the Metalicious-as-a-platform idea. If someone were to deploy Metalicious within their organization, but without public access, it would be very viable.

Likewise, we may want to make some parts of Metalicious viewable only by specific users (e.g., through security roles) to help them manage their data platform without exposing all elements.

jpvelez commented 10 years ago

Fair enough. Although for what it's worth, you could just separate out the sensitive configuration details from the ETL scripts themselves, the way things are done for open source web apps.
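As a rough illustration of that separation, the sketch below reads connection details from environment variables instead of hard-coding them in the script. The variable names (`ETL_DB_SERVER`, `ETL_DB_USER`, `ETL_DB_PASSWORD`, `ETL_PORTAL_API_KEY`) are invented for this example and are not taken from any existing City script.

```python
import os

# Hypothetical setting names; a real deployment would choose its own.
REQUIRED_SETTINGS = [
    "ETL_DB_SERVER",
    "ETL_DB_USER",
    "ETL_DB_PASSWORD",
    "ETL_PORTAL_API_KEY",
]


def load_connection_settings(environ=os.environ):
    """Return the sensitive settings from the environment, failing loudly if any are absent."""
    missing = [name for name in REQUIRED_SETTINGS if name not in environ]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return {name: environ[name] for name in REQUIRED_SETTINGS}


# Usage with an explicit mapping (placeholder values), so the example runs
# without touching the real environment.
example_env = {
    "ETL_DB_SERVER": "db.example.invalid",
    "ETL_DB_USER": "etl_user",
    "ETL_DB_PASSWORD": "not-a-real-password",
    "ETL_PORTAL_API_KEY": "not-a-real-key",
}
settings = load_connection_settings(example_env)
```

The ETL logic itself would then only ever see the `settings` dictionary, so the script could be published while the values stay on the server.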

tomschenkjr commented 10 years ago

It'd be worthwhile if we could keep the sensitive stuff out of the repo through .gitignore, but everything is a bit too "baked in" to the code right now to fit that type of workflow.