NYCComptroller / Checkbook

Source codes, data, and instructions for Checkbook
https://checkbooknyc.com/
Other
49 stars 20 forks source link

Greenplum database dependency in ETL process. #19

Open kfogel opened 11 years ago

kfogel commented 11 years ago

The "TL" part of the ETL process for loading data into Checkbook currently requires certain features of the Greenplum database. While Greenplum is proprietary, there is a free-for-noncommercial-use "community edition" available as of this writing, so this situation is not a showstopper. Removing this proprietary dependency would be a useful enhancement, however, to enable Checkbook to run entirely on open source software.

At one point I asked some Checkbook developers about this:

What are the Greenplum-specific parts of the ETL process (or of the "TL" part of that process, at least)? Is it just the "distributed by" clauses in the "create table" commands in the SQL files, for example in source/database/ETL/CREATE_NEW_DATABASE/NYCCheckbookETL_DDL.sql? Or does the dependency go deeper than that? (Postgres-XC supports a "distribute by" clause; one of the reasons I'm asking is to figure out if trying Postgres-XC as a substitute for Greenplum is worth a look.)

Tirupati Reddy answered that the three features of Greenplum DB used in ETL Scripts that are not there in PostgreSQL DB are:

  1. Concept of external tables for loading CSV data
  2. The "Distributed By" feature to distribute the data in different nodes
  3. Using Columnar storage feature on big tables to make queries run faster when filtering data on some columns.

Tirupati said that changing the scripts to remove the need for (2) and (3) is probably not too hard, but (1) would take some time.

treddy commented 10 years ago

Now we have the database scripts and ETL Scripts that will work with the open source postgreSQL database version 9.2x or later. please find the document Creating new Database and running ETL Job_PostgreSQL_Database.doc in documentation folder of checkbook_nyc branch.

kfogel commented 10 years ago

This is exciting news -- I will take a look ASAP! Thank you.

Best, -Karl

kfogel commented 10 years ago

Just curious -- why is this in checkbook_nyc branch instead of either checkbook_edc or master? https://github.com/NYCComptroller/Checkbook/wiki/Branch-Management indicates that one of the latter two would be the place to put it.

It's not a big deal -- obviously, those of us who want to look at it can find it on whatever branch it's on. But it's a bit easier if new things happen on the designated branches, so they are noticeable to people watching those branches.

treddy commented 10 years ago

As of now there is no test data generated for EDC. I will generate the test data after the UAT is done. And then I will check-in the database scripts and ETL job for EDC database in checkbook_edc branch.

kfogel commented 10 years ago

Ah, I understand now, thanks. Yes, it seems reasonable that the documentation lives where the test data lives. I'll take a look at it on checkbook_nyc then. Best of luck with UAT.