WPRDC / wprdc-etl

add marshmallow experiment for schema, pipeline #10

Closed bsmithgall closed 8 years ago

bsmithgall commented 8 years ago

This commit adds a dependency on Marshmallow[1], a lightweight serialization/deserialization library that can also handle validation, required fields, etc. Additionally, I have made an attempt to set up simple pipeline building through method chaining. For example, building a pipeline might look like this:

my_pipeline = Pipeline().extractor(CsvExtractor).schema(MySchema).load(Datapusher)

This instantiates a new Pipeline, which can then be either scheduled or run as expected. That functionality is to be added later. I've added a sample job, taken from the Fatal OD job in the existing WPRDC codebase, as an example of what this would look like (and a skeleton test for it as well).
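
To make the chaining idea concrete, here is a minimal sketch of how such a builder could work. It is not the code in this PR: the `run` method, the `ListLoader` class (a stand-in for Datapusher), and the hand-rolled `MySchema.validate` (a stand-in for a Marshmallow schema) are all assumptions made up for illustration, and only the standard library is used.

```python
import csv
import io


class Pipeline:
    """Hypothetical chainable builder; not the actual wprdc-etl class."""

    def __init__(self):
        self._extractor = None
        self._schema = None
        self._loader = None

    def extractor(self, extractor_cls):
        self._extractor = extractor_cls
        return self  # returning self is what allows the calls to be chained

    def schema(self, schema_cls):
        self._schema = schema_cls
        return self

    def load(self, loader_cls):
        self._loader = loader_cls
        return self

    def run(self, source):
        """Assumed entry point: extract rows, validate each one, then load it."""
        parts = (self._extractor, self._schema, self._loader)
        if any(part is None for part in parts):
            raise RuntimeError("pipeline needs an extractor, a schema and a loader")
        schema, loader = self._schema(), self._loader()
        for raw_row in self._extractor().extract(source):
            loader.load(schema.validate(raw_row))
        return loader


class CsvExtractor:
    """Yields one dict per CSV row from a file-like object."""

    def extract(self, source):
        yield from csv.DictReader(source)


class MySchema:
    """Stand-in for a Marshmallow schema: requires fields and casts types."""

    required = ("name", "age")

    def validate(self, row):
        missing = [field for field in self.required if not row.get(field)]
        if missing:
            raise ValueError("missing required fields: %s" % ", ".join(missing))
        return {"name": row["name"], "age": int(row["age"])}


class ListLoader:
    """Stand-in for Datapusher: collects validated rows in memory."""

    def __init__(self):
        self.rows = []

    def load(self, row):
        self.rows.append(row)


my_pipeline = Pipeline().extractor(CsvExtractor).schema(MySchema).load(ListLoader)
result = my_pipeline.run(io.StringIO("name,age\nAda,36\nGrace,45\n"))
print(result.rows)
```

Passing classes rather than instances, as in the one-liner above, lets the pipeline decide when to construct each stage, which may be convenient once scheduling is added.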

[1] http://marshmallow.readthedocs.org/en/latest/

@saylorsd please take a look at this and let me know what you think.

bsmithgall commented 8 years ago

@saylorsd I reverted some of the changes that you made back to their previous versions (which is in line with my comments from before). I've also dropped the previous dependencies. I've tried to explain the logic in the above commit message/PR but please let me know if you have any questions. I'm still traveling a bit this week but should be relatively available by email/phone/here if you want to talk more about this proposed change in depth.

saylorsd commented 8 years ago

@bsmithgall this looks/sounds great. I'll take an in-depth look at it, hopefully starting today, and let you know if I have any specific questions.

bsmithgall commented 8 years ago

:+1: It would be really great if you could dump some sample data into a new directory at test/pipelines/mock so that we can start building out full-on tests for these pipelines.