WPRDC / wprdc-etl

MIT License

Implement binary file loader #29

Open saylorsd opened 8 years ago

saylorsd commented 8 years ago

:question: @bsmithgall
For files that don't lend themselves to data extraction (PDFs, images, etc.), what should we do about the extract() part of the pipeline?

From what I can tell here, it seems like the current process requires that the input file be broken down line by line.

Would it make sense to query pipeline._extractor to see whether it handles lines, or how it handles data in general, and go from there?

This is also something that can probably wait, as most of our current datasets will work just fine with the currently implemented loader.

bsmithgall commented 8 years ago

I feel like in that case the extract method should return an iterable of one item. It doesn't necessarily need to open the file or anything; it's just that file objects in Python are iterables with a next() method that lets them be run through line by line.
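A minimal sketch of the difference, to make the "iterable of one item" idea concrete. The function names `extract_lines` and `extract_binary` and the file name `report.pdf` are hypothetical and not part of wprdc-etl; the real extractors may look quite different.

```python
import io


def extract_lines(fileobj):
    """Text-style extraction: a file object already yields one line at a time."""
    return fileobj


def extract_binary(path):
    """Binary-style extraction (assumed design): a one-item iterable.

    The file is not opened here; the single item is just the path,
    so the rest of the pipeline can still loop over the result.
    """
    return [path]


# A StringIO stands in for an open text file.
lines = list(extract_lines(io.StringIO("a,b\n1,2\n")))

# The binary case produces exactly one item for the pipeline to process.
items = list(extract_binary("report.pdf"))
```

In both cases the caller only needs a `for` loop, which is why the rest of the pipeline would not have to know which kind of file it is handling.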

saylorsd commented 8 years ago

@bsmithgall Does it make sense to bypass marshmallow's schema.load(), which is used in pipeline.handle_line()? I was thinking of creating another class in schema.py that isn't a marshmallow schema but has a load() method that does what we need for binary files. That way pipeline.run() can stay the same.

Does this make sense? Do you have something else in mind?
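Roughly what I have in mind, as a sketch only. `PassthroughSchema` and the simplified `run()` below are hypothetical stand-ins, not the actual classes in the repo; the point is just that anything with a load() method could be dropped in.

```python
class PassthroughSchema:
    """Hypothetical class: not a marshmallow schema, it only mimics load()."""

    def load(self, data):
        # No validation or deserialization; return the item unchanged.
        return data


def run(items, schema):
    """Simplified stand-in for the pipeline loop; it only calls schema.load()."""
    return [schema.load(item) for item in items]


# A single binary item passes through the loop untouched.
result = run([b"%PDF-1.4"], PassthroughSchema())
```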

bsmithgall commented 8 years ago

You could attach a method to the BaseSchema called skip_load or something, which defaults to False and is then overridden by the schema.

handle_line is a method on an Extractor, so it shouldn't be a huge problem as long as there is an iterable to process in the pipeline's run method.
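Something along these lines, as a rough sketch. `skip_load` is modeled here as a class attribute rather than a method for brevity, and the `BaseSchema`, `BinarySchema`, and `load_line` below are simplified stand-ins written for illustration, not the real marshmallow-based classes.

```python
class BaseSchema:
    """Stand-in for the project's BaseSchema (hypothetical, no marshmallow)."""

    skip_load = False  # default: run the normal load step

    def load(self, data):
        # A real schema would validate and deserialize here.
        return {"validated": data}


class BinarySchema(BaseSchema):
    """Schema for binary files: opts out of the load step."""

    skip_load = True


def load_line(schema, data):
    """Assumed shape of the pipeline's load step, honoring skip_load."""
    if schema.skip_load:
        return data
    return schema.load(data)


binary_result = load_line(BinarySchema(), b"\x89PNG")
text_result = load_line(BaseSchema(), "row")
```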

saylorsd commented 8 years ago

Sounds good! I'll try that. And my bad, I meant pipeline.load_line().

bsmithgall commented 8 years ago

Ah. It also wouldn't be out of the question to make a special BinaryFilePipeline that subclasses Pipeline and overrides specific methods.
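A sketch of what that subclass might look like. The `Pipeline` here is a deliberately tiny stand-in with an assumed constructor and a `loaded` list, not the real class; only the names `run`, `load_line`, and `BinaryFilePipeline` come from this thread.

```python
class Pipeline:
    """Minimal stand-in for the project's Pipeline (hypothetical signature)."""

    def __init__(self, items, schema_load):
        self.items = items              # whatever the extractor yields
        self.schema_load = schema_load  # callable applied to each item
        self.loaded = []

    def load_line(self, line):
        return self.schema_load(line)

    def run(self):
        for line in self.items:
            self.loaded.append(self.load_line(line))
        return self.loaded


class BinaryFilePipeline(Pipeline):
    """Overrides only the step that does not fit binary files."""

    def load_line(self, line):
        # Hand the item through untouched; no schema is involved.
        return line


text_rows = Pipeline(["1", "2"], int).run()
binary_rows = BinaryFilePipeline(["scan.pdf"], int).run()
```

The trade-off compared with a skip_load flag is that the special case lives in one subclass instead of a conditional inside the shared load step.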