Open saylorsd opened 8 years ago
I feel like in that case the extract method should return an iterable of one item. It doesn't necessarily need to open the file or anything; it's just that file objects in Python are iterables that have a next() method that allows them to be run through line by line.
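To make the "iterable of one item" idea concrete, here is a minimal sketch. BinaryFileExtractor and the run() helper are hypothetical stand-ins, not the project's actual classes; the only assumption is that the pipeline loops over whatever extract() returns.

```python
import io


class BinaryFileExtractor:
    """Hypothetical extractor: yields the whole file once instead of line by line."""

    def __init__(self, fileobj):
        self.fileobj = fileobj

    def extract(self):
        # A generator with a single yield is an iterable of exactly one item.
        yield self.fileobj.read()


def run(extractor):
    # Stand-in for the pipeline loop: it only needs something it can iterate over.
    return [len(item) for item in extractor.extract()]


payload = b"%PDF-1.4 fake bytes"
sizes = run(BinaryFileExtractor(io.BytesIO(payload)))
```

With this shape the loop body runs exactly once per binary file, so nothing downstream has to know the file was never split into lines.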
@bsmithgall Does it make sense to bypass the use of marshmallow's schema.load(), which is used in pipeline.hand_line()? I was thinking of creating another class under schema.py that isn't a marshmallow schema but has a load() method that does what we need for binary files. This way pipeline.run() can stay the same.
Does this make sense? Do you have something else in mind?
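A rough sketch of what that could look like, assuming the pipeline only ever calls load() on whatever schema it is given. BinaryFileSchema, its return shape, and the load_line() helper below are all made up for illustration; they are not marshmallow's API or the project's real code.

```python
class BinaryFileSchema:
    """Hypothetical non-marshmallow 'schema' that only exposes load()."""

    def load(self, data):
        # No field validation: just wrap the raw bytes in a record.
        if not isinstance(data, (bytes, bytearray)):
            raise TypeError("expected raw bytes")
        return {"content": bytes(data), "size": len(data)}


def load_line(schema, data):
    # Stand-in for the pipeline step that calls schema.load().
    return schema.load(data)


record = load_line(BinaryFileSchema(), b"\x89PNG")
```

Because the caller relies only on the presence of load(), the pipeline code would not need to change, which is the point of the proposal.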
You could attach a method to the BaseSchema called skip_load or something which defaults to false and is then overridden by the schema.
handle_line is a method on an Extractor, so it shouldn't be a huge problem as long as there is an iterable to process in the pipeline's run method.
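Something along these lines, as a sketch only. It uses a plain class attribute rather than a method for skip_load, and BaseSchema here is a bare stand-in (the real one is a marshmallow schema); load_line() is likewise a hypothetical helper.

```python
class BaseSchema:
    """Stand-in for the project's BaseSchema (assumed; not the marshmallow class)."""

    skip_load = False  # default: run the normal load step

    def load(self, data):
        return {"value": data.strip()}


class BinarySchema(BaseSchema):
    skip_load = True  # binary payloads bypass load() entirely


def load_line(schema, data):
    # Hypothetical pipeline step: consult the flag before calling load().
    if schema.skip_load:
        return data
    return schema.load(data)


text_result = load_line(BaseSchema(), " a \n")
binary_result = load_line(BinarySchema(), b"\x00\x01")
```

The trade-off versus a separate non-marshmallow class is that the pipeline gains one conditional, but every schema keeps the same base class.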
Sounds good! I'll try that. And my bad, I meant pipeline.load_line().
Ah. It would also not be out of the question to make a special BinaryFilePipeline which subclassed pipeline and overrode specific methods.
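For comparison, the subclass route might look roughly like this. The Pipeline class below is a simplified stand-in with assumed method names (extract, load_line, run), not the project's actual implementation.

```python
class Pipeline:
    """Simplified stand-in for the line-oriented pipeline (assumed interface)."""

    def __init__(self, source):
        self.source = source

    def extract(self):
        # Default behaviour: break the input down line by line.
        return iter(self.source.splitlines())

    def load_line(self, line):
        return line.decode("utf-8").upper()

    def run(self):
        return [self.load_line(item) for item in self.extract()]


class BinaryFilePipeline(Pipeline):
    """Overrides only the steps that assume line-oriented text."""

    def extract(self):
        # Hand the whole payload over as a single item.
        return iter([self.source])

    def load_line(self, line):
        # No schema loading for binary data; pass it through untouched.
        return line


text_result = Pipeline(b"a\nb").run()
binary_result = BinaryFilePipeline(b"\x00\n\x01").run()
```

Here run() is inherited unchanged, so only the two steps that care about lines differ between the pipelines.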
:question: @bsmithgall
For files that don't lend themselves to data extraction (PDFs, images, etc.), what should we do about the extract() part of the pipeline? From what I can tell here, it seems like the current process requires that the input file be broken down line by line.
Would it make sense to query pipeline._extractor to see if it handles lines, or just how it generally handles data, and go from there? This is also something that can probably wait, as most of our current datasets will work just fine with the currently implemented loader.