WPRDC / wprdc-etl

MIT License
8 stars 3 forks source link

Write xlsx extractor #34

Closed saylorsd closed 8 years ago

saylorsd commented 8 years ago

One of our upcoming datasets is coming to us, in part, as an excel xlsx file. It looks like openpyxl will be helpful.

bsmithgall commented 8 years ago

xlrd is also quite good.

saylorsd commented 8 years ago

@bsmithgall Whenever you have a chance, can you take a look at 5adb251f5c75aeb875be8c42c46249bee5df1ece?

In the process of getting the ExcelExtractor to work, I ended up making a few changes. Primarily I moved the data loading work that was in the pipeline to the extractor this way we don't have to make any assumptions of how the extractor deals with lines of data.

I'm able to use the ExcelExtractor with SFTPFiles created by the SFTPConnector but have been racking my brain to come up with a connection-independent solution (you'll see my tests fail when they use a local file).

bsmithgall commented 8 years ago

I'm not a real fan of that implementation change -- you now have two full copies of the data (one on the Extractor object and one on the Pipeline object). I would recommend having the extractor keep in one element of an Iterable (as delivered through iteration in the run method).

Can you run some one-offs and post traceback? It's hard to tell why ExcelExtractor would fail with a FileTarget

saylorsd commented 8 years ago

@bsmithgall I see what you're getting at there. I've modified the ExcelExtractor to create a list of rows pretty much and then had process_connection() return iter(that_list).

I didn't explain the second part well, sorry about that. The issue was more about not being able to pass SFTPFiles into xlrd as it's not subscriptable so I had to read them to pass them in in binary. When done with TextIOWrapper objects it was causing an issue. I think I fixed this by having FileConnector read files as binary when encoding=None.

However, now (5ca91700f545b984c1bf0a818b3db73b0fba017f) I'm trying to get process_connection to work properly for ExcelExtractor. My problem is i can't think of a way to read it line by line without keeping an extra copy of the whole thing as a Sheet object in the Extractor and then treating that as an Iterable with some helper functions. From what i understand about the file format, I'll have to pass the whole file into xlrd.open_workbook, so I can't iterate over the raw file like text and CSVs.

Any ideas?

saylorsd commented 8 years ago

@bsmithgall I understand you're probably super busy right now - so no rush - but whenever you have a chance can you take a look at fa070e92aa0a9aef8fcf768c9ebc4979d8963434? I changed the SFTPConnector to pretty much convert the SFTPFile into a io.BytesIO to pretty much copy the file over to memory and make operations on it simpler and more like the other connector.

bsmithgall commented 8 years ago

+1 that seems like a great idea

On Thu, Feb 18, 2016 at 10:09 AM, Steven Saylor notifications@github.com wrote:

@bsmithgall https://github.com/bsmithgall I understand you're probably super busy right now - so no rush - but whenever you have a chance can you take a look at fa070e9 https://github.com/UCSUR-Pitt/wprdc-etl/commit/fa070e92aa0a9aef8fcf768c9ebc4979d8963434? I changed the SFTPConnector to pretty much convert the SFTPFile into a io.BytesIO to pretty much copy the file over to memory and make operations on it simpler and more like the other connector.

— Reply to this email directly or view it on GitHub https://github.com/UCSUR-Pitt/wprdc-etl/issues/34#issuecomment-185761783 .