Closed saylorsd closed 8 years ago
xlrd is also quite good.
@bsmithgall Whenever you have a chance, can you take a look at 5adb251f5c75aeb875be8c42c46249bee5df1ece?
In the process of getting the ExcelExtractor
to work, I ended up making a few changes. Primarily I moved the data loading work that was in the pipeline to the extractor this way we don't have to make any assumptions of how the extractor deals with lines of data.
I'm able to use the ExcelExtractor
with SFTPFiles
created by the SFTPConnector
but have been racking my brain to come up with a connection-independent solution (you'll see my tests fail when they use a local file).
I'm not a real fan of that implementation change -- you now have two full copies of the data (one on the Extractor
object and one on the Pipeline
object). I would recommend having the extractor keep in one element of an Iterable
(as delivered through iteration in the run
method).
Can you run some one-offs and post traceback? It's hard to tell why ExcelExtractor
would fail with a FileTarget
@bsmithgall I see what you're getting at there. I've modified the ExcelExtractor
to create a list of rows pretty much and then had process_connection()
return iter(that_list)
.
I didn't explain the second part well, sorry about that. The issue was more about not being able to pass SFTPFile
s into xlrd
as it's not subscriptable so I had to read them to pass them in in binary. When done with TextIOWrapper
objects it was causing an issue. I think I fixed this by having FileConnector
read files as binary when encoding=None
.
However, now (5ca91700f545b984c1bf0a818b3db73b0fba017f) I'm trying to get process_connection
to work properly for ExcelExtractor
. My problem is i can't think of a way to read it line by line without keeping an extra copy of the whole thing as a Sheet
object in the Extractor
and then treating that as an Iterable
with some helper functions. From what i understand about the file format, I'll have to pass the whole file into xlrd.open_workbook
, so I can't iterate over the raw file like text and CSVs.
Any ideas?
@bsmithgall I understand you're probably super busy right now - so no rush - but whenever you have a chance can you take a look at fa070e92aa0a9aef8fcf768c9ebc4979d8963434? I changed the SFTPConnector
to pretty much convert the SFTPFile
into a io.BytesIO
to pretty much copy the file over to memory and make operations on it simpler and more like the other connector.
+1 that seems like a great idea
On Thu, Feb 18, 2016 at 10:09 AM, Steven Saylor notifications@github.com wrote:
@bsmithgall https://github.com/bsmithgall I understand you're probably super busy right now - so no rush - but whenever you have a chance can you take a look at fa070e9 https://github.com/UCSUR-Pitt/wprdc-etl/commit/fa070e92aa0a9aef8fcf768c9ebc4979d8963434? I changed the SFTPConnector to pretty much convert the SFTPFile into a io.BytesIO to pretty much copy the file over to memory and make operations on it simpler and more like the other connector.
— Reply to this email directly or view it on GitHub https://github.com/UCSUR-Pitt/wprdc-etl/issues/34#issuecomment-185761783 .
One of our upcoming datasets is coming to us, in part, as an excel xlsx file. It looks like openpyxl will be helpful.