Closed: @roll closed this issue 4 years ago
| Totals | |
|---|---|
| Change from base Build 359: | 0.2% |
| Covered Lines: | 1634 |
| Relevant Lines: | 1935 |
@akariv Please take a look (cc @cschloer)
@roll I have no special comments about the implementation; however, I don't think this is the right place to add this sort of support.
Imagine the scenario of a big Excel file with many sheets stored somewhere online. If the user uses this feature, the file will be downloaded n+1 times (where n is the number of sheets), and each sheet will be read and its schema inferred, all just to get a single sheet.
A tabulator-based solution could look like this: we create a `Streams` interface in tabulator, which will expose a bunch of `Stream`s (based on index and/or name). A `Streams` object could be created from a list of `Stream` creation options; it will create separate `Stream` objects, possibly sharing the Loader (thus avoiding the need to re-download the same file). Some changes to the existing loaders will have to be made in order to support use by more than one `Stream`.
Other creation options would be source based (e.g. for multi-sheet Excel/GSheet/ODS, multi-table sql etc.).
Naturally, `Streams` will also expose a 'get all stream names' method, which will be mapped to (for example) all sheet names when working with Excel.
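To make the idea concrete, here is a minimal sketch of what such an interface might look like. All names here (`Streams`, `SharedLoader`, the fake sheet data) are hypothetical illustrations; tabulator has no such API:

```python
# Hypothetical sketch of a Streams interface for tabulator.
# None of these classes exist in tabulator; this only illustrates how one
# shared loader could back several per-sheet streams without re-downloading.

class SharedLoader:
    """Fetches/opens the source once and caches the result."""

    def __init__(self, source):
        self.source = source
        self.download_count = 0
        self._data = None

    def load(self):
        if self._data is None:
            self.download_count += 1  # real code would download/open here
            self._data = {"Sheet1": [[1, 2]], "Sheet2": [[3, 4]]}
        return self._data


class Stream:
    """One tabular stream, bound to a single sheet of the shared source."""

    def __init__(self, loader, sheet):
        self.loader = loader
        self.sheet = sheet

    def read(self):
        return self.loader.load()[self.sheet]


class Streams:
    """Creates Stream objects by name, all sharing one loader."""

    def __init__(self, source):
        self.loader = SharedLoader(source)

    def names(self):
        # 'get all stream names' -> sheet names for Excel-like sources
        return list(self.loader.load().keys())

    def stream(self, name):
        return Stream(self.loader, name)


streams = Streams("book.xlsx")
rows = [streams.stream(name).read() for name in streams.names()]
print(streams.loader.download_count)  # -> 1: the source was loaded only once
```

The key design point is that every `Stream` holds a reference to the same `SharedLoader`, so reading all n sheets costs one fetch instead of n+1.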
wdyt?
@akariv I agree that it can be very inefficient.
The `tabulator.Streams` approach is probably the best option. Although maybe the problem could be solved with caching on the tabulator level, without introducing new APIs; not sure...
Anyway, @cschloer, for now I would say the only option is to use this code in your custom BCO-DMO processor (a `load`-like or a separate `load_multiple`-like processor), because the required changes to tabulator are massive and probably can't be done in this iteration of work. WDYT?
Hmm, okay. I've been resisting overriding the dataflows `load` in my custom load processor (so I can keep getting updates from new dataflows versions), but maybe I can figure out a way to get this working within it. And maybe something that avoids streaming all of the data and just gets the sheet names? Thanks for the ideas @roll, I'll update you when I have something and you can look it over if interested.
Just throwing this out here:
I was able to use the xlrd `open_workbook` function with `on_demand=True` so the entire spreadsheet isn't loaded. Then I use `sheet_names()` to get the sheet names and run a regex on them.
```python
import re

import xlrd

if parameters.get('sheet_regex', False):
    # Handle a regular expression sheet name
    xls = xlrd.open_workbook(url, on_demand=True)  # don't load sheet data yet
    sheet_names = xls.sheet_names()
    sheet_regex = parameters.pop('sheet', '')
    for sheet_name in sheet_names:
        if re.match(sheet_regex, sheet_name):
            pass  # ...add this sheet to the load...
```
@cschloer
The problem is that `tabulator` handles all data sources as abstract objects. For example, BCO-DMO software can have the knowledge that all its files are local and treat them accordingly, but `tabulator` has to support remote sources, byte streams, etc., and do so in a generalized way.
BTW, is it possible to run this code on the BCO-DMO side, and then just use the standard `load` processor?
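For example, the workbook could be fetched once to a local temp file, the matching sheet names collected there, and the standard `load` processor then pointed at the local copy once per sheet. This is a rough sketch; `fetch_matching_sheets` and `matching_sheet_names` are my own illustrative helpers, not an existing API:

```python
import re
import tempfile
import urllib.request


def matching_sheet_names(sheet_names, pattern):
    """Pure helper: keep the sheet names whose start matches the regex."""
    return [name for name in sheet_names if re.match(pattern, name)]


def fetch_matching_sheets(url, pattern):
    """Download the workbook once, then inspect sheet names locally.

    xlrd's on_demand=True needs a local path, so the remote file is
    fetched to a temp path first; this avoids the n+1 downloads.
    """
    import xlrd  # imported here so the sketch runs without xlrd installed

    with tempfile.NamedTemporaryFile(suffix='.xls', delete=False) as tmp:
        urllib.request.urlretrieve(url, tmp.name)
        xls = xlrd.open_workbook(tmp.name, on_demand=True)
        return tmp.name, matching_sheet_names(xls.sheet_names(), pattern)


# Each selected name can then be passed to the standard load processor
# against the returned local path, so the file is fetched only once.
print(matching_sheet_names(['data_2019', 'data_2020', 'notes'], r'data_'))
# -> ['data_2019', 'data_2020']
```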
I'm closing this for now as WONTFIX.
Hey, sorry, I didn't realize that `xlrd.open_workbook()` only works on local files. You are totally correct! This isn't too much of a limitation for us, as most of our files come from local paths, as you said. I've already added this to my own load processor and am able to keep using the dataflows `load`, so I am good to go :+1:
@roll We are moving our infrastructure to use remote files (s3 urls) instead of local file paths, so the solution I originally made no longer works (or rather, it loads the file n+1 times). Do you think we could revisit the solution suggested by @akariv of handling multiple Excel sheets within tabulator?