datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

Loading multiple excel sheets #110

Closed: roll closed this 4 years ago

roll commented 4 years ago
coveralls commented 4 years ago

Pull Request Test Coverage Report for Build 363


Totals Coverage Status:
- Change from base Build 359: 0.2%
- Covered Lines: 1634
- Relevant Lines: 1935

💛 - Coveralls
roll commented 4 years ago

@akariv Please take a look (cc @cschloer)

akariv commented 4 years ago

@roll I have no special comment about the implementation - however I think that it's not the right location for adding this sort of support.

Imagine the scenario of a big excel file with many sheets, stored somewhere online. If the user uses this feature, the file will be downloaded n+1 times (where n is the number of sheets), and each sheet will be read and its schema inferred - all for getting a single sheet.

A tabulator-based solution could look like this: we create a Streams interface in tabulator, which exposes a bunch of Stream objects (selected by index and/or name). A Streams object could be created from a list of Stream creation options - it would create separate Stream objects, possibly sharing the Loader (thus avoiding the need to re-download the same file). Some changes to the existing loaders would be needed to support use by more than one Stream. Other creation options would be source-based (e.g. multi-sheet Excel/GSheet/ODS, multi-table SQL, etc.). Naturally, Streams would also expose a 'get all stream names' operation, which would map to (for example) all sheet names when working with Excel. A rough sketch follows below.
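
A minimal sketch of what such an interface could look like (purely hypothetical - tabulator has no Streams class today, only Stream):

    # Hypothetical sketch only: Streams does not exist in tabulator.
    from tabulator import Stream

    class Streams:
        """Open several Stream objects over one source, ideally sharing
        a single Loader so the file is downloaded only once."""

        def __init__(self, source, sheets=None, **options):
            self._source = source
            self._sheets = sheets  # sheet names/indexes, or None for all
            self._options = options

        def names(self):
            # the 'get all stream names' operation, e.g. all sheet names
            # of an Excel workbook
            raise NotImplementedError

        def __iter__(self):
            # one Stream per sheet; a real implementation would reuse the
            # already-downloaded bytes instead of re-fetching per Stream
            for sheet in (self._sheets or self.names()):
                yield Stream(self._source, sheet=sheet, **self._options)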

wdyt?

roll commented 4 years ago

@akariv I agree that it can be very inefficient.

tabulator.Streams is probably the best option. Although maybe the problem could be solved with caching at the tabulator level, without introducing new APIs - but I'm not sure...
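
To illustrate the caching idea at the user level, here is a rough sketch (open_sheet_streams is a made-up helper, not tabulator API): download the workbook once, then open one Stream per sheet against the local copy.

    import shutil
    import tempfile
    import urllib.request

    from tabulator import Stream

    # Made-up helper: fetch the remote workbook a single time, then open
    # one tabulator Stream per sheet against the local copy, so the file
    # is downloaded once instead of n+1 times.
    def open_sheet_streams(url, sheet_names):
        tmp = tempfile.NamedTemporaryFile(suffix='.xlsx', delete=False)
        with urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        tmp.close()
        for name in sheet_names:
            # 'sheet' is tabulator's option for selecting an Excel sheet
            with Stream(tmp.name, sheet=name, headers=1) as stream:
                yield name, stream.read()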

Anyway, @cschloer, I would say that for now the only option is to use this code in your custom BCO-DMO processor (either overriding load or as a separate load_multiple-like processor), because the required changes to tabulator are massive and probably can't be done in this iteration of work. WDYT?

cschloer commented 4 years ago

Hmm, okay. I've been resisting overriding the dataflows load in my custom load processor (so I can keep getting updates from new dataflows versions), but maybe I can figure out a way to get this working within my custom load processor - ideally something that avoids streaming all of the data and just gets the sheet names. Thanks for the ideas @roll - I'll update you when I have something, and you can look it over if you're interested.

cschloer commented 4 years ago

Just throwing this out here:

I was able to use the xlrd open_workbook function with on_demand=True to avoid loading the entire spreadsheet, then used sheet_names() to get the sheet names and ran a regex on them:

    import re

    import xlrd

    if parameters.get('sheet_regex', False):
        # Handle a regular-expression sheet name.
        # on_demand=True opens the workbook lazily instead of parsing
        # every sheet up front (note: open_workbook expects a local path).
        xls = xlrd.open_workbook(url, on_demand=True)
        sheet_regex = parameters.pop('sheet', '')
        for sheet_name in xls.sheet_names():
            if re.match(sheet_regex, sheet_name):
                ...  # add this sheet to the load
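
For the "add this sheet to the load" part, one hedged possibility (assuming matching_sheets collects the matched names, and relying on dataflows' load forwarding extra options such as sheet to tabulator):

    from dataflows import Flow, load

    # One standard load step per matching sheet; matching_sheets is a
    # hypothetical list built by the regex loop above. Sheet names may
    # need slugifying to be valid datapackage resource names.
    flow = Flow(*[
        load(url, name=sheet_name, sheet=sheet_name)
        for sheet_name in matching_sheets
    ])
    flow.process()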
roll commented 4 years ago

@cschloer The problem with tabulator is that it handles all data sources as abstract objects. For example, the BCO-DMO software can know that all of its files are local and treat them accordingly, but tabulator has to support remote sources, byte streams, etc., and do so in a generalized way.

BTW, is it possible to run this code on the BCO-DMO side, and then just use the standard load processor?

roll commented 4 years ago

I'm closing this for now as WONTFIX.

cschloer commented 4 years ago

Hey, sorry - I didn't realize that xlrd.open_workbook() only works on local files. You are totally correct! This is not too much of a limitation for us, as most of our files come from local paths as you said. I've already added this to my own load processor and am able to keep using the dataflows load, so I am good to go :+1:

cschloer commented 4 years ago

@roll We are moving our infrastructure to start using remote files instead of local file paths (s3 urls), so the solution I originally made no longer works (or rather, it loads the file n+1 times). Do you think we could revisit the solution suggested by @akariv of handling multiple excel sheets within tabulator?