**@roll** opened this issue 4 years ago
Hi @akariv @cschloer,
It's my second take on this issue. The first attempt was https://github.com/datahq/dataflows/pull/110
The new one uses `tabulator`'s `workbook_cache` argument for XLSX, which allows downloading every workbook only once - https://github.com/frictionlessdata/tabulator-py#xlsxlsx-read--write
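For reference, a minimal sketch of how the cache is shared between streams (the URL and sheet names below are just placeholders):

```python
from tabulator import Stream

# The same dict is passed to every Stream; tabulator downloads the
# workbook on the first open and reuses the cached copy afterwards.
workbook_cache = {}

for sheet in ('Sheet1', 'Sheet2'):  # placeholder sheet names
    with Stream('https://example.com/data.xlsx',
                sheet=sheet,
                workbook_cache=workbook_cache) as stream:
        print(stream.read())
```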
I think `dataflows` is a better place for high-level logic such as sheet management and cache control, so my current idea is that `tabulator` just needs to provide enough low-level tools to be managed by `dataflows`.
I will take a look!
Hey @roll -
Wouldn't it be better if there was a standard way to open a 'container' kind of file - for example, an Excel file or Google Spreadsheet with multiple sheets or a ZIP file with multiple files. Then you would be able to iterate on them with a consistent API.
This implementation basically re-opens the Excel file for each of the sheets, reads a sample, infers a schema - and then checks to see if the sheet name adheres to the `sheets` regular expression... we might have solved the re-downloading problem, but it's still very inefficient.
I'm thinking we could have a generic class similar to `Stream`, which would be `Container`:
```python
>>> container = tabulator.Container('path/to/excel/file.xlsx')
# OR
>>> container = tabulator.Container('path/to/archive/file.zip')
>>> for item in container.iter():
...     print(item.name)     # could be a sheet name or a filename in the zip file
...     print(item.options)  # dict of options to be used in Stream, e.g. {"sheet": 1} or {"compression": "zip"} etc.
...     stream = item.stream(**other_options)  # returns a Stream object
```
Then you could do:

```python
FILENAME = 'path/to/excel/file.xlsx'
Flow(*[
    load(FILENAME, headers=1, name='res-%d' % idx, **item.options)
    for idx, item in enumerate(Container(FILENAME).iter())
]).process()
```
@akariv Yes sure, I agree. And once I have time I will do something like this (we have some container concepts already, like Package or Storage; maybe we can reuse them, maybe add a new one). BTW, I'm not sure the currently used technique is really a noticeable overhead, as ZIP files should support proper random access.
Aside from implementation, what do you think the best API would be for DPP? For `dataflows` it's easy to imagine, but we also need it in a declarative form.
To answer the DPP question, off the top of my head:
We could have a `load_batch` processor (in dataflows and in DPP) with the following parameter:

- `selection` - a list of REs / predicate functions to control filename/sheet selection

Internally it will iterate on the different parts (using the `tabulator.Collection` object! 😄) and will generate a bunch of `load`s as necessary.
Example `selection`s:

- `["https://path/to/file.xls", "Expenses.+"]` - select all expenses sheets from an Excel file on the web
- `["/path/to/file.zip", ".+\.csv"]` - select all CSV files in a zip file
- `["https://path/to/file.zip", "report.+\.xlsx", "Raw Data"]` - select the 'Raw Data' sheet in all report Excel files in a zip file on the web

The resource name can be a slug of the filenames/sheet names of the different parts combined (we can complicate it later).
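To make the idea concrete, here is a rough sketch of how `load_batch` could expand a single-level `selection` into individual `load` steps. Everything here assumes the proposed, not-yet-existing `tabulator.Collection` API; nothing below is implemented:

```python
import re

from dataflows import Flow, load
from tabulator import Collection  # proposed above; does not exist yet

def load_batch(selection, **load_options):
    # Single-level selection only: [source, pattern]; nested selections
    # (e.g. zip -> xlsx -> sheet) are left out of this sketch.
    source, pattern = selection
    return Flow(*[
        load(source, name='res-%d' % idx, **item.options, **load_options)
        for idx, item in enumerate(Collection(source).iter())
        if re.match(pattern, item.name)
    ])
```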
@akariv Cool, it sounds good. I doubt I will have enough time to implement `tabulator.Collection`/etc before the end of the current stage of the BCO-DMO pilot, so I would suggest:

- agree on the `load_batch` API (@cschloer please take a look if it makes sense for you)
- implement `load_batch`, probably using the current state of `tabulator`
- switch to `tabulator.Collection` on the next iteration (only implementation-wise, so it will not be breaking for users)
WDYT?

Our custom load processor already has something relatively akin to `load_batch` - it can take a comma-separated list of URLs or a regular expression (for local or S3) URL, and then it just generates a bunch of load steps for each resulting URL. But if your implementation will improve the load times for multiple sheets within an xlsx file, I am happy to switch over to your implementation. :+1:
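For comparison, a loose sketch of the comma-separated-URL expansion described above. The helper name is hypothetical and the real BCO-DMO processor certainly differs (it also handles regexes and S3, which are omitted here):

```python
from dataflows import Flow, load

def load_from_url_list(urls, **load_options):
    # Hypothetical helper: expand 'url1, url2, ...' into one load step per URL.
    return Flow(*[
        load(url.strip(), name='res-%d' % idx, **load_options)
        for idx, url in enumerate(urls.split(','))
    ])

# e.g.:
# load_from_url_list('https://x.example/a.csv, https://x.example/b.csv').process()
```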