Sheets parameter for the load processor

roll commented 4 years ago

fixes https://github.com/frictionlessdata/datapackage-pipelines/issues/188

roll commented 4 years ago

Hi @akariv @cschloer,

It's my second take on this issue. The first attempt was https://github.com/datahq/dataflows/pull/110

The new one uses tabulator's workbook_cache argument for XLSX which allows downloading every workbook only once - https://github.com/frictionlessdata/tabulator-py#xlsxlsx-read--write

I think dataflows is a better place for high-level logic as sheets management and cache control so my current idea that tabulator just needs to provide enough low-level tools to be managed by dataflows

cschloer commented 4 years ago

I will take a look!

akariv commented 4 years ago

Hey @roll -

Wouldn't it be better if there was a standard way to open a 'container' kind of file - for example, an Excel file or Google Spreadsheet with multiple sheets or a ZIP file with multiple files. Then you would be able to iterate on them with a consistent API.

This implementation basically re-opens the Excel file for each of the sheets, reads a sample, infers a schema - and then checks to see if the sheet name adheres to the sheets regular expression... we might have solved the re-downloading problem but it's still very inefficient.

I'm thinking we could have a generic class similar to Stream, which would be Container:

>>> container = tabulator.Container('path/to/excel/file.xlsx')
# OR 
>>> container = tabulator.Container('path/to/archive/file.zip')
>>> for item in container.iter():
...  print(item.name) # could be sheet name or filename in zipfile
...  print(item.options) # dict of options to be used in Stream, e.g. '{"sheet": 1}' or {"compression": "zip"} etc.
... stream = item.stream(**other_options)  # Returns a Stream object

then you could do:

FILENAME = 'path/to/excel/file.xlsx'
Flow(*[
   load(FILENAME, headers=1, name='res-%d' % idx, **item.options)
   for i, item in enumerate(Container(FILENAME).iter())
]).process()

roll commented 4 years ago

@akariv Yes sure, I agree. And once I have time I will do something like this (we have some container concepts already like Package or Storage; maybe we can use them; may some new one). BTW, I'm not sure the currently used technique is really a noticeable overhead as ZIP-files should support proper random-access.

Aside from implementation, what do you think the best API will be for DPP? For dataflows it's easy to imaging but we need it also in a declarative form.

akariv commented 4 years ago

To answer the DPP question - off the top of my head. We could have a load_batch processor (in dataflows and in DPP) with the following parameter:

selection - list of REs / predicate functions to control filename/sheet selection

Internally it will iterate on the different parts (using the tabulator.Collection object ! 😄 ) and will generate a bunch of loads as necessary.

Example selections:

["https://path/to/file.xls", "Expenses.+"] Select all expenses sheets from an excel file on the web
["/path/to/file.zip", ".+\.csv"] Select all csv files in a zip file
["https://path/to/file.zip", "report.+\.xlsx", "Raw Data"] Select the 'Raw Data' sheet in all report excel files in a zip file on the web

resource name can be a slug of the filenames/sheetnames of the different parts combined (we can complicate it later)

roll commented 4 years ago

@akariv Cool, it sounds good. I doubt I will have enough time to implement tabulator.Collection/etc before the end of the current stage of the BCO-DMO pilot so I would suggest:

to agree on load_batch API (@cschloer please take a look if it makes sense for you)
I will implement the first version of load_batch probably using the current state of tabulator
We will switch to tabulator.Collection on the next iteration (only implementation-wise so it will not be breaking for users) WDYT?

cschloer commented 4 years ago

Our custom load processor already has something relatively akin to "load_batch" - it can take a comma separated list of urls or a regular expression (for local or s3) url, and then it just generates a bunch of load steps for each resulting URL. But if your implementation will improve the load times for multiple sheets within an xlsx file, I am happy to switch over to your implementation. :+1:

datahq / dataflows

Sheets parameter for the load processor #138