Bug when reading `.xlsx` files. Excel files not properly tapped and no output with `ERROR Unable to write Catalog entry for 'filexlsx' - it will be skipped due to error File is not a zip file` #71
```
❯ meltano run tap-spreadsheets-anywhere target-jsonl
2023-11-07T12:18:26.121508Z [info ] Environment 'dev' is active
2023-11-07T12:18:26.214962Z [warning ] A state file was found, but it will be ignored as the extractor does not advertise the `state` capability
2023-11-07T12:18:26.925482Z [warning ] A catalog file was found, but it will be ignored as the extractor does not advertise the `catalog` or `properties` capability
2023-11-07T12:18:26.925679Z [warning ] A state file was found, but it will be ignored as the extractor does not advertise the `state` capability
2023-11-07T12:18:27.382167Z [info ] INFO Generating catalog through sampling. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382454Z [info ] INFO Walking /Users/alexis.vialaret/vscode_projects/EDA_Accelerator/data. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382596Z [info ] INFO Found 2 files. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382683Z [info ] INFO Checking 2 resolved objects for any that match regular expression ".xlsx" and were modified since 1970-01-01 00:00:00+00:00 cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382901Z [info ] INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.383281Z [info ] INFO Sampling billionaires_excel.xlsx (1000 records, every 5th record). cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.383774Z [info ] ERROR Unable to write Catalog entry for 'billionaires_excelxlsx' - it will be skipped due to error File is not a zip file cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.383932Z [info ] INFO Processing 0 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.446698Z [info ] Block run completed. block_type=ExtractLoadBlocks err=None set_number=0 success=True
```
I've investigated a bit, and the issue seems to come from the way the Excel file is passed to openpyxl in `excel_handler.py`:
```python3
[...]
def get_row_iterator(table_spec, file_handle):
    workbook = openpyxl.load_workbook(file_handle, read_only=True)
[...]
```
`file_handle` is an `_io.TextIOWrapper`, which `openpyxl.load_workbook` does not seem to accept.
Here is a minimum reproducible example:
```python3
import smart_open
from openpyxl import load_workbook

data = smart_open.open(
    'file:///path/to/file.xlsx',
    'rb',
    newline=None,
    errors='surrogateescape',
    encoding='utf-8'
)
print(data)

# This fails
workbook = load_workbook(data, read_only=True)

# This works
# workbook = load_workbook(data.buffer, read_only=True)

# This also works, but openpyxl will re-open the file to read it
# workbook = load_workbook(data.name, read_only=True)
```
```
zipfile.BadZipFile: File is not a zip file
```
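The same error seems to be reproducible with only the standard library, which would point at the text wrapper rather than at the file itself. The sketch below assumes that openpyxl hands the handle more or less directly to `zipfile` (I have not checked this in depth); the in-memory archive and the `can_open_as_zip` helper are made up for illustration and are not part of the tap.

```python
import io
import zipfile

# Build a tiny in-memory zip archive (an .xlsx file is a zip container).
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w") as archive:
    archive.writestr("hello.txt", "hi")
raw.seek(0)

# Wrap the binary stream in a text layer, similar to the handle the tap receives.
text_handle = io.TextIOWrapper(raw, encoding="utf-8", errors="surrogateescape")


def can_open_as_zip(handle):
    """Hypothetical helper: True if zipfile can read an archive from handle."""
    try:
        with zipfile.ZipFile(handle) as archive:
            archive.namelist()
        return True
    except zipfile.BadZipFile:
        return False


wrapper_ok = can_open_as_zip(text_handle)        # text wrapper
buffer_ok = can_open_as_zip(text_handle.buffer)  # underlying binary stream
print(wrapper_ok, buffer_ok)
```

If this assumption holds, the text wrapper fails because it cannot do the end-relative seeks that a zip reader needs to locate the central directory, while the underlying binary buffer can.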
It looks like a fix would be to pass `data.buffer` rather than just `data`. That works in my minimal example and solves the problem of the file not being tapped, but I lack the context to be sure this is a good idea.
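As a rough sketch of what such a fix could look like, the handle could be unwrapped defensively before it reaches openpyxl. The helper name `as_binary_handle` is hypothetical and does not exist in the tap, and I have not tested this against the other format handlers.

```python
import io


def as_binary_handle(file_handle):
    """Return the underlying binary stream if file_handle is a text wrapper.

    Hypothetical helper: handles that are already binary, or text streams
    without a buffer (such as io.StringIO), are returned unchanged.
    """
    if isinstance(file_handle, io.TextIOBase) and hasattr(file_handle, "buffer"):
        return file_handle.buffer
    return file_handle


# Possible use inside get_row_iterator (assumption, untested in the tap):
# workbook = openpyxl.load_workbook(as_binary_handle(file_handle), read_only=True)

raw = io.BytesIO(b"binary payload")
wrapped = io.TextIOWrapper(raw, encoding="utf-8", errors="surrogateescape")
print(as_binary_handle(wrapped) is raw)
```

Checking for the `buffer` attribute rather than calling `.buffer` unconditionally keeps the function safe if a handle is ever opened in true binary mode.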
I'm running into unexpected behaviour when trying to tap into an Excel file. It is being detected, but it is not actually read and nothing ends up in the output dir, seemingly due to a `File is not a zip file` error.
I'm running Python 10.0 on an M1 Mac with macOS 11.6.4 Big Sur and the following packages:
meltano.yml
Here is the file I'm testing with. It is a valid Excel file: billionaires_excel.xlsx
Shell output:
What do you think?