ets / tap-spreadsheets-anywhere

GNU Affero General Public License v3.0
31 stars 63 forks source link

Bug when reading `.xlsx` files. Excel files not properly tapped and no output with `ERROR Unable to write Catalog entry for 'filexlsx' - it will be skipped due to error File is not a zip file` #71

Open AlexisVLRT opened 1 year ago

AlexisVLRT commented 1 year ago

I'm running into unexpected behaviour when trying to tap into an excel file.

It is being detected, however it's not actually read and nothing ends up in the output dir, seemingly due to an File is not a zip file error.

I'm running python 10.0 on an M1 OSX 11.6.4 Big Sur with the following packages:

❯ pip freeze
aiodocker==0.21.0
aiohttp==3.8.6
aiosignal==1.3.1
alembic==1.12.1
async-timeout==4.0.3
atomicwrites==1.4.1
attrs==23.1.0
certifi==2023.7.22
charset-normalizer==3.3.2
check-jsonschema==0.22.0
click==8.1.7
click-default-group==1.2.4
click-didyoumean==0.3.0
croniter==1.4.1
distlib==0.3.7
et-xmlfile==1.1.0
fasteners==0.19
filelock==3.13.1
flatten-dict==0.4.2
frozenlist==1.4.0
idna==3.4
importlib-resources==6.1.0
jdcal==1.4.1
Jinja2==3.1.2
jsonschema==4.19.2
jsonschema-specifications==2023.7.1
Mako==1.2.4
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
meltano==3.1.0
multidict==6.0.4
openpyxl==3.1.2
packaging==23.2
platformdirs==3.11.0
prompt-toolkit==3.0.36
psutil==5.9.6
Pygments==2.16.1
PyJWT==2.8.0
python-dateutil==2.8.2
python-dotenv==1.0.0
python-slugify==8.0.1
python-ulid==1.1.0
PyYAML==6.0.1
questionary==2.0.1
referencing==0.30.2
requests==2.31.0
rich==13.6.0
rpds-py==0.12.0
ruamel.yaml==0.17.21
ruamel.yaml.clib==0.2.8
six==1.16.0
smart-open==6.4.0
snowplow-tracker==1.0.1
SQLAlchemy==2.0.23
structlog==23.2.0
tabulate==0.9.0
termcolor==2.3.0
text-unidecode==1.3
typing_extensions==4.8.0
tzlocal==5.2
urllib3==2.0.7
virtualenv==20.24.6
wcwidth==0.2.9
yarl==1.9.2
yaspin==2.5.0

meltano.yml

version: 1
default_environment: dev
project_id: c37dc81d-e555-4e23-9f9f-0b6acbdfff86
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
    capabilities:
    - discover
    config:
      tables: [
        {
          "path": "file:///Users/alexis.vialaret/vscode_projects/EDA_Accelerator/data",
          "name": "billionaires_excelxlsx",
          "pattern": ".xlsx",
          "key_properties": [],
          "format": "excel",
          "worksheet_name": "Feuille 1",
          "start_date": "1970-01-01T00:00:00+00:00"
        }
      ]
  loaders:
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl

Here is the file I'm testing with. It is a valid excel file: billionaires_excel.xlsx

Shell output:

❯ meltano run tap-spreadsheets-anywhere target-jsonl

2023-11-07T12:18:26.121508Z [info     ] Environment 'dev' is active
2023-11-07T12:18:26.214962Z [warning  ] A state file was found, but it will be ignored as the extractor does not advertise the `state` capability
2023-11-07T12:18:26.925482Z [warning  ] A catalog file was found, but it will be ignored as the extractor does not advertise the `catalog` or `properties` capability
2023-11-07T12:18:26.925679Z [warning  ] A state file was found, but it will be ignored as the extractor does not advertise the `state` capability
2023-11-07T12:18:27.382167Z [info     ] INFO Generating catalog through sampling. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382454Z [info     ] INFO Walking /Users/alexis.vialaret/vscode_projects/EDA_Accelerator/data. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382596Z [info     ] INFO Found 2 files.            cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382683Z [info     ] INFO Checking 2 resolved objects for any that match regular expression ".xlsx" and were modified since 1970-01-01 00:00:00+00:00 cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382901Z [info     ] INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.383281Z [info     ] INFO Sampling billionaires_excel.xlsx (1000 records, every 5th record). cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.383774Z [info     ] ERROR Unable to write Catalog entry for 'billionaires_excelxlsx' - it will be skipped due to error File is not a zip file cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.383932Z [info     ] INFO Processing 0 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.446698Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True```

I've investigated a bit, and it seems like the issue might be coming from the way the excel file is passed to openpyxl in `excel_handler.py`:

```python3
[...]
def get_row_iterator(table_spec, file_handle):
    workbook = openpyxl.load_workbook(file_handle, read_only=True)
[...]

file_handle is an _io.TextIOWrapper which openpyxl.load_workbook does not seem to accept.

Here is a minimum reproducible example:

import smart_open
from openpyxl import load_workbook

data = smart_open.open(
    'file:///path/to/file.xlsx', 
    'rb', 
    newline=None, 
    errors='surrogateescape', 
    encoding='utf-8'
)
print(data)

# This fails
workbook = load_workbook(data, read_only=True)

# This works
# workbook = load_workbook(data.buffer, read_only=True)

# This also works, but openpyxl will re-open the file to read it
# workbook = load_workbook(data.name, read_only=True)

zipfile.BadZipFile: File is not a zip file

It looks like a fix would be to pass data.buffer rather than just data. That works in my minimal example and solves the problem of the file not being tapped, but I'm lacking context knowledge to be sure this is a good idea.

What do you think?