airbytehq / airbyte


Source File: check failing on gzipped CSV using the HTTPS provider #21573

Open sh4sh opened 1 year ago

sh4sh commented 1 year ago
## Environment

- **Airbyte version**: 0.40.28
- **OS Version / Instance**: MacOS
- **Deployment**: Docker
- **Source Connector and version**: source-file 0.2.33
- **Destination Connector and version**: n/a
- **Step where error happened**: Setup new connection check

## Current Behavior

The check fails on the File source connector when it is configured to extract a gzipped CSV using the HTTPS provider. I also tested the local provider, where this behaviour did not occur. I did not test S3, GCP or other providers.

I added `{"compression": "gzip"}` to Reader options as per the [read_csv docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#quoting-compression-and-file-format) and the [examples in the File connector doc](https://docs.airbyte.com/integrations/sources/file/#examples).

Got a parsing error:

```
The connection tests failed.
Failed to load https:// Please check File Format and Reader Options are set correctly ConfigurationError('File <_io.TextIOWrapper name=\'uc\' encoding=\'UTF-8\'> can\'t be parsed with reader of chosen type (csv)
```

## Expected Behavior

Check should pass

## Logs
```
Error: Failed to load {{URL here}} Please check File Format and Reader Options are set correctly ConfigurationError('File <_io.TextIOWrapper name=\'\' encoding=\'UTF-8\'> can\'t be parsed with reader of chosen type (csv)\nTraceback (most recent call last):\n File "/airbyte/integration_code/source_file/client.py", line 328, in load_dataframes\n yield from reader(fp, **reader_options)\n File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv\n return _read(filepath_or_buffer, kwds)\n File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read\n parser = TextFileReader(filepath_or_buffer, **kwds)\n File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__\n self._engine = self._make_engine(f, self.engine)\n File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1236, in _make_engine\n return mapping[engine](f, **self.options)\n File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__\n self._reader = parsers.TextReader(src, **kwds)\n File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__\n File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header\n File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows\n File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error\nUnicodeDecodeError: \'utf-8\' codec can\'t decode byte 0x8b in position 1: invalid start byte\n')
Traceback (most recent call last):
  File "/airbyte/integration_code/source_file/client.py", line 328, in load_dataframes
    yield from reader(fp, **reader_options)
  File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1236, in _make_engine
    return mapping[engine](f, **self.options)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/airbyte/integration_code/source_file/source.py", line 112, in check
    list(client.streams)
  File "/airbyte/integration_code/source_file/client.py", line 422, in streams
    "properties": self._stream_properties(fp),
  File "/airbyte/integration_code/source_file/client.py", line 404, in _stream_properties
    for df in df_list:
  File "/airbyte/integration_code/source_file/client.py", line 337, in load_dataframes
    raise ConfigurationError(error_msg) from err
source_file.client.ConfigurationError: File <_io.TextIOWrapper name='' encoding='UTF-8'> can't be parsed with reader of chosen type (csv)
Traceback (most recent call last):
  File "/airbyte/integration_code/source_file/client.py", line 328, in load_dataframes
    yield from reader(fp, **reader_options)
  File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1236, in _make_engine
    return mapping[engine](f, **self.options)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
```
## Steps to Reproduce

1. Create or find a .csv.gz file (I ran `gzip addresses.csv` on a [sample csv](https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html)) and [store it somewhere accessible by HTTPS](https://docs.airbyte.com/integrations/sources/file/#storage-providers).
2. Configure the File source connector as follows:
   - File format: CSV
   - Storage provider: HTTPS
   - Reader options: `{"compression": "gzip"}`
3. Check connection fails.
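
For context on the `UnicodeDecodeError` in the logs: every gzip stream starts with the magic bytes `0x1f 0x8b`, and `0x8b` is not a valid UTF-8 start byte. Decoding the compressed bytes as UTF-8 text therefore fails at position 1, which matches the traceback above. The sketch below reproduces this with the standard library only. It is an illustration of the symptom, not the connector's code; the assumption (suggested by the `_io.TextIOWrapper` in the error message, but not confirmed here) is that the HTTPS provider hands the reader a text-mode handle over still-compressed bytes. The helper names `read_as_text` and `read_gzipped_csv` are hypothetical.

```python
import csv
import gzip
import io

# A tiny gzipped CSV held in memory, standing in for the remote .csv.gz file.
raw = gzip.compress(b"name,city\nAda,London\n")

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
assert raw[:2] == b"\x1f\x8b"


def read_as_text(buf: bytes) -> str:
    """Decode the raw (still compressed) bytes as UTF-8 text."""
    return io.TextIOWrapper(io.BytesIO(buf), encoding="utf-8").read()


def read_gzipped_csv(buf: bytes) -> list:
    """Decompress first, then decode and parse the CSV rows."""
    with gzip.open(io.BytesIO(buf), mode="rt", encoding="utf-8", newline="") as fp:
        return list(csv.reader(fp))


# Treating the compressed bytes as text fails on the second magic byte.
error = None
try:
    read_as_text(raw)
except UnicodeDecodeError as err:
    error = err

print(error)
print(read_gzipped_csv(raw))
```

If that assumption holds, a fix would need the file opened in binary mode (or decompressed before decoding) so that the `compression` reader option sees raw bytes.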
potatozerg commented 1 year ago

Please please please, really need this one. Thanks!