## Environment
- **Airbyte version**: 0.40.28
- **OS Version / Instance**: MacOS
- **Deployment**: Docker
- **Source Connector and version**: source-file 0.2.33
- **Destination Connector and version**: n/a
- **Step where error happened**: Setup new connection check
## Current Behavior
The connection check fails on the File source connector when it is configured to extract a gzipped CSV using the HTTPS provider.
The failure did not occur when I tested the local provider. I did not test S3, GCP or other providers.
I added `{"compression": "gzip"}` to Reader options as per the [read_csv docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#quoting-compression-and-file-format) and the [examples in the File connector doc](https://docs.airbyte.com/integrations/sources/file/#examples).
The check fails with a parsing error:
```
The connection tests failed.
Failed to load https:// Please check File Format and Reader Options are set correctly
ConfigurationError('File <_io.TextIOWrapper name=\'uc\' encoding=\'UTF-8\'> can\'t be parsed with reader of chosen type (csv)
```
## Expected Behavior
The connection check should pass.
## Logs
```
Error: Failed to load {{URL here}}
Please check File Format and Reader Options are set correctly
ConfigurationError('File <_io.TextIOWrapper name=\'\' encoding=\'UTF-8\'> can\'t be parsed with reader of chosen type (csv)\nTraceback (most recent call last):\n File "/airbyte/integration_code/source_file/client.py", line 328, in load_dataframes\n yield from reader(fp, **reader_options)\n File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv\n return _read(filepath_or_buffer, kwds)\n File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read\n parser = TextFileReader(filepath_or_buffer, **kwds)\n File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__\n self._engine = self._make_engine(f, self.engine)\n File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1236, in _make_engine\n return mapping[engine](f, **self.options)\n File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__\n self._reader = parsers.TextReader(src, **kwds)\n File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__\n File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header\n File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows\n File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error\nUnicodeDecodeError: \'utf-8\' codec can\'t decode byte 0x8b in position 1: invalid start byte\n')
Traceback (most recent call last):
File "/airbyte/integration_code/source_file/client.py", line 328, in load_dataframes
yield from reader(fp, **reader_options)
File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
self._engine = self._make_engine(f, self.engine)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1236, in _make_engine
return mapping[engine](f, **self.options)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/airbyte/integration_code/source_file/source.py", line 112, in check
list(client.streams)
File "/airbyte/integration_code/source_file/client.py", line 422, in streams
"properties": self._stream_properties(fp),
File "/airbyte/integration_code/source_file/client.py", line 404, in _stream_properties
for df in df_list:
File "/airbyte/integration_code/source_file/client.py", line 337, in load_dataframes
raise ConfigurationError(error_msg) from err
source_file.client.ConfigurationError: File <_io.TextIOWrapper name='' encoding='UTF-8'> can't be parsed with reader of chosen type (csv)
Traceback (most recent call last):
File "/airbyte/integration_code/source_file/client.py", line 328, in load_dataframes
yield from reader(fp, **reader_options)
File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
self._engine = self._make_engine(f, self.engine)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1236, in _make_engine
return mapping[engine](f, **self.options)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
```
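The `0x8b` byte at position 1 in the final `UnicodeDecodeError` is consistent with a gzip stream being decoded as UTF-8 text without being decompressed first, since every gzip stream starts with the magic bytes `0x1f 0x8b`. The following is a minimal standard-library sketch of that effect, not the connector's code; the sample CSV content and the helper name `decode_as_text` are assumptions made for illustration.

```python
import gzip
import io

# Hypothetical sample data; any CSV content behaves the same way.
csv_bytes = b"name,city\nJohn,Riverside\n"
gz_bytes = gzip.compress(csv_bytes)

# Every gzip stream begins with the magic bytes 0x1f 0x8b.
magic = gz_bytes[:2]


def decode_as_text(raw: bytes) -> str:
    """Read raw bytes through a UTF-8 text wrapper, with no decompression."""
    return io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8").read()


# Decoding the compressed bytes directly fails on the second magic byte.
try:
    decode_as_text(gz_bytes)
    error_message = None
except UnicodeDecodeError as err:
    error_message = str(err)

print(error_message)

# Decompressing first yields the original CSV text.
decompressed_text = decode_as_text(gzip.decompress(gz_bytes))
print(decompressed_text)
```

If this is what happens in the connector, the `compression` reader option is not taking effect on the text-mode handle (`_io.TextIOWrapper`) that appears in the error message. That is a hypothesis drawn from the traceback, not a confirmed diagnosis.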
## Steps to Reproduce
1. Create or find a `.csv.gz` file (I ran `gzip addresses.csv` on a [sample csv](https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html)) and [store it somewhere accessible by HTTPS](https://docs.airbyte.com/integrations/sources/file/#storage-providers).
2. Configure the File source connector as follows:
   - File format: CSV
   - Storage provider: HTTPS
   - Reader options: `{"compression": "gzip"}`
3. Run the connection check. It fails.
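For anyone reproducing step 1 without the `gzip` command-line tool, the fixture can also be built and checked with the Python standard library. This is a sketch under assumptions: the rows and the helper names `write_gzipped_csv` and `read_gzipped_csv` are made up for illustration, and only the file name `addresses.csv.gz` mirrors the report.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def write_gzipped_csv(path: Path, rows: list[list[str]]) -> None:
    """Write rows as CSV into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
        csv.writer(fh).writerows(rows)


def read_gzipped_csv(path: Path) -> list[list[str]]:
    """Decompress the file and parse it back into rows."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        return list(csv.reader(fh))


# Hypothetical rows standing in for the sample CSV.
sample_rows = [["name", "city"], ["John", "Riverside"], ["Jack", "SomeTown"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    fixture = Path(tmp_dir) / "addresses.csv.gz"
    write_gzipped_csv(fixture, sample_rows)

    # Sanity checks: the file carries the gzip magic bytes and round-trips.
    first_two_bytes = fixture.read_bytes()[:2]
    round_tripped = read_gzipped_csv(fixture)

print(first_two_bytes, round_tripped)
```

A file that passes both checks is a valid gzipped CSV, which helps rule out a malformed fixture before uploading it to an HTTPS location.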