airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
15.22k stars 3.92k forks source link

[source-s3] python cdk csv parser does not support null_values when a column is an object/dict #36773

Open adam-bloom opened 4 months ago

adam-bloom commented 4 months ago

Connector Name

source-s3

Connector Version

4.4.0

What step the error happened?

During the sync

Relevant information

This is a CDK issue, not a connector issue - likely impacts multiple connectors.

Things start going south in the csv_format config spec: https://github.com/airbytehq/airbyte/blob/b27757f80e3a6c72636c8c3280a3e36dedfeaf2a/airbyte-cdk/python/airbyte_cdk/sources/file_based/config/csv_format.py#L111-L115 This defines a type of Set[str], but the default value is a list and not a set.

Jumping over some plumbing, this value is used here: https://github.com/airbytehq/airbyte/blob/b27757f80e3a6c72636c8c3280a3e36dedfeaf2a/airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/csv_parser.py#L239

Value has already been type casted. If you have mapped a column to an object, it is now a dict. null_values is typed again as Set[str], but is actually a list [] if the default is used. The problem is as follows:

  1. checking if a dict is in a list - works
    >>> {'test': 'foo'} in []
    False
  2. checking if a dict is in a set - fails
    >>> {'test': 'foo'} in set([])
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'dict'

If you pass any non-default config to null_values (from any connector) and have a column mapped as an object/dict, then you'll end up with errors in your connection log from this. See below for an example. Leaving that config option unset works as expected, since we're comparing to a list.

An easy fix would be to change the value in null_values check to be value in list(null_values). Additionally, passing a list to a parameter that is typed as a set is very confusing and should be resolved.

Relevant log output

2024-04-02 21:58:18 source > Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. stream=<stream name> file=<file name> line_no=0 n_skipped=0
Stack Trace: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py", line 90, in read_records_from_slice
    for record in parser.parse_records(self.config, file, self.stream_reader, self.logger, schema):
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/csv_parser.py", line 193, in parse_records
    yield CsvParser._to_nullable(
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/csv_parser.py", line 220, in _to_nullable
    nullable = {
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/csv_parser.py", line 221, in <dictcomp>
    k: None if CsvParser._value_is_none(v, deduped_property_types.get(k), null_values, strings_can_be_null) else v
  File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/file_based/file_types/csv_parser.py", line 228, in _value_is_none
    return value in null_values and (strings_can_be_null or deduped_property_type != "string")
TypeError: unhashable type: 'dict'

Contribute

marcosmarxm commented 4 months ago

@airbytehq/critical-connectors can this be considered in next grooming section? Thanks!