airbytehq / airbyte


Source: Apify - No properties node in stream schema #24701

**Open** Ā· tybernstein opened this issue 1 year ago

tybernstein commented 1 year ago
## Environment

- **Airbyte version**: Cloud
- **Step where error happened**: Sync job

## Current Behavior

The Replication tab is unable to display the source schema, and the sync fails with the following error message:

```
2023-03-30 13:09:55 normalization > Traceback (most recent call last):
2023-03-30 13:09:55 normalization >   File "/usr/local/bin/transform-catalog", line 8, in <module>
2023-03-30 13:09:55 normalization >     sys.exit(main())
2023-03-30 13:09:55 normalization >   File "/usr/local/lib/python3.9/site-packages/normalization/transform_catalog/transform.py", line 111, in main
2023-03-30 13:09:55 normalization >     TransformCatalog().run(args)
2023-03-30 13:09:55 normalization >   File "/usr/local/lib/python3.9/site-packages/normalization/transform_catalog/transform.py", line 36, in run
2023-03-30 13:09:55 normalization >     self.process_catalog()
2023-03-30 13:09:55 normalization >   File "/usr/local/lib/python3.9/site-packages/normalization/transform_catalog/transform.py", line 64, in process_catalog
2023-03-30 13:09:55 normalization >     processor.process(catalog_file=catalog_file, json_column_name=json_col, default_schema=schema)
2023-03-30 13:09:55 normalization >   File "/usr/local/lib/python3.9/site-packages/normalization/transform_catalog/catalog_processor.py", line 55, in process
2023-03-30 13:09:55 normalization >     stream_processors = self.build_stream_processor(
2023-03-30 13:09:55 normalization >   File "/usr/local/lib/python3.9/site-packages/normalization/transform_catalog/catalog_processor.py", line 146, in build_stream_processor
2023-03-30 13:09:55 normalization >     properties = get_field(get_field(stream_config, "json_schema", message), "properties", message)
2023-03-30 13:09:55 normalization >   File "/usr/local/lib/python3.9/site-packages/normalization/transform_catalog/catalog_processor.py", line 238, in get_field
2023-03-30 13:09:55 normalization >     raise KeyError(message)
2023-03-30 13:09:55 normalization > KeyError: "'json_schema'.'properties' are not defined for stream DatasetItems"
```

## Expected Behavior

The connector should be able to recognize the dataset schema and sync successfully.

## Logs

[tyler__airbyte_logs_1689166_txt.txt](https://github.com/airbytehq/airbyte/files/11112458/tyler__airbyte_logs_1689166_txt.txt)

## Steps to Reproduce

1. Set up an Apify source connector
2. Sync to any destination connector
3. Even if the sync reports success, no data is synced
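For context on the traceback above: normalization's catalog processor looks up `properties` inside each stream's `json_schema` and raises the `KeyError` when that node is missing. Below is a minimal illustration of the two shapes; this is not the connector's actual schema definition, and the field name `data` is just a placeholder.

```python
# Hypothetical illustration of why normalization raises the KeyError: the catalog
# processor reads json_schema["properties"] for every stream, so a schema declared
# only as {"type": "object"} has nothing to read.

# Roughly the shape reported for the DatasetItems stream: an object with no
# "properties" node.
failing_schema = {
    "type": "object",
}

# A shape normalization can work with: the same object, but with a "properties"
# mapping. "data" is only an example field name.
working_schema = {
    "type": "object",
    "properties": {
        "data": {"type": "object"},
    },
}


def has_properties(json_schema: dict) -> bool:
    # Mirrors the lookup that fails in catalog_processor.build_stream_processor.
    return isinstance(json_schema.get("properties"), dict)


assert not has_properties(failing_schema)
assert has_properties(working_schema)
```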
erohmensing commented 1 year ago

The offending area of code: https://github.com/airbytehq/airbyte/blob/2e099acc52aeddc172c5ee66ed98426c69449a4a/airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/source.py#L86-L90

It looks like it defines the dataset as an object but doesn't define its properties. It probably needs a smarter rework with dynamic schema discovery, since it could be any dataset and we don't know the shape of the data ahead of time.
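One possible direction is sketched below: sample a few items from the dataset at discovery time and build a schema that actually carries a `properties` node. This is only a hedged sketch, not the connector's real implementation; it assumes the public Apify `GET /v2/datasets/{datasetId}/items` endpoint and a naive type mapping, and the helper name `infer_dataset_schema` is made up here.

```python
from typing import Any

import requests

# Rough mapping from Python value types to JSON-schema types for top-level fields.
_TYPE_MAP = {str: "string", bool: "boolean", int: "integer", float: "number",
             list: "array", dict: "object"}


def infer_dataset_schema(dataset_id: str, token: str, sample_size: int = 100) -> dict:
    """Sketch of dynamic discovery: sample a few dataset items from the Apify API
    and build a JSON schema that carries a `properties` node."""
    resp = requests.get(
        f"https://api.apify.com/v2/datasets/{dataset_id}/items",
        params={"token": token, "format": "json", "limit": sample_size},
        timeout=30,
    )
    resp.raise_for_status()

    properties: dict[str, Any] = {}
    for item in resp.json():
        if not isinstance(item, dict):
            continue  # e.g. a bare array of values: nothing to name
        for key, value in item.items():
            json_type = _TYPE_MAP.get(type(value), "string")
            # Keep it permissive: every field is nullable, since later items may omit it.
            properties.setdefault(key, {"type": ["null", json_type]})

    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "additionalProperties": True,
        "properties": properties,
    }
```

Even when sampling comes back empty, returning `"properties": {}` alongside `additionalProperties: true` should at least give normalization the node it is looking for, though the resulting table would have no typed columns.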

erohmensing commented 1 year ago

Interestingly enough, discover is fine with this; it's normalization that breaks, I guess because it doesn't know how to infer the types of the data that are coming through.

Edit: it fails without normalization too (screenshot attached in the original comment).

wkargul commented 10 months ago

Hey there! šŸ˜Š

I've encountered a similar issue, but with the Files source connector. Here's what I'm passing as the file:

```
[["2000-06-05",116],["2000-06-06",129],["2000-06-07",135],["2000-06-08",86]]
```

I'm getting the same error as mentioned above. Any insights? šŸ™
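One observation on that payload: it is a top-level JSON array of `[date, value]` pairs with no field names, so schema discovery has nothing to put under `properties`, which leads to the same missing-`properties` error. If you control the file, a small preprocessing step can turn it into records with named fields; the sketch below is only an illustration, and the names `date` and `value` are made up here.

```python
import json

# Example input taken from the comment above: a top-level array of [date, value] pairs.
raw = '[["2000-06-05",116],["2000-06-06",129],["2000-06-07",135],["2000-06-08",86]]'

# Reshape the anonymous pairs into JSON Lines, one object per record, so each
# key can become a discoverable column in the stream schema.
with open("timeseries.jsonl", "w") as fh:
    for date, value in json.loads(raw):
        fh.write(json.dumps({"date": date, "value": value}) + "\n")
```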