Open vade opened 1 year ago
Specific failure seems to be a mis-reading of types in the source:
2023-03-22 23:22:42 destination > {'error': [{'message': "invalid string property 'total_tested' on class 'Test': not a string, but json.Number"}]}
Thanks for the issue, @vade. Any chance you could post the CSV file as well?
Apologies. It's from your example CSVs:
https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv
Now that I'm wrapping my brain around the functionality of Airbyte, I think I can see the issue more clearly.
The CSV files:
date,key,new_confirmed,new_deceased,new_recovered,new_tested,total_confirmed,total_deceased,total_recovered,total_tested
2020-09-24,AE,1002,1,,93618,88532,407,,9130551
2020-09-24,AF,0,0,,,39170,1451,,
2020-09-24,AM,392,2,,,48643,947,,
2020-09-24,AT,688,6,,18518,41246,783,,1507782
2020-09-24,AU,10,2,,47634,26983,861,,7441327
What's interesting is that yesterday's File Source (0.2.34) had incorrect data types on the source side:
But after updating to 0.2.35, the source schema now shows seemingly correct field types:
However, I still have destination errors, implying the Weaviate destination connector is not inferring the types correctly:
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'total_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'total_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
Full log for error with 0.2.35 File source and Weaviate 0.1.1
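To see at a glance which properties are failing, the destination log lines can be tallied. This is a minimal sketch: the regular expression is only fitted to the error text quoted above, and `tally_type_errors` is a hypothetical helper, not part of Airbyte or Weaviate.

```python
import re
from collections import Counter

# Fitted to the error text in the log lines quoted above (an assumption:
# other Weaviate error messages may be worded differently).
ERROR_RE = re.compile(
    r"invalid (?P<expected>\w+) property '(?P<prop>\w+)' on class '(?P<cls>\w+)'"
)


def tally_type_errors(log_lines):
    """Count failures per (class, property, expected type)."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            counts[(match["cls"], match["prop"], match["expected"])] += 1
    return counts


# Three lines copied from the log excerpt above.
LOG = '''\
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'total_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
'''

counts = tally_type_errors(LOG.splitlines())
for (cls, prop, expected), n in counts.most_common():
    print(f"{cls}.{prop}: expected {expected}, failed {n} time(s)")
```

Run against the full log, this should show whether the failures are limited to the columns that contain empty values.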
I'm wondering now if the issue is one of null values. I know that Weaviate requires null-value support to be added to the schema via additional optional keys on the Weaviate invertedIndexConfig, which is GLOBAL for all properties on the class schema.
I imagine a simple change to the Weaviate destination, appending a fixed set of best-practice properties, might solve this. I'll try to set up the dev env and make a PR.
"invertedIndexConfig": { // Optional, index configuration
"indexTimestamps": false, // Optional, maintains inverted indices for each object by its internal timestamps
"indexNullState": false, // Optional, maintains inverted indices for each property regarding its null state
"indexPropertyLength": false // Optional, maintains inverted indices for each property by its length
},
I think I've isolated the issue, but am unsure how best to fix it as I am new to Airbyte.
The issue is a combination of a few scenarios:
1) The data source has some optionally null elements, i.e. some values are missing and are assumed to be null (Python None).
2) The Weaviate destination connector relies on the auto-schema feature of the Weaviate server to set up the schema definitions, so the input schema introspected by Airbyte never gets a chance to apply its inferred types.
3) Auto-schema in Weaviate appears to fall back to string types on None input.
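The three points above can be illustrated with a small sketch. It parses two rows of the sample CSV and applies a hypothetical "first value seen decides the type" rule. `first_seen_type` is only a model of the suspected fallback, an assumption on my part, not Weaviate's actual auto-schema code.

```python
import csv
import io

# Header and first two data rows from the sample CSV above.
SAMPLE = """\
date,key,new_confirmed,new_deceased,new_recovered,new_tested,total_confirmed,total_deceased,total_recovered,total_tested
2020-09-24,AE,1002,1,,93618,88532,407,,9130551
2020-09-24,AF,0,0,,,39170,1451,,
"""


def parse_value(raw):
    """Empty CSV fields become None; all-digit fields become int."""
    if raw == "":
        return None
    return int(raw) if raw.isdigit() else raw


def first_seen_type(value):
    """Hypothetical model: infer a type from the first value, None -> 'string'."""
    if value is None:
        return "string"  # the suspected fallback
    if isinstance(value, (int, float)):
        return "number"
    return "string"


rows = [
    {key: parse_value(raw) for key, raw in row.items()}
    for row in csv.DictReader(io.StringIO(SAMPLE))
]

# Types "locked in" by the first row under the assumed rule.
inferred = {key: first_seen_type(value) for key, value in rows[0].items()}
print(inferred["new_recovered"], inferred["total_recovered"], inferred["total_tested"])
```

Under this model, new_recovered and total_recovered are fixed as string by the first row, so any later row carrying a number for them would be rejected, which matches the errors in the log.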
I've verified this in theory. Evidence:
The first non-header row of the CSV:
2020-09-24,AE,1002,1,,93618,88532,407,,9130551
Note that total_recovered and new_recovered have null / None entries.
Now note the schema generated automatically by Weaviate: both properties were created as string, while the non-null values sent for them later are numbers, resulting in a schema data type collision.
{
"dataType": [
"string"
],
"description": "This property was generated by Weaviate's auto-schema feature on Thu Mar 23 17:34:55 2023",
"name": "total_recovered",
"tokenization": "word"
},
...
{
"dataType": [
"string"
],
"description": "This property was generated by Weaviate's auto-schema feature on Thu Mar 23 17:34:55 2023",
"name": "new_recovered",
"tokenization": "word"
},
The solution to this bug is to ensure that the Weaviate connector manually creates the schema and does not rely on Autoschema. This also provides opportunity for the Airbyte destination to have additional customization.
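A minimal sketch of what creating the schema on the client side could look like, assuming the connector has access to the stream's JSON schema. The `JSON_TO_WEAVIATE` mapping and both helper functions are hypothetical names for illustration, not the connector's actual code.

```python
# Assumed mapping from JSON-schema types to Weaviate data types.
JSON_TO_WEAVIATE = {
    "string": "text",
    "integer": "int",
    "number": "number",
    "boolean": "boolean",
}


def json_type(prop_schema):
    """Return the first non-null JSON-schema type declared for a property."""
    declared = prop_schema.get("type", "string")
    types = declared if isinstance(declared, list) else [declared]
    non_null = [t for t in types if t != "null"]
    return non_null[0] if non_null else "string"


def build_class(class_name, json_schema):
    """Build a Weaviate class definition dict from a stream's JSON schema."""
    properties = [
        {"name": name, "dataType": [JSON_TO_WEAVIATE.get(json_type(spec), "text")]}
        for name, spec in json_schema.get("properties", {}).items()
    ]
    return {"class": class_name, "properties": properties}


# Illustrative stream schema; nullable columns are declared as ["null", <type>].
stream_schema = {
    "properties": {
        "date": {"type": ["null", "string"]},
        "new_recovered": {"type": ["null", "number"]},
        "total_recovered": {"type": ["null", "number"]},
    }
}

class_obj = build_class("Test", stream_schema)
print(class_obj)
```

The resulting dict could then be handed to the Weaviate client's schema-creation call before any records are written (in the v3 Python client this is, as far as I know, `client.schema.create_class`), so that a None in the first record no longer decides the property type.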
For the @aibyte team: what is the best entry point or message location to generate the client-side schema? Akin to "when do I create a table for a Postgres destination?" I'll look that up.
I'm also working on a PR that enables batching as well.
You can still pre-create your Weaviate schema instead of relying on auto-schema when using the Airbyte connector. This is helpful if you want to ensure all properties are of the correct type, or if you have other specific configuration you want to ensure is present, such as moduleConfig in a class.
Environment
Airbyte version: 0.42.0
OS Version / Instance: macOS 13.2.1
Deployment: Docker
Source Connector and version: File : specifically
https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv
Destination Connector and version: Weaviate - Alpha
Step where error happened: Sync
Current Behavior
From what I can gather, the Weaviate connector is having trouble deducing data types from the CSV file. Some data has made it over to Weaviate, but the sync fails.
Expected Behavior
All of epidemiology.csv is ingested.
Logs
e2fc4898_fd69_43c3_9375_22c937c2ff99_logs_3_txt.txt
Steps to Reproduce
Step 1: Set up a CSV file source from the example epidemiology.csv samples.
Step 2: Set up a fresh Weaviate instance via docker-compose, using Weaviate 1.18.1.
Step 3: Attempt a sync. It fails after 3 tries.
Are you willing to submit a PR?
Potentially! I'm new to Airbyte but am happy to help however I can.