airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Destination Weaviate - Epidemiology.csv example file source not functioning #24378

Open vade opened 1 year ago

vade commented 1 year ago

Environment

Current Behavior

From what I can gather, the Weaviate connector is having trouble deducing data types from the CSV file. Only some of the data has made it over to Weaviate.

Expected Behavior

All of Epidemiology.csv is ingested.

Logs

e2fc4898_fd69_43c3_9375_22c937c2ff99_logs_3_txt.txt

Steps to Reproduce

Step 1: Set up a CSV file source from the example Epidemiology.csv sample.
Step 2: Set up a fresh Weaviate instance via docker-compose, using Weaviate 1.18.1.
Step 3: Attempt a sync. It fails after 3 tries.

Are you willing to submit a PR?

Potentially! I'm new to Airbyte but am happy to help however I can.

vade commented 1 year ago

Specific failure seems to be a mis-reading of types in the source:

2023-03-22 23:22:42 destination > {'error': [{'message': "invalid string property 'total_tested' on class 'Test': not a string, but json.Number"}]}

sajarin commented 1 year ago

Thanks for the issue, @vade. Any chance you could post the CSV file as well?

vade commented 1 year ago

Apologies. It's from your example CSVs:

https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv

vade commented 1 year ago

Now that I'm wrapping my brain around the functionality of Airbyte, I think I can see the issue more clearly.

The CSV file:

date,key,new_confirmed,new_deceased,new_recovered,new_tested,total_confirmed,total_deceased,total_recovered,total_tested
2020-09-24,AE,1002,1,,93618,88532,407,,9130551
2020-09-24,AF,0,0,,,39170,1451,,
2020-09-24,AM,392,2,,,48643,947,,
2020-09-24,AT,688,6,,18518,41246,783,,1507782
2020-09-24,AU,10,2,,47634,26983,861,,7441327

What's interesting is that yesterday's File Source (0.2.34) had the incorrect data types on the source side:

image

But updating to 0.2.35 now gives correct-seeming field types in the source schema:

image

However, I still get destination errors, implying the Weaviate destination connector is not inferring the types correctly:

2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'total_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'total_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}
2023-03-23 17:37:22 destination > {'error': [{'message': "invalid string property 'new_recovered' on class 'Test': not a string, but json.Number"}]}

vade commented 1 year ago

Full log for error with 0.2.35 File source and Weaviate 0.1.1

e2fc4898_fd69_43c3_9375_22c937c2ff99_logs_4_txt.txt

vade commented 1 year ago

I'm now wondering whether the issue is one of null values. I know that Weaviate requires null value support to be added to the schema via additional optional keys on the Weaviate invertedIndexConfig, which is GLOBAL for all properties on the class schema.

I imagine a simple change to the Weaviate destination, appending a fixed set of best-practice properties, might solve this. I'll try to set up the dev env and make a PR.

"invertedIndexConfig": {                  // Optional, index configuration
    "indexTimestamps": false,               // Optional, maintains inverted indices for each object by its internal timestamps
    "indexNullState": false,                // Optional, maintains inverted indices for each property regarding its null state
    "indexPropertyLength": false            // Optional, maintains inverted indices for each property by its length
  },

vade commented 1 year ago

I think I've isolated the issue, but am unsure how best to fix it, as I am new to Airbyte.

The issue is a combination of a few scenarios:

1) The data source has some optionally null elements, i.e. some values are missing and are assumed to be null (Python None).
2) The Weaviate destination connector relies on the auto-schema feature of the Weaviate server to set up the schema definitions, which does not give the input schema as introspected by Airbyte a chance to actually apply its inferred types.
3) Auto-schema in Weaviate appears to fall back to string types on None input.
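The three scenarios above can be sketched as a toy model. This is not Weaviate's code: `infer_type` and `ingest` are hypothetical helpers, the "first value seen fixes the type" rule and the "None becomes string" fallback are assumptions taken from this thread, and the second row is made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def infer_type(value: Any) -> str:
    """Toy stand-in for auto-schema: guess a data type from a single value."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    # Assumption from this thread: a None value ends up typed as string.
    return "string"


def ingest(rows: List[Dict[str, Any]]) -> Tuple[Dict[str, str], List[str]]:
    """Fix each property's type on first sight, then report later mismatches."""
    schema: Dict[str, str] = {}
    errors: List[str] = []
    for row in rows:
        for name, value in row.items():
            if name not in schema:
                schema[name] = infer_type(value)  # first value seen wins
            elif value is not None and infer_type(value) != schema[name]:
                errors.append(
                    f"invalid {schema[name]} property '{name}': got {infer_type(value)}"
                )
    return schema, errors


rows = [
    {"key": "AE", "new_recovered": None, "total_tested": 9130551},
    {"key": "XX", "new_recovered": 12, "total_tested": 100},  # hypothetical row
]
schema, errors = ingest(rows)
print(schema)
print(errors)
```

In this model `new_recovered` is locked in as a string by the first row, so the numeric value in the second row is rejected, which has the same shape as the errors in the destination log.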

I've verified this in theory. Evidence:

The first non header row of the CSV:

image

Note that total_recovered and new_recovered have null / None entries.
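A quick way to check which columns are empty in the first data row, as a minimal sketch. It uses the header and the first row quoted earlier in this thread, and it assumes that an empty CSV cell is what arrives downstream as None; `read_rows` is a hypothetical helper, not part of the File source.

```python
import csv
import io

# Header and first data row as quoted earlier in this thread.
SAMPLE = (
    "date,key,new_confirmed,new_deceased,new_recovered,new_tested,"
    "total_confirmed,total_deceased,total_recovered,total_tested\n"
    "2020-09-24,AE,1002,1,,93618,88532,407,,9130551\n"
)


def read_rows(text):
    """Parse CSV text into dicts, mapping empty cells to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: (cell if cell != "" else None) for name, cell in row.items()}
        for row in reader
    ]


first_row = read_rows(SAMPLE)[0]
null_columns = [name for name, cell in first_row.items() if cell is None]
print(null_columns)  # ['new_recovered', 'total_recovered']
```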

Now note the schema generated automatically by Weaviate: both properties were given the string data type, which differs from the numeric values that later rows supply, resulting in a schema data type collision.

        {
          "dataType": [
            "string"
          ],
          "description": "This property was generated by Weaviate's auto-schema feature on Thu Mar 23 17:34:55 2023",
          "name": "total_recovered",
          "tokenization": "word"
        },

        ...

        {
          "dataType": [
            "string"
          ],
          "description": "This property was generated by Weaviate's auto-schema feature on Thu Mar 23 17:34:55 2023",
          "name": "new_recovered",
          "tokenization": "word"
        },

The solution to this bug is to ensure that the Weaviate connector creates the schema manually and does not rely on auto-schema. This also gives the Airbyte destination an opportunity for additional customization.
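A rough sketch of what creating the schema manually could look like, assuming the connector has access to the stream's JSON schema as inferred by the source. `JSON_TO_WEAVIATE`, `weaviate_type` and `build_class` are hypothetical names, the type mapping is an assumption, and the example stream schema is trimmed to three columns.

```python
from typing import Any, Dict, List, Union

# Assumed mapping from JSON-schema types to Weaviate data types.
JSON_TO_WEAVIATE = {
    "string": "text",
    "integer": "int",
    "number": "number",
    "boolean": "boolean",
}


def weaviate_type(json_type: Union[str, List[str]]) -> str:
    """Pick a Weaviate data type, ignoring the "null" member of nullable types."""
    types = json_type if isinstance(json_type, list) else [json_type]
    non_null = [t for t in types if t != "null"]
    if not non_null:
        return "text"
    return JSON_TO_WEAVIATE.get(non_null[0], "text")


def build_class(class_name: str, json_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Build a Weaviate class definition from a stream's JSON schema."""
    properties = [
        {"name": name, "dataType": [weaviate_type(spec.get("type", "string"))]}
        for name, spec in json_schema["properties"].items()
    ]
    return {"class": class_name, "properties": properties}


# Trimmed, illustrative schema for the epidemiology stream.
stream_schema = {
    "properties": {
        "key": {"type": ["null", "string"]},
        "new_recovered": {"type": ["null", "number"]},
        "total_tested": {"type": ["null", "number"]},
    }
}
class_obj = build_class("Test", stream_schema)
print(class_obj)
```

With the v3 Python client, a dict like this could presumably be passed to `client.schema.create_class(class_obj)` before the first record is written. Where that call belongs in the connector is not something this sketch settles.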

For the @aibyte team: what is the best entry point or message location to generate the client-side schema? Akin to "when do I create a table for a Postgres destination?" I'll look that up.

I'm also working on a PR that enables batching.

samos123 commented 1 year ago

You can still precreate your Weaviate schema instead of relying on autoschema when using the airbyte connector. This is helpful in case you want to ensure all properties are of the correct type or if you have other specific configuration you want to ensure is present such as moduleConfig in a class.