datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

avro.io.AvroTypeException: The datum is not an example of the schema #2078

Closed eeennn closed 3 years ago

eeennn commented 3 years ago

mysql_etl_error.txt Why is the datum not an example of the schema? I used the official Python file at /datahub-master/contrib/metadata-ingestion/python/mysql-etl/mysql_etl.py. What did I do wrong? Thanks for your reply.

shirshanka commented 3 years ago

Thanks for reporting this. It looks like a bug in the Python script: your time field is a float, when it should be a long.

`'lastModified': {'time': 1612314316.6865983, ...`
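The float likely creeps in when a script uses `time.time()` (which returns seconds as a float) directly for the audit timestamp. A minimal sketch of the kind of fix involved, coercing to integer milliseconds so Avro gets the long it expects (the helper name here is illustrative, not from the script):

```python
import time

def epoch_millis():
    """Current time as integer milliseconds since the epoch.

    Avro 'long' fields reject Python floats, so truncate time.time()
    (a float, in seconds) to an int of milliseconds.
    """
    return int(time.time() * 1000)

ts = epoch_millis()
print(type(ts).__name__)  # int, suitable for an Avro long
```

Note that the corrected timestamps in the later traceback (e.g. `1612334922000`) are exactly this shape: integer epoch milliseconds.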

Could you use the ingestion script from the top-level metadata-ingestion directory? https://github.com/linkedin/datahub/blob/master/metadata-ingestion/sql-etl/mysql_etl.py

The bug seems to be fixed there: https://github.com/linkedin/datahub/commit/fa58c2d161ce5ee6295ca5bfa2f7b5041dac759d

We'll fix the ingestion scripts.

eeennn commented 3 years ago

Thanks for your reply! I used that Python script (/metadata-ingestion/sql-etl/mysql_etl.py) and the time field is changed, but the same error happened.

File "F:\Python_Project\datahub\venv\lib\site-packages\avro\io.py", line 809, in write
    raise AvroTypeException(self.writer_schema, datum)
avro.io.AvroTypeException: The datum {'auditHeader': None, 'proposedSnapshot': ('com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot', {'urn': 'urn:li:dataset:(urn:li:dataPlatform:mysql,atlas.lineage_relation,PROD)', 'aspects': [('com.linkedin.pegasus2avro.schema.SchemaMetadata', {'schemaName': 'atlas.lineage_relation', 'platform': 'urn:li:dataPlatform:mysql', 'version': 0, 'created': {'time': 1612334922000, 'actor': 'urn:li:corpuser:etl'}, 'lastModified': {'time': 1612334922000, 'actor': 'urn:li:corpuser:etl'}, 'hash': '', 'platformSchema': {'tableSchema': ''}, 'fields': [{'fieldPath': 'relationship_id', 'nativeDataType': 'VARCHAR(length=255)', 'type': {'type': ('com.linkedin.pegasus2avro.schema.StringType', {})}, 'description': None}, {'fieldPath': 'from_entity_id', 'nativeDataType': 'VARCHAR(length=255)', 'type': {'type': ('com.linkedin.pegasus2avro.schema.StringType', {})}, 'description': None}, {'fieldPath': 'to_entity_id', 'nativeDataType': 'VARCHAR(length=255)', 'type': {'type': ('com.linkedin.pegasus2avro.schema.StringType', {})}, 'description': None}]})]}), 'proposedDelta': None} is not an example of the schema
{ "type": "record", "name": "MetadataChangeEvent", "namespace": "com.linkedin.pegasus2avro.mxe", "fields": [ { "type": [ "null", {

eeennn commented 3 years ago

I used the datum from https://github.com/linkedin/datahub/issues/1953, but it also fails with "the datum is not an example of the schema". Is there a problem with the MetadataChangeEvent.avsc v1 or v2?
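For context on where this exception comes from: avro-python3's `DatumWriter.write` validates the datum against the writer schema before encoding and raises `AvroTypeException` on any mismatch. A simplified, self-contained sketch of its rule for a "long" field (a mimic of the library's check, not its actual code), which is what rejected the original float timestamp:

```python
# Simplified mimic of the "long" validation that avro-python3 applies
# inside DatumWriter.write (the real check lives in avro/io.py).
LONG_MIN, LONG_MAX = -(1 << 63), (1 << 63) - 1

def is_valid_avro_long(datum):
    # An Avro long must be a Python int (bools are excluded) within
    # 64-bit signed range; a float fails this test outright.
    return (isinstance(datum, int)
            and not isinstance(datum, bool)
            and LONG_MIN <= datum <= LONG_MAX)

print(is_valid_avro_long(1612334922000))       # integer millis: valid
print(is_valid_avro_long(1612314316.6865983))  # float seconds: invalid
```

Since the timestamps in the traceback above are already integers, the remaining mismatch must be in some other part of the record, which is consistent with the dependency/schema fix merged below.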

shirshanka commented 3 years ago

I just merged in a fix for the dependency issue, can you re-pull and check? https://github.com/linkedin/datahub/pull/2082

eeennn commented 3 years ago

Also, the Python dependencies file (https://github.com/linkedin/datahub/blob/master/metadata-ingestion/sql-etl/common.txt) has an error:

ERROR: Cannot install avro-python3==1.8.2 and confluent-kafka[avro]==1.4.0 because these package versions have conflicting dependencies.
The conflict is caused by:
    The user requested avro-python3==1.8.2
    confluent-kafka[avro] 1.4.0 depends on avro-python3==1.9.2.1; python_version >= "3.0" and extra == "avro"
...

eeennn commented 3 years ago

Maybe confluent-kafka[avro] needs to be <=1.3.0, given the avro-python3==1.8.2 pin?

ERROR: Cannot install avro-python3==1.8.2 and confluent-kafka[avro]==1.5.0 because these package versions have conflicting dependencies.
The conflict is caused by:
    The user requested avro-python3==1.8.2
    confluent-kafka[avro] 1.5.0 depends on avro-python3==1.10.0; python_version > "3.0" and extra == "avro"
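If the pins really are unsatisfiable under a resolver that enforces them, the two obvious ways out follow directly from the error messages quoted above. A hypothetical requirements fragment (version choices are illustrative, not tested against these scripts):

```
# Option A: keep confluent-kafka 1.4.0 and align avro-python3 with
# what its "avro" extra declares (per the pip error above).
confluent-kafka[avro]==1.4.0
avro-python3==1.9.2.1

# Option B: cap confluent-kafka, as suggested above, so that the
# existing avro-python3==1.8.2 pin remains satisfiable.
confluent-kafka[avro]<=1.3.0
avro-python3==1.8.2
```

Either way, only one of the two pins can be the source of truth for the avro-python3 version.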

hsheth2 commented 3 years ago

That's quite odd - you're totally right that it should be a conflict. And yet, the following commands worked on my machine:

# In the sql-etl directory.
python -m venv venv
source ./venv/bin/activate
pip install -r common.txt -r mysql_etl.txt
python mysql_etl.py  # after adding my own mysql instance URI

For reference, this produced the following set of deps:

$ pip freeze
avro-python3==1.8.2
certifi==2020.12.5
chardet==4.0.0
confluent-kafka==1.6.0
fastavro==1.3.1
idna==2.10
PyMySQL==0.9.3
requests==2.25.1
SQLAlchemy==1.3.17
urllib3==1.26.3
eeennn commented 3 years ago

Hello, it still does not work; it's so weird. I used the following commands:

# In the sql-etl directory.
python -m venv venv
source ./venv/bin/activate
pip install -r common.txt -r mysql_etl.txt
python mysql_etl.py

These are my deps:

(venv) [root@datahub-gms sql-etl]# pip freeze
avro-python3==1.8.2
certifi==2020.12.5
chardet==4.0.0
confluent-kafka==1.6.0
dataclasses==0.8
fastavro==1.3.1
idna==2.10
PyMySQL==0.9.3
requests==2.25.1
SQLAlchemy==1.3.17
urllib3==1.26.3

eeennn commented 3 years ago

Thanks, it works now. I used the new scripts (https://github.com/linkedin/datahub/tree/master/metadata-ingestion).