datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0
9.45k stars, 2.8k forks

Undocumented breaking change in Ownership model of 0.13.0 #10136

Open nclaeys opened 3 months ago

nclaeys commented 3 months ago

datahub issue:

Describe the bug: In the latest release, 0.13.0, the property ownerTypes was added to the Ownership object, but this is not documented as a breaking change. A client upgraded to 0.13.0 cannot ingest new ownership aspects until the GMS is upgraded to the same version, because the older GMS rejects the payload: it does not recognise the ownerTypes property.
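To illustrate the failure mode, here is a toy model of strict record validation. It is not DataHub's or Rest.li's actual validator; the function name and the two field sets are assumptions read off the payloads in this issue, and only the error wording is copied from the log.

```python
# Toy model of strict validation: any field the server's schema does not
# know is an error. The field sets below are assumptions taken from the
# two payloads in this issue, not from the real Ownership schema.
KNOWN_FIELDS_GMS_0_12 = {"owners", "lastModified"}
KNOWN_FIELDS_GMS_0_13 = KNOWN_FIELDS_GMS_0_12 | {"ownerTypes"}


def validate_strict(record: dict, known_fields: set) -> list:
    """Return one error string per top-level field missing from known_fields."""
    return [
        f"ERROR :: /{name} :: unrecognized field found but not allowed"
        for name in sorted(record)
        if name not in known_fields
    ]


# Shape of the aspect a 0.13.0 client sends (owners list shortened).
new_client_record = {
    "owners": [],
    "ownerTypes": {},
    "lastModified": {"time": 0, "actor": "urn:li:corpuser:airflow"},
}

errors_old_server = validate_strict(new_client_record, KNOWN_FIELDS_GMS_0_12)
errors_new_server = validate_strict(new_client_record, KNOWN_FIELDS_GMS_0_13)
print(errors_old_server)
print(errors_new_server)
```

In this model the older field set flags ownerTypes, while the newer one accepts the same record, which matches the behaviour described above.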

To Reproduce Steps to reproduce the behavior:

  1. Install GMS 0.12.1
  2. Use the acryl-datahub-airflow-plugin 0.13.0 to ingest lineage information
  3. Ingest the lineage; the emitter fails to send the metadata to DataHub
  4. See the error:
[2024-03-25, 00:18:32 UTC] {datahub_plugin_v22.py:88} ERROR - Error sending metadata to datahub: ('Unable to emit metadata to DataHub GMS: Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /ownerTypes :: unrecognized field found but not allowed\n', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /ownerTypes :: unrecognized field found but not allowed\n', 'status': 422})
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub/emitter/rest_emitter.py", line 285, in _emit_generic
    response.raise_for_status()
  File "/home/airflow/.local/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://datahub.rainman.dataminded.cloud/api/gms/aspects?action=ingestProposal
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub/emitter/rest_emitter.py", line 223, in emit
    self.emit_mcp(item)
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub/emitter/rest_emitter.py", line 264, in emit_mcp
    self._emit_generic(url, payload)
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub/emitter/rest_emitter.py", line 293, in _emit_generic
    raise OperationalError(
datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS: Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /ownerTypes :: unrecognized field found but not allowed\n', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /ownerTypes :: unrecognized field found but not allowed\n', 'status': 422})
[2024-03-25, 00:18:32 UTC] {logging_mixin.py:188} INFO - Exception: Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub/emitter/rest_emitter.py", line 285, in _emit_generic
    response.raise_for_status()
  File "/home/airflow/.local/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://datahub.rainman.dataminded.cloud/api/gms/aspects?action=ingestProposal
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub_airflow_plugin/datahub_plugin_v22.py", line 262, in custom_on_success_callback
    datahub_task_status_callback(context, status=InstanceRunResult.SUCCESS)
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub_airflow_plugin/datahub_plugin_v22.py", line 115, in datahub_task_status_callback
    dataflow.emit(emitter, callback=_make_emit_callback(task.log))
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub/api/entities/datajob/dataflow.py", line 171, in emit
    emitter.emit(mcp, callback)
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub/emitter/rest_emitter.py", line 223, in emit
    self.emit_mcp(item)
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub/emitter/rest_emitter.py", line 264, in emit_mcp
    self._emit_generic(url, payload)
  File "/home/airflow/.local/lib/python3.11/site-packages/datahub/emitter/rest_emitter.py", line 293, in _emit_generic
    raise OperationalError(
datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS: Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /ownerTypes :: unrecognized field found but not allowed\n', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /ownerTypes :: unrecognized field found but not allowed\n', 'status': 422})

The change was introduced in this commit: https://github.com/datahub-project/datahub/commit/ed10a8d8cca3b17e982db6d14ea435833c5a87ea#diff-d08f131b5220b63a5f3ce2e254f76e9ce0de6ac14f00cae2be14b553d0e9a7a4

Expected behavior: Document this as a breaking change so that users know to check their GMS version before upgrading the client. We have had customers update their client version without knowing that it would break metadata ingestion.
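Until this is documented, one possible guard is to compare the client release against the GMS release before emitting. The sketch below is only that: both helper names are hypothetical, the version strings are supplied by the caller (how to obtain the server version is not shown), and treating "client minor version ahead of server" as the risky case is an assumption based on this one incident, not a documented compatibility rule.

```python
def parse_release(version: str) -> tuple:
    """Turn '0.12.1.5' into (0, 12, 1, 5). Assumes numeric, dot-separated parts."""
    return tuple(int(part) for part in version.split("."))


def client_is_ahead_of_server(client_version: str, server_version: str) -> bool:
    """True when the client's major.minor is newer than the server's.

    Hypothetical rule: patch-level differences are ignored.
    """
    return parse_release(client_version)[:2] > parse_release(server_version)[:2]


# Versions taken from this issue.
risky = client_is_ahead_of_server("0.13.0", "0.12.1")    # failing combination
safe = client_is_ahead_of_server("0.12.1.5", "0.12.1")   # working combination
print(risky, safe)
```

A deployment could log a warning or refuse to emit when the check returns True, instead of failing on the first rejected aspect.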

Screenshots

Rejected metadata call to DataHub:

curl -X POST -H 'User-Agent: python-requests/2.31.0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'Connection: keep-alive' -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'Content-Type: application/json' -H 'Authorization: <redacted>' --data '{"proposal": {"entityType": "dataFlow", "entityUrn": "urn:li:dataFlow:(airflow,airflow-task,prod)", "changeType": "UPSERT", "aspectName": "ownership", "aspect": {"value": "{\"owners\": [{\"owner\": \"urn:li:corpuser:SomeUser\", \"type\": \"DEVELOPER\", \"source\": {\"type\": \"SERVICE\"}}], \"ownerTypes\": {}, \"lastModified\": {\"time\": 0, \"actor\": \"urn:li:corpuser:airflow\"}}", "contentType": "application/json"}}}' 'https://<url>/api/gms/aspects?action=ingestProposal'

With version 0.12.1.5 of the acryl-datahub-airflow-plugin, the metadata update succeeds:

curl -X POST -H 'User-Agent: python-requests/2.31.0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'Connection: keep-alive' -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'Content-Type: application/json' -H 'Authorization: <redacted>' --data '{"proposal": {"entityType": "dataJob", "entityUrn": "urn:li:dataJob:(urn:li:dataFlow:(airflow,airflow-task,prod),ingest-weather-mx_nano)", "changeType": "UPSERT", "aspectName": "ownership", "aspect": {"value": "{\"owners\": [{\"owner\": \"urn:li:corpuser:SomeUser\", \"type\": \"DEVELOPER\", \"source\": {\"type\": \"SERVICE\"}}], \"lastModified\": {\"time\": 0, \"actor\": \"urn:li:corpuser:airflow\"}}", "contentType": "application/json"}}}' 'https://<url>/api/gms/aspects?action=ingestProposal'
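The difference between the two requests can be checked mechanically. The snippet below copies the aspect values from the two curl commands above (unescaped) and compares their top-level keys; the helper name is made up for illustration. The two calls target different entities (dataFlow and dataJob), but the ownership aspect has the same shape in both.

```python
import json

# Aspect value sent by plugin 0.13.0 (rejected by GMS 0.12.1).
rejected_aspect = (
    '{"owners": [{"owner": "urn:li:corpuser:SomeUser", "type": "DEVELOPER", '
    '"source": {"type": "SERVICE"}}], "ownerTypes": {}, '
    '"lastModified": {"time": 0, "actor": "urn:li:corpuser:airflow"}}'
)

# Aspect value sent by plugin 0.12.1.5 (accepted).
accepted_aspect = (
    '{"owners": [{"owner": "urn:li:corpuser:SomeUser", "type": "DEVELOPER", '
    '"source": {"type": "SERVICE"}}], '
    '"lastModified": {"time": 0, "actor": "urn:li:corpuser:airflow"}}'
)


def extra_top_level_fields(new_json: str, old_json: str) -> list:
    """Return top-level keys present in new_json but absent from old_json."""
    return sorted(set(json.loads(new_json)) - set(json.loads(old_json)))


extra = extra_top_level_fields(rejected_aspect, accepted_aspect)
print(extra)
```

The only extra field is ownerTypes, which is the field named in the 422 error.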


gabrielwry commented 3 months ago

We are facing the same issue, so I guess upgrading the GMS to 0.13 will solve it?

younesidhamou commented 2 months ago

We're encountering the same issue. Has anyone found a solution?

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io