datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0
DataProcessInstance delete doesn't work (crash) #10538

Open obaltian opened 1 month ago

obaltian commented 1 month ago

Describe the bug It's impossible to delete DataProcessInstance objects in bulk (by filtering on entity type). The command either raises an error or finds nothing, depending on whether an additional filter (e.g. --platform airflow) is provided.

Only deleting by --urn works, which isn't convenient for managing DataHub content.
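Since the `--urn` path still works, it can be scripted as a stopgap. The following is a minimal sketch, assuming the run URNs are already known (for example, recorded when the runs were emitted). The helper name `build_delete_commands` and the URNs are hypothetical, and `--force` / `--hard` are assumed to behave as described in `datahub delete --help`. The script only prints the commands; nothing is deleted.

```python
import shlex
from typing import Iterable, List


def build_delete_commands(urns: Iterable[str], hard: bool = False) -> List[List[str]]:
    """Build one `datahub delete --urn ...` argv list per URN (hypothetical helper)."""
    commands = []
    for urn in urns:
        # --force is assumed to skip the interactive confirmation prompt.
        argv = ["datahub", "delete", "--urn", urn, "--force"]
        if hard:
            # --hard is assumed to request a hard delete instead of a soft delete.
            argv.append("--hard")
        commands.append(argv)
    return commands


# Placeholder URNs; real ones must come from wherever the runs were emitted.
example_urns = [
    "urn:li:dataProcessInstance:example-run-1",
    "urn:li:dataProcessInstance:example-run-2",
]

for argv in build_delete_commands(example_urns):
    # Print instead of executing; pass argv to subprocess.run(argv, check=True) to run it.
    print(shlex.join(argv))
```

This does not replace filter-based deletion: it still requires knowing every URN up front, which is exactly the inconvenience described above.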

To Reproduce

  1. Deploy DataHub locally:

    datahub docker quickstart
  2. Ingest a sample job & its "start" event:

    import time

    from datahub.api.entities.datajob import DataFlow, DataJob
    from datahub.api.entities.dataprocess.dataprocess_instance import DataProcessInstance
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

    # Connect to the local quickstart instance.
    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    # Emit a flow, a job inside it, and one run (DataProcessInstance) of that job.
    flow = DataFlow(env="prod", orchestrator="airflow", id="flow_api_simple")
    flow.emit(graph)
    job = DataJob(flow_urn=flow.urn, id="job1", name="My Job 1")
    job.emit(graph)
    run = DataProcessInstance.from_datajob(datajob=job, id=f"{flow.id}-1")
    run.emit(graph)

    # Optional: the DataProcessInstance is created even without the start event.
    run.emit_process_start(graph, int(time.time() * 1000))

  3. Try to delete info about the job run using the CLI:
```sh
datahub delete --platform airflow --entity-type dataProcessInstance
# outputs: 
[2024-05-18 18:54:15,396] INFO     {datahub.cli.delete_cli:341} - Using DataHubGraph: # configured to talk to http://localhost:8080
Found no urns to delete. Maybe you want to change your filters to be something different?

datahub delete --entity-type dataProcessInstance
# outputs
[2024-05-18 18:53:54,557] INFO     {datahub.cli.delete_cli:341} - Using DataHubGraph: configured to talk to http://localhost:8080
[2024-05-18 18:53:55,059] ERROR    {datahub.entrypoints:205} - Command failed: Error executing graphql query: [{'message': "The field at path '/scrollAcrossEntities/searchResults[0]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 0, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[1]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 1, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[2]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 2, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[3]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  
The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 3, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[4]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 4, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[5]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 5, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[6]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. 
The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 6, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[7]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 7, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[8]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 8, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[9]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 9, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[10]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value. 
 The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 10, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[11]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 11, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[12]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 12, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[13]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 13, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}]
Traceback (most recent call last):
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/datahub/entrypoints.py", line 192, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/datahub/upgrade/upgrade.py", line 396, in async_wrapper
    loop.run_until_complete(run_func_check_upgrade())
  File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/datahub/upgrade/upgrade.py", line 383, in run_func_check_upgrade
    ret = await main_func_future
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/datahub/upgrade/upgrade.py", line 378, in run_inner_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
    raise e
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/datahub/cli/delete_cli.py", line 367, in by_filter
    urns = list(
           ^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/datahub/ingestion/graph/client.py", line 782, in get_urns_by_filter
    for entity in self._scroll_across_entities(graphql_query, variables):
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/datahub/ingestion/graph/client.py", line 795, in _scroll_across_entities
    response = self.execute_graphql(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/obaltian/maklai/datacatalog-ingestion/.venv/lib/python3.12/site-packages/datahub/ingestion/graph/client.py", line 883, in execute_graphql
    raise GraphError(f"Error executing graphql query: {result['errors']}")
datahub.configuration.common.GraphError: Error executing graphql query: [{'message': "The field at path '/scrollAcrossEntities/searchResults[0]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 0, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[1]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 1, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[2]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 2, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[3]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  
The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 3, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[4]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 4, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[5]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 5, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[6]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. 
The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 6, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[7]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 7, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[8]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 8, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[9]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 9, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[10]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value. 
 The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 10, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[11]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 11, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[12]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 12, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}, {'message': "The field at path '/scrollAcrossEntities/searchResults[13]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'", 'path': ['scrollAcrossEntities', 'searchResults', 13, 'entity'], 'extensions': {'classification': 'NullValueInNonNullableField'}}]
```

Expected behavior Step 3 from the section above should find and delete relevant DataProcessInstance objects.
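The error payload in step 3 is a list of dicts, one per search result, each with a `path` and an `extensions.classification`. A short sketch of how such a payload could be summarized to confirm that every returned search result has a null `entity`. The function name `summarize_graphql_errors` is hypothetical, and the sample entries below copy only the shape of the reported errors, with messages shortened to a placeholder.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def summarize_graphql_errors(errors: List[Dict[str, Any]]) -> Tuple[Counter, List[int]]:
    """Count errors per classification and collect the failing searchResults indices."""
    classifications: Counter = Counter()
    null_entity_indices: List[int] = []
    for error in errors:
        classification = error.get("extensions", {}).get("classification", "unknown")
        classifications[classification] += 1
        path = error.get("path", [])
        # Paths look like ['scrollAcrossEntities', 'searchResults', <index>, 'entity'].
        if len(path) == 4 and path[1] == "searchResults" and path[3] == "entity":
            null_entity_indices.append(path[2])
    return classifications, null_entity_indices


# Two entries shaped like the ones in the report (messages are placeholders).
sample_errors = [
    {
        "message": "(shortened)",
        "path": ["scrollAcrossEntities", "searchResults", 0, "entity"],
        "extensions": {"classification": "NullValueInNonNullableField"},
    },
    {
        "message": "(shortened)",
        "path": ["scrollAcrossEntities", "searchResults", 1, "entity"],
        "extensions": {"classification": "NullValueInNonNullableField"},
    },
]

counts, indices = summarize_graphql_errors(sample_errors)
print(dict(counts), indices)
```

Applied to the full payload above, the indices would run from 0 to 13, i.e. the search found 14 results but the server could not resolve any of them to an entity.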

Screenshots

Screenshot 2024-05-18 at 19 05 36

Additional context We tried to find a workaround for this problem by providing additional arguments or by using GraphQL directly, but had no luck. Here is a thread from DataHub's Slack: https://datahubspace.slack.com/archives/C029A3M079U/p1715184338329459

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io