datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

Qlik Cloud ingestion not working for dashboards and sheets #11355

Open SindreKjetsa opened 1 month ago

SindreKjetsa commented 1 month ago

Describe the bug
Ingestion of Qlik Cloud data is not working properly. Spaces and a few text files are ingested, but dashboards and sheets are not.

To Reproduce
Steps to reproduce the behavior: ingest Qlik Cloud using the default recipe with a datahub-gms sink and an API token from a Qlik user who is tenant admin.

    source:
      type: qlik-sense
      config:
        tenant_hostname: hostname.eu.qlikcloud.com
        api_key: '${qlik_api_token}'
        ingest_owner: true
    pipeline_name: qlik_sense_ingestion_pipeline
    sink:
      type: datahub-rest
      config:
        server: 'http://datahub-gms:8080'

Expected behavior
Dashboards and sheets should also be ingested into DataHub.

Additional context
The run ingests 57 spaces and then ends. Log output from datahub-actions:

[2024-09-11 10:49:36,108] INFO {datahub.ingestion.run.pipeline:296} - Source configured successfully.
[2024-09-11 10:49:36,109] INFO {datahub.cli.ingest_cli:130} - Starting metadata ingestion
[2024-09-11 10:49:36,109] INFO {datahub.ingestion.source.qlik_sense.qlik_sense:602} - Qlik Sense plugin execution is started
[2024-09-11 10:49:36,780] WARNING {datahub.ingestion.source.qlik_sense.qlik_api:43} - Unable to fetch spaces. Exception: 1 validation error for Space root time data '2024-06-10T11:49:01Z' does not match format '%Y-%m-%dT%H:%M:%S.%fZ' (type=value_error)
[2024-09-11 10:49:36,781] INFO {datahub.ingestion.source.qlik_sense.qlik_sense:180} - Number of spaces = 57
[2024-09-11 10:49:36,781] INFO {datahub.ingestion.source.qlik_sense.qlik_sense:182} - Number of allowed spaces = 57
[2024-09-11 10:49:37,436] WARNING {datahub.ingestion.source.qlik_sense.qlik_api:43} - Unable to fetch items. Exception: 'personal-space-id'
[2024-09-11 10:49:37,444] INFO {datahub.ingestion.run.pipeline:570} - Processing commit request for DatahubIngestionCheckpointingProvider. Commit policy = CommitPolicy.ALWAYS, has_errors=False, has_warnings=False
[2024-09-11 10:49:37,474] WARNING {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:95} - No state available to commit for DatahubIngestionCheckpointingProvider
[2024-09-11 10:49:37,497] INFO {datahub.ingestion.run.pipeline:590} - Successfully committed changes for DatahubIngestionCheckpointingProvider.
[2024-09-11 10:49:37,520] INFO {datahub.ingestion.reporting.file_reporter:54} - Wrote SUCCESS report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/0cdf09e9-f0ff-4096-8da5-5033f49f75f7/ingestion_report.json' mode='w' encoding='UTF-8'>
[2024-09-11 10:49:38,666] INFO {datahub.cli.ingest_cli:143} - Finished metadata ingestion

henningwold commented 1 month ago

We have been debugging this for a while. It turns out the JSON for one of the spaces looks like this (some irrelevant details redacted):

    {
      "id": "idhash",
      "type": "managed",
      "ownerId": "ownerhash",
      "tenantId": "tenantId",
      "name": "Name",
      "description": "Description",
      "meta": {
        "actions": ["change_owner", "create", "delete", "read", "update"],
        "roles": [],
        "assignableRoles": [
          "basicconsumer",
          "consumer",
          "contributor",
          "dataconsumer",
          "facilitator",
          "operator",
          "publisher"
        ]
      },
      "links": {
        "self": {
          "href": "https://ourworkspace.eu.qlikcloud.com/api/v1/spaces/spaceid"
        },
        "assignments": {
          "href": "https://ourworkspace.eu.qlikcloud.com/api/v1/spaces/spaceid/assignments"
        }
      },
      "createdAt": "2023-09-28T11:14:51.92Z",
      "createdBy": "createdbyhash",
      "updatedAt": "2024-06-10T11:49:01Z"
    }

Note that the "updatedAt" value somehow has no fractional-seconds part, unlike "createdAt". This breaks the parsing at https://github.com/datahub-project/datahub/blob/c3e53a110160631f7b4948f385a9a90ed094467b/metadata-ingestion/src/datahub/ingestion/source/qlik_sense/data_classes.py#L94, which accepts only one exact format, and that format requires fractional seconds. I am unsure whether this is a bug in Qlik or in DataHub.
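The mismatch can be reproduced outside DataHub with the standard library alone. This is a minimal sketch, assuming the format string quoted in the warning from the ingestion log is the one applied to both fields; `parses_strictly` is a hypothetical helper, not a DataHub function.

```python
from datetime import datetime

# Format string copied from the warning in the ingestion log.
STRICT_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parses_strictly(value: str) -> bool:
    """Return True if value matches STRICT_FORMAT exactly (hypothetical helper)."""
    try:
        datetime.strptime(value, STRICT_FORMAT)
        return True
    except ValueError:
        return False


# "createdAt" from the space JSON: has a fractional part, so %f can match.
created_ok = parses_strictly("2023-09-28T11:14:51.92Z")
# "updatedAt" from the space JSON: no "." and no fraction, so the format fails.
updated_ok = parses_strictly("2024-06-10T11:49:01Z")
print(created_ok, updated_ok)
```

The `%f` directive accepts one to six digits when parsing, which is why the two-digit fraction in "createdAt" is fine while the missing fraction in "updatedAt" is not.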

henningwold commented 1 month ago

As an added bonus, the code at https://github.com/datahub-project/datahub/blob/c3e53a110160631f7b4948f385a9a90ed094467b/metadata-ingestion/src/datahub/ingestion/source/qlik_sense/qlik_api.py#L67 adds the personal space only after all other spaces have been added. Because the spaces fetch hit an exception before reaching that point, the personal space was never added to the spaces holder in the first place, so the item fetch also terminates as soon as it encounters data belonging to any personal space.
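A simplified sketch of that failure chain follows. It is not the actual DataHub code: `parse_space`, `fetch_spaces`, and `space_names` are hypothetical names, and the assumption is that the personal space is registered inside the same `try` block, after every regular space has been parsed.

```python
from typing import Dict, List

PERSONAL_SPACE_ID = "personal-space-id"


def parse_space(raw: dict) -> dict:
    # Stand-in for model validation: reject a timestamp without a fractional part.
    if "." not in raw["updatedAt"]:
        raise ValueError(f"time data {raw['updatedAt']!r} does not match format")
    return raw


def fetch_spaces(raw_spaces: List[dict]) -> Dict[str, str]:
    space_names: Dict[str, str] = {}
    try:
        for raw in raw_spaces:
            space = parse_space(raw)
            space_names[space["id"]] = space["name"]
        # Only reached if every space above parsed without error.
        space_names[PERSONAL_SPACE_ID] = "personal_space"
    except Exception as e:
        print(f"Unable to fetch spaces. Exception: {e}")
    return space_names


space_names = fetch_spaces(
    [
        {"id": "a", "name": "A", "updatedAt": "2023-09-28T11:14:51.92Z"},
        {"id": "b", "name": "B", "updatedAt": "2024-06-10T11:49:01Z"},
    ]
)

# A later lookup for an item that lives in a personal space now fails.
try:
    space_names[PERSONAL_SPACE_ID]
    lookup_error = None
except KeyError as e:
    lookup_error = str(e)  # str() of a KeyError is the quoted key
print(lookup_error)
```

Under these assumptions the second failure surfaces as a bare quoted key, which is consistent with the "Unable to fetch items. Exception: 'personal-space-id'" warning in the log.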

henningwold commented 1 month ago

For the record: we seem to have solved the issue, at least temporarily, by changing the description back and forth to touch the updatedAt field. This would naturally have been much harder if the rogue timestamp had been in the createdAt field instead.

hsheth2 commented 1 month ago

Looks like we should probably be a bit more lenient with our timestamp parsing logic. I suspect that we should use something like dateutil.parser.parse instead of strictly parsing with a specific format.
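As an illustration of more lenient parsing, here is a standard-library sketch that tries a second format without fractional seconds. It is a stand-in for the `dateutil` approach rather than the fix that was made in DataHub; `parse_timestamp` and `ACCEPTED_FORMATS` are hypothetical names.

```python
from datetime import datetime, timezone

# Accepted layouts, most specific first. Both end in a literal "Z" (UTC).
ACCEPTED_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%fZ",
    "%Y-%m-%dT%H:%M:%SZ",
)


def parse_timestamp(value: str) -> datetime:
    """Parse a UTC timestamp with or without fractional seconds (hypothetical helper)."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unsupported timestamp: {value!r}")


# The two timestamps from the space JSON earlier in the thread.
created = parse_timestamp("2023-09-28T11:14:51.92Z")
updated = parse_timestamp("2024-06-10T11:49:01Z")
print(created.isoformat(), updated.isoformat())
```

`dateutil.parser.parse` also accepts both of these strings and would replace the format loop with a single call, at the cost of accepting many looser inputs as well.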

We made a similar change for dbt in https://github.com/datahub-project/datahub/pull/10223. @henningwold, would you be open to creating a PR to fix that code in the Qlik source?