datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

Ingestion for superset failed on LDAP #10566

Open fzhan opened 1 month ago

fzhan commented 1 month ago

Describe the bug Superset is set up with AzureAD only. I tried to provide both a username and password for AzureAD, with ldap as the provider, but ingestion failed with the following message: datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (superset): 'access_token'

To Reproduce Steps to reproduce the behavior:

  1. Superset has been set up with AzureAD-only access
  2. Configure the ingestion with:
    source:
      type: superset
      config:
        connect_uri: 'http://superset.data-platform:8088'
        display_uri: 'https://bi.company'
        username: user@company
        password: password-from-ad
        provider: ldap
  3. Run the ingestion
  4. See error
    [2024-05-22 02:56:36,069] DEBUG    {datahub.entrypoints:206} - Python version: 3.10.13 (main, Jan 17 2024, 06:53:56) [GCC 12.2.0] at /tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/bin/python3 on Linux-5.15.0-94-generic-x86_64-with-glibc2.36
    [2024-05-22 02:56:36,069] DEBUG    {datahub.entrypoints:211} - GMS config {'models': {}, 'patchCapable': True, 'versions': {'acryldata/datahub': {'version': 'v0.13.2', 'commit': '0a8ec376b7c6963772a167e08837dce8b480af7c'}}, 'managedIngestion': {'defaultCliVersion': '0.13.1.2', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'timeZone': 'GMT', 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'prod'}, 'noCode': 'true'}
    [exec_id=abd62b81-d969-485e-b0cb-d17ca27cb888] 2024-05-22 03:59:28.099623 INFO: Starting execution for task with name=RUN_INGEST
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] Obtaining venv creation lock...
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] Acquired venv creation lock
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] venv is already set up
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] venv setup time = 0 sec
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] This version of datahub supports report-to functionality
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] + exec datahub --debug ingest run -c /tmp/datahub/ingest/abd62b81-d969-485e-b0cb-d17ca27cb888/recipe.yml --report-to /tmp/datahub/ingest/abd62b81-d969-485e-b0cb-d17ca27cb888/ingestion_report.json
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:29,546] DEBUG    {datahub.telemetry.telemetry:286} - Sending init Telemetry
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,217] DEBUG    {datahub.telemetry.telemetry:315} - Sending telemetry for function-call
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,513] INFO     {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.1.2
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,589] DEBUG    {datahub.ingestion.sink.datahub_rest:111} - Setting env variables to override config
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,590] DEBUG    {datahub.ingestion.sink.datahub_rest:113} - Setting gms config
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,590] DEBUG    {datahub.ingestion.run.pipeline:238} - Sink type datahub-rest (<class 'datahub.ingestion.sink.datahub_rest.DatahubRestSink'>) configured
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,590] INFO     {datahub.ingestion.run.pipeline:239} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-datahub-gms:8080
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,597] DEBUG    {datahub.ingestion.run.pipeline:313} - Reporter type:file,<class 'datahub.ingestion.reporting.file_reporter.FileReporter'> configured.
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,620] INFO     {datahub.ingestion.source.state.stateful_ingestion_base:241} - Stateful ingestion will be automatically enabled, as datahub-rest sink is used or `datahub_api` is specified
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,630] DEBUG    {datahub.ingestion.source.state.stateful_ingestion_base:286} - Successfully created datahub state provider.
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,671] DEBUG    {datahub.telemetry.telemetry:315} - Sending telemetry for function-call
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,975] ERROR    {datahub.entrypoints:201} - Command failed: Failed to configure the source (superset): 'access_token'
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] Traceback (most recent call last):
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 121, in _add_init_error_context
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     yield
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 252, in __init__
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     self.source = source_class.create(
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/superset.py", line 220, in create
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     return cls(ctx, config)
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/superset.py", line 193, in __init__
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     self.access_token = login_response.json()["access_token"]
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] KeyError: 'access_token'
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] 
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] The above exception was the direct cause of the following exception:
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] 
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] Traceback (most recent call last):
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/entrypoints.py", line 188, in main
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     sys.exit(datahub(standalone_mode=False, **kwargs))
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     return self.main(*args, **kwargs)
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1078, in main
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     rv = self.invoke(ctx)
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     return _process_result(sub_ctx.command.invoke(sub_ctx))
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     return _process_result(sub_ctx.command.invoke(sub_ctx))
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     return ctx.invoke(self.callback, **ctx.params)
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     return __callback(*args, **kwargs)
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     raise e
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     res = func(*args, **kwargs)
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 201, in run
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     return future.result()
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 170, in run_ingestion_and_check_upgrade
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     pipeline = Pipeline.create(
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 363, in create
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     return cls(
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 251, in __init__
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     with _add_init_error_context(f"configure the source ({source_type})"):
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     self.gen.throw(typ, value, traceback)
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]   File "/tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 123, in _add_init_error_context
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs]     raise PipelineInitError(f"Failed to {step}: {e}") from e
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (superset): 'access_token'
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,978] DEBUG    {datahub.entrypoints:203} - DataHub CLI version: 0.13.1.2 at /tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/__init__.py
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,978] DEBUG    {datahub.entrypoints:206} - Python version: 3.10.13 (main, Jan 17 2024, 06:53:56) [GCC 12.2.0] at /tmp/datahub/ingest/venv-superset-2b9c1ab97dc6cd7f/bin/python3 on Linux-5.15.0-94-generic-x86_64-with-glibc2.36
    [abd62b81-d969-485e-b0cb-d17ca27cb888 logs] [2024-05-22 03:59:30,978] DEBUG    {datahub.entrypoints:211} -

Expected behavior Provide suggestions for Superset ingestion when ldap is the only enabled access method.

Additional context DataHub is running on a local k8s cluster, with Superset in another namespace.
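
The traceback above ends in a bare `KeyError: 'access_token'`, raised when the parsed login response has no such key, so whatever Superset actually answered is not shown. A minimal sketch of a more defensive lookup follows, assuming the response body has already been parsed into a dict. The helper `extract_access_token` and the `LoginError` class are hypothetical names for illustration and are not part of DataHub.

```python
from typing import Any, Mapping


class LoginError(RuntimeError):
    """Hypothetical error type for a login response that carries no token."""


def extract_access_token(payload: Mapping[str, Any]) -> str:
    """Return the access token, or raise an error that shows what came back."""
    token = payload.get("access_token")
    if not isinstance(token, str) or not token:
        # Report only the keys: the values may contain sensitive data.
        keys = sorted(payload.keys())
        raise LoginError(
            f"Login response has no 'access_token'; keys received: {keys}"
        )
    return token


if __name__ == "__main__":
    print(extract_access_token({"access_token": "abc", "refresh_token": "def"}))
    try:
        extract_access_token({"message": "Not authorized"})
    except LoginError as err:
        print(err)
```

With a check like this, a rejected login would surface the shape of the server's reply rather than only the missing key name.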

fzhan commented 1 month ago

There's a piece of code written for Superset to take thumbnail images that bypasses the authentication. Perhaps we can leverage the same code?

    from superset.utils.urls import headless_url
    from superset.utils.machine_auth import MachineAuthProvider

    def auth_driver(driver, user):
        # Setting cookies requires doing a request first, but /login is redirected to oauth provider, and stuck there.
        driver.get(headless_url("/doesnotexist"))

        cookies = MachineAuthProvider.get_auth_cookies(user)

        for cookie_name, cookie_val in cookies.items():
            driver.add_cookie(dict(name=cookie_name, value=cookie_val))

        return driver
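
If cookies like the ones returned by `MachineAuthProvider.get_auth_cookies` were available to a client, they could be sent with API requests in place of a bearer token. The sketch below shows only that step, using the Python standard library. It assumes the cookies arrive as a plain name-to-value dict; the helper name `build_request_with_cookies`, the cookie name and value, and the endpoint path are illustrative assumptions. Whether an ingestion job running outside Superset could obtain such cookies is not established in this thread.

```python
import urllib.request
from typing import Mapping


def build_request_with_cookies(
    url: str, cookies: Mapping[str, str]
) -> urllib.request.Request:
    """Build a request that carries the given cookies in one Cookie header."""
    # Join the cookies as "name=value; name2=value2".
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    request = urllib.request.Request(url)
    request.add_header("Cookie", cookie_header)
    return request


if __name__ == "__main__":
    # Hypothetical cookie and endpoint, for illustration only; no request is sent.
    req = build_request_with_cookies(
        "http://superset.data-platform:8088/api/v1/dashboard/",
        {"session": "example-value"},
    )
    print(req.get_header("Cookie"))
```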

github-actions[bot] commented 1 week ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io