datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

Self-Signed Certificate Verification Error #10783

Open craigbosco opened 2 months ago

craigbosco commented 2 months ago

Describe the bug

I am trying to add a Tableau server ingestion source. The server is hosted on an internal network and uses HTTPS with a self-signed certificate. The DataHub instance is the Quickstart container running locally.

The ingestion run fails with the following error:

Unable to login (check your Tableau connection and credentials): HTTPSConnectionPool(host='${TABLEAU_HOST}', port=443): Max retries exceeded with url: /api/2.4/auth/signin (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))

To Reproduce

Recipe:

source:
    type: tableau
    config:
        connect_uri: 'https://${TABLEAU_HOST}'
        stateful_ingestion:
            enabled: true
        ingest_owner: true
        ingest_tags: true
        username: cbosco
        password: '${tableau_cbosco}'
sink:
    type: datahub-rest
    config:
        server: 'http://datahub-gms:8080'

  1. Go to Ingestion
  2. Create a source from the recipe above
  3. Execute

Additional context

Full logs are here:


Execution finished with errors.
{'exec_id': 'f595a55a-0169-4c3e-9864-639debce590e',
 'infos': ['2024-06-26 14:57:20.044957 INFO: Starting execution for task with name=RUN_INGEST',
           "2024-06-26 14:57:30.118957 INFO: Failed to execute 'datahub ingest', exit code 1",
           '2024-06-26 14:57:30.119043 INFO: Caught exception EXECUTING task_id=f595a55a-0169-4c3e-9864-639debce590e, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 140, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 282, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': []}

~~~~ Ingestion Report ~~~~
{
  "cli": {
    "cli_version": "0.13.3rc1",
    "cli_entry_location": "/tmp/datahub/ingest/venv-tableau-03575587e416950c/lib/python3.10/site-packages/datahub/__init__.py",
    "models_version": "bundled",
    "py_version": "3.10.13 (main, Jan 17 2024, 05:40:33) [GCC 12.2.0]",
    "py_exec_path": "/tmp/datahub/ingest/venv-tableau-03575587e416950c/bin/python3",
    "os_details": "Linux-6.4.16-linuxkit-aarch64-with-glibc2.36",
    "mem_info": "72.92 MB",
    "peak_memory_usage": "72.92 MB",
    "disk_info": {
      "total": "62.67 GB",
      "used": "24.88 GB",
      "used_initally": "24.88 GB",
      "free": "34.58 GB"
    },
    "peak_disk_usage": "24.88 GB",
    "thread_count": 1,
    "peak_thread_count": 1
  },
  "source": {
    "type": "tableau",
    "report": {
      "events_produced": 0,
      "events_produced_per_sec": 0,
      "entities": {},
      "aspects": {},
      "aspect_urn_samples": {},
      "warnings": {},
      "failures": {
        "tableau-login": [
          "Unable to login (check your Tableau connection and credentials): HTTPSConnectionPool(host='${TABLEAU_HOST}', port=443): Max retries exceeded with url: /api/2.4/auth/signin (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))"
        ]
      },
      "soft_deleted_stale_entities": [],
      "start_time": "2024-06-26 14:57:21.563443 (6.6 seconds ago)",
      "running_time": "6.6 seconds"
    }
  },
  "sink": {
    "type": "datahub-rest",
    "report": {
      "total_records_written": 0,
      "records_written_per_second": 0,
      "warnings": [],
      "failures": [],
      "start_time": "2024-06-26 14:57:21.446230 (6.72 seconds ago)",
      "current_time": "2024-06-26 14:57:28.161440 (now)",
      "total_duration_in_seconds": 6.72,
      "max_threads": 15,
      "gms_version": "v0.13.3rc1",
      "pending_requests": 0
    }
  }
}

~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv is already set up
venv setup time = 0 sec
This version of datahub supports report-to functionality
+ exec datahub ingest run -c /tmp/datahub/ingest/f595a55a-0169-4c3e-9864-639debce590e/recipe.yml --report-to /tmp/datahub/ingest/f595a55a-0169-4c3e-9864-639debce590e/ingestion_report.json
[2024-06-26 14:57:21,442] INFO     {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.3rc1
[2024-06-26 14:57:21,448] INFO     {datahub.ingestion.run.pipeline:254} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-gms:8080
[2024-06-26 14:57:21,683] INFO     {tableauserverclient.server.server:178} - Could not get version info from server: <class 'requests.exceptions.SSLError'>HTTPSConnectionPool(host='${TABLEAU_HOST}', port=443): Max retries exceeded with url: /api/2.4/serverInfo (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))
[2024-06-26 14:57:21,684] INFO     {tableauserverclient.server.server:180} - versions: None, 2.4
[2024-06-26 14:57:28,153] ERROR    {datahub.ingestion.source.tableau:796} - tableau-login => Unable to login (check your Tableau connection and credentials): HTTPSConnectionPool(host='${TABLEAU_HOST}', port=443): Max retries exceeded with url: /api/2.4/auth/signin (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))
[2024-06-26 14:57:28,153] INFO     {datahub.ingestion.run.pipeline:276} - Source configured successfully.
[2024-06-26 14:57:28,153] INFO     {datahub.cli.ingest_cli:128} - Starting metadata ingestion
[2024-06-26 14:57:28,160] INFO     {datahub.ingestion.run.pipeline:529} - Processing commit request for DatahubIngestionCheckpointingProvider. Commit policy = CommitPolicy.ALWAYS, has_errors=True, has_warnings=False
[2024-06-26 14:57:28,160] WARNING  {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:95} - No state available to commit for DatahubIngestionCheckpointingProvider
[2024-06-26 14:57:28,160] INFO     {datahub.ingestion.run.pipeline:549} - Successfully committed changes for DatahubIngestionCheckpointingProvider.
[2024-06-26 14:57:28,161] INFO     {datahub.ingestion.reporting.file_reporter:54} - Wrote FAILURE report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/f595a55a-0169-4c3e-9864-639debce590e/ingestion_report.json' mode='w' encoding='UTF-8'>
[2024-06-26 14:57:28,161] INFO     {datahub.cli.ingest_cli:141} - Finished metadata ingestion

Cli report:
{'cli_version': '0.13.3rc1',
 'cli_entry_location': '/tmp/datahub/ingest/venv-tableau-03575587e416950c/lib/python3.10/site-packages/datahub/__init__.py',
 'models_version': 'bundled',
 'py_version': '3.10.13 (main, Jan 17 2024, 05:40:33) [GCC 12.2.0]',
 'py_exec_path': '/tmp/datahub/ingest/venv-tableau-03575587e416950c/bin/python3',
 'os_details': 'Linux-6.4.16-linuxkit-aarch64-with-glibc2.36',
 'mem_info': '72.92 MB',
 'peak_memory_usage': '72.92 MB',
 'disk_info': {'total': '62.67 GB', 'used': '24.88 GB', 'used_initally': '24.88 GB', 'free': '34.57 GB'},
 'peak_disk_usage': '24.88 GB',
 'thread_count': 1,
 'peak_thread_count': 1}
Source (tableau) report:
{'events_produced': 0,
 'events_produced_per_sec': 0,
 'entities': {},
 'aspects': {},
 'aspect_urn_samples': {},
 'warnings': {},
 'failures': {'tableau-login': ["Unable to login (check your Tableau connection and credentials): HTTPSConnectionPool(host='${TABLEAU_HOST}', port=443): Max retries exceeded with url: /api/2.4/auth/signin (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))"]},
 'soft_deleted_stale_entities': [],
 'start_time': '2024-06-26 14:57:21.563443 (6.82 seconds ago)',
 'running_time': '6.82 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 0,
 'records_written_per_second': 0,
 'warnings': [],
 'failures': [],
 'start_time': '2024-06-26 14:57:21.446230 (6.94 seconds ago)',
 'current_time': '2024-06-26 14:57:28.382283 (now)',
 'total_duration_in_seconds': 6.94,
 'max_threads': 15,
 'gms_version': 'v0.13.3rc1',
 'pending_requests': 0}

Pipeline finished with at least 1 failures; produced 0 events in 6.82 seconds.

craigbosco commented 2 months ago

I know that when I use tableauserverclient in Python, I have to add the following option to get the connection to work:

import os
import tableauserverclient as TSC

server = TSC.Server(os.getenv("TABLEAU_SERVER"), use_server_version=True)
# Skip TLS certificate verification (needed here because the certificate is self-signed)
server.add_http_options({"verify": False})

hsheth2 commented 2 months ago

@craigbosco have you tried using the ssl_verify config? https://datahubproject.io/docs/generated/ingestion/sources/tableau/
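
For reference, here is a rough sketch of the recipe from the report with that option applied. It assumes `ssl_verify` sits under `source.config` next to `connect_uri` and that `false` turns certificate verification off. My understanding is that it can also take a path to a `.pem` certificate bundle instead, which would keep verification on, but please confirm both points on the linked docs page for your version.

```yaml
source:
    type: tableau
    config:
        connect_uri: 'https://${TABLEAU_HOST}'
        # Assumption: disables TLS certificate verification for this source.
        # Only reasonable for a trusted internal server with a self-signed certificate.
        ssl_verify: false
        stateful_ingestion:
            enabled: true
        ingest_owner: true
        ingest_tags: true
        username: cbosco
        password: '${tableau_cbosco}'
sink:
    type: datahub-rest
    config:
        server: 'http://datahub-gms:8080'
```

Everything except the `ssl_verify` line and its comments is copied from the original recipe, so this is the counterpart of the `{"verify": False}` workaround above.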