datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

AWS Glue no longer works as schema registry option #8726

Closed jelledv closed 1 year ago

jelledv commented 1 year ago

Describe the bug
After upgrading to the latest Datahub version (0.10.5) from version 0.10.1 we cannot get the GMS backend up and running when using AWS_GLUE as the kafka schema registry option. I get the error:

2023-08-17 13:18:57,323 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-17 13:18:57,324 [pool-15-thread-1] ERROR c.l.m.boot.OnBootApplicationListener:76 - Failed to bootstrap DataHub, OpenAPI servlet was not ready after 30 seconds

You can see that the backend component tries to connect to localhost:8081, which is the port of the Confluent schema registry.

It looks like the "isSchemaRegistryAPIServeletReady" check is triggered by any Spring event originating from the WebApplicationContext. You can see this happening in the screenshot from the logs of the GMS component, and also in the code: https://github.com/datahub-project/datahub/blob/master/metadata-service/factories/[…]/java/com/linkedin/metadata/boot/OnBootApplicationListener.java. The check starts even before the "schemaRegistryServlet" is initialised. I also don't understand why this schemaRegistryServlet bean is registered at all: it carries the condition @ConditionalOnProperty(name = "kafka.schemaRegistry.type", havingValue = InternalSchemaRegistryFactory.TYPE), while I have the following environment variable set:

- name: KAFKA_SCHEMAREGISTRY_TYPE
  value: AWS_GLUE
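If the @ConditionalOnProperty guard behaved as expected, the readiness wait would simply be skipped for non-internal registries. A plain-Java sketch (no Spring; shouldWaitForInternalServlet is a hypothetical name, not DataHub's actual method) of the gating logic one would expect:

```java
public class SchemaRegistryGuard {
    static final String INTERNAL = "INTERNAL";

    // Only wait for the internal schema-registry servlet when the
    // configured registry type is INTERNAL; AWS_GLUE, CONFLUENT, etc.
    // should skip the localhost:8081 readiness probe entirely.
    static boolean shouldWaitForInternalServlet(String schemaRegistryType) {
        return INTERNAL.equalsIgnoreCase(schemaRegistryType);
    }

    public static void main(String[] args) {
        System.out.println(shouldWaitForInternalServlet("AWS_GLUE"));  // false
        System.out.println(shouldWaitForInternalServlet("INTERNAL"));  // true
    }
}
```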

The Datahub system update job, as well as the MAE and MCE consumers, were also failing initially, because the Glue configuration in the values.yml was commented out:

global:
  kafka:
    bootstrap:
      server: ${aws_msk_cluster.datahub.bootstrap_brokers}
    zookeeper:
      server: ${aws_msk_cluster.datahub.zookeeper_connect_string}
    partitions: 3
    replicationFactor: 2
    schemaregistry:
      type: AWS_GLUE
      glue:
        region: ${var.aws_region}
        registry: ${aws_glue_registry.kafka.registry_name}

We could fix those 3 components by providing the Spring configuration property directly with environment variables:

  extraEnvs: #Workaround for bug: https://datahubspace.slack.com/archives/CV2UVAPPG/p1690545008697309
    - name: KAFKA_SCHEMAREGISTRY_AWSGLUE_REGISTRYNAME
      value: ${aws_glue_registry.kafka.registry_name}
    - name: KAFKA_SCHEMAREGISTRY_AWSGLUE_REGION
      value: ${var.aws_region}
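The env-var names above follow Spring Boot's relaxed binding: dots become underscores, dashes are dropped, and the name is upper-cased, so kafka.schemaRegistry.awsGlue.registryName maps to KAFKA_SCHEMAREGISTRY_AWSGLUE_REGISTRYNAME. A rough sketch of that mapping (toEnvVar is a hypothetical helper illustrating the convention, not a Spring API):

```java
public class RelaxedBindingDemo {
    // Mimics Spring Boot's relaxed binding rule for environment
    // variables: '.' -> '_', '-' removed, everything upper-cased.
    static String toEnvVar(String propertyKey) {
        return propertyKey.replace(".", "_").replace("-", "").toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(toEnvVar("kafka.schemaRegistry.type"));
        // KAFKA_SCHEMAREGISTRY_TYPE
        System.out.println(toEnvVar("kafka.schemaRegistry.awsGlue.registryName"));
        // KAFKA_SCHEMAREGISTRY_AWSGLUE_REGISTRYNAME
    }
}
```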

However, the above workaround did not work for the GMS backend component.

It looks like the AWS Glue Schema registry option was commented out in the application.yml properties file: link to code
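For those environment variables to bind, GMS's application.yml needs matching property keys. The fragment below is a sketch inferred from the environment-variable names used in the workaround above; the exact keys and defaults in the real file may differ:

```yaml
kafka:
  schemaRegistry:
    # Selected via KAFKA_SCHEMAREGISTRY_TYPE (e.g. INTERNAL, AWS_GLUE)
    type: ${KAFKA_SCHEMAREGISTRY_TYPE:INTERNAL}
    awsGlue:
      # Bound from KAFKA_SCHEMAREGISTRY_AWSGLUE_REGION and
      # KAFKA_SCHEMAREGISTRY_AWSGLUE_REGISTRYNAME
      region: ${KAFKA_SCHEMAREGISTRY_AWSGLUE_REGION:}
      registryName: ${KAFKA_SCHEMAREGISTRY_AWSGLUE_REGISTRYNAME:}
```

If these keys are commented out, Spring has nothing to bind the environment variables to, which matches the behaviour described above.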

To Reproduce
Steps to reproduce the behavior:

  1. Deploy Datahub helm chart with a version above v0.10.3
  2. Use AWS_GLUE as schema registry
  3. Use the extraEnvs workaround to prevent the error in most components
  4. See error in GMS component

Expected behavior
The GMS backend runs correctly.

Screenshots
Here is a screenshot where I see the "isSchemaRegistryAPIServeletReady" check getting called, despite not using the "INTERNAL" schema registry option. I think this is not correct. [screenshot of GMS logs]

Additional context
Helm chart version: 0.2.181

shicholas commented 1 year ago

I think I’m experiencing the same error; comments on Slack indicate we might need to use the Confluent schema registry instead.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 30 days since being marked as stale.