datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0
9.91k stars 2.94k forks source link

SchemaFieldEntities seem to cause performance issues #11417

Closed guentherhackl-wgs closed 1 month ago

guentherhackl-wgs commented 1 month ago

Describe the bug The symptom of the issue was that the schema of a kafka topic would not load and after ~30s it dispayed "No Data" for the schema although the schemaMetadataAspect was visible in the database.

In the GMS Logs I could see that the graphQL query to retrieve the schemaMetadata caused an async timeout (and still finished after 200s)

2024-09-18 12:40:20,823 [qtp1213216872-8203] INFO  c.datahub.graphql.GraphQLController:125 - Processing request, operation: getSchemaBlame, actor urn: urn:li:corpuser:guenther.hackl
2024-09-18 12:40:20,824 [ForkJoinPool.commonPool-worker-4154] INFO  c.datahub.graphql.GraphQLController:130 - Executing operation getSchemaBlame for qtp1213216872-8203
2024-09-18 12:40:20,847 [ForkJoinPool.commonPool-worker-4159] INFO  c.datahub.graphql.GraphQLController:153 - Executed operation getVersionedDataset in 26 ms
2024-09-18 12:40:20,849 [ForkJoinPool.commonPool-worker-4159] INFO  c.datahub.graphql.GraphQLController:158 - Operation getVersionedDataset execution result size: 90772
2024-09-18 12:40:20,874 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:61 - Successfully fed bulk request 3341. Number of events: 6 Took time ms: 12
2024-09-18 12:40:21,033 [ForkJoinPool.commonPool-worker-4154] INFO  c.datahub.graphql.GraphQLController:153 - Executed operation getSchemaBlame in 208 ms
2024-09-18 12:40:21,034 [ForkJoinPool.commonPool-worker-4154] INFO  c.datahub.graphql.GraphQLController:158 - Operation getSchemaBlame execution result size: 101855
2024-09-18 12:40:50,507 [qtp1213216872-8294] WARN  o.s.w.s.m.s.DefaultHandlerExceptionResolver:247 - Resolved [org.springframework.web.context.request.async.AsyncRequestTimeoutException]
2024-09-18 12:42:40,231 [ForkJoinPool.commonPool-worker-22] INFO  c.datahub.graphql.GraphQLController:153 - Executed operation getDatasetSchema in 100594 ms
2024-09-18 12:42:40,280 [ForkJoinPool.commonPool-worker-22] INFO  c.datahub.graphql.GraphQLController:158 - Operation getDatasetSchema execution result size: 226375

No resource restrictions showed up on any graph in our graphana. There were seemingly enough cpu and ram for GMS and OpenSearch. THe times for the other request don't look that bad to me.

I've gone through the graphql query and deleted parts to narrow down the issue and was able to find the following minimal query to produce it in my setup, which indicated the involvement of the SchemaFieldEntities:

query getDatasetSchema($urn: String!) {
  dataset(urn: $urn) {
    ...datasetSchema
  }
}

fragment datasetSchema on Dataset {
  schemaMetadata(version: 0) {
    ...schemaMetadataFields
  }
}

fragment schemaMetadataFields on SchemaMetadata {
  fields {
    ...schemaFieldFields
  }
}  

fragment schemaFieldFields on SchemaField {

  schemaFieldEntity {
    urn
    type
  }
}

After setting the feature flag for SchemaFieldEntities to false (SCHEMA_FIELD_ENTITY_FETCH_ENABLED=false) it significantly improved.

There seem to be two factors which have an effect on this:

To Reproduce Just loading the affected large schema to a local datahub did not produce the issue. The number of entities seem to be important to. I have not yet tried it, but I would assume to reproduce a significant amount of datasets is needed locally

Expected behavior Show the schema when it is verifiably in the database

Desktop (please complete the following information):

david-leifker commented 1 month ago

We'll get a PR up to improve the performance of the graphql call. There has been at least one possible mitigation to improve performance in this case to reduce redundant access checks in #11327 part of v0.14.1.

david-leifker commented 1 month ago

https://github.com/datahub-project/datahub/pull/11434 should hopefully fix this issue.