datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

datahub performance #11671

Open pilipyukaaa opened 1 month ago

pilipyukaaa commented 1 month ago

Hello, I have a performance problem with the process that consumes messages from Kafka and pushes changes to Elasticsearch and Neo4j. I added these environment variables to my GMS:

  extraEnvs:
    - name: SPRING_KAFKA_PROPERTIES_MAX_POLL_RECORDS
      value: '10'
    - name: SPRING_KAFKA_PROPERTIES_MAX_POLL_INTERVAL_MS
      value: '120000'
    - name: ES_BULK_REQUESTS_LIMIT
      value: '1500'
    - name: ES_BULK_FLUSH_PERIOD
      value: '2'
    - name: LOGGING_LEVEL_ORG_APACHE_KAFKA_CLIENTS_CONSUMER
      value: DEBUG
    - name: LOGGING_LEVEL_ORG_SPRINGFRAMEWORK_KAFKA
      value: DEBUG
    - name: ELASTICSEARCH_THREAD_COUNT
      value: '15'
    - name: ES_BULK_ENABLE_BATCH_DELETE
      value: 'true'
2024-10-18 09:01:22,092 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:61 - Successfully fed bulk request 172. Number of events: 5 Took time ms: 3
2024-10-18 09:01:40,463 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook IncidentsSummaryHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,463 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook IngestionSchedulerHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,463 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook EntityChangeEventGeneratorHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,463 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook SiblingAssociationHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,463 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:109 - Urn urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD) with aspect upstreamLineage received by Sibling Hook.
2024-10-18 09:01:40,467 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:244 - Associating urn:li:dataset:(urn:li:dataPlatform:dbt,_bdm.bdm_dim_opportunity_view_final,PROD) and urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD) as siblings.
2024-10-18 09:01:40,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook FormAssignmentHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:137 - Successfully completed MCL hooks for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:82 - Got MCL event key: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD), topic: MetadataChangeLog_Versioned_v1, partition: 0, offset: 119678, value size: 143224, timestamp: 1729168196437
2024-10-18 09:01:40,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:106 - Invoking MCL hooks for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD), aspect name: upstreamLineage, entity type: dataset, change type: UPSERT
2024-10-18 09:01:40,474 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook UpdateIndicesHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:01:40,479 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: EtcUX9vACyZAw/dPG+Inzw==, operation type: UPDATE, index: system_metadata_service_v1
2024-10-18 09:02:04,472 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook IncidentsSummaryHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:02:04,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook IngestionSchedulerHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:02:04,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook EntityChangeEventGeneratorHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:02:04,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook SiblingAssociationHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:02:04,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:109 - Urn urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD) with aspect upstreamLineage received by Sibling Hook.
2024-10-18 09:02:04,476 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:244 - Associating urn:li:dataset:(urn:li:dataPlatform:dbt,_bdm.bdm_dim_request,PROD) and urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD) as siblings.
2024-10-18 09:02:04,481 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook FormAssignmentHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:02:04,481 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:137 - Successfully completed MCL hooks for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)

But performance is very low. Can you help me find the bottleneck?
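One way to narrow this down from the logs alone is to look for long pauses between consecutive lines on the same thread. In the excerpt above, `ThreadPoolTaskExecutor-1` logs the `ESBulkProcessor` "Added request" line at 09:01:40,479 and the next hook only at 09:02:04,472, so roughly 24 seconds pass while `UpdateIndicesHook` is still running. The sketch below is a hypothetical helper (not part of DataHub) that finds such pauses; it assumes the `timestamp [thread]` line prefix shown in these logs, and the sample lines are abbreviated copies of the ones above.

```python
import re
from datetime import datetime

# Matches the "timestamp [thread]" prefix of the GMS log lines shown above.
LINE_RE = re.compile(r"^\[?(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[([^\]]+)\]")


def find_gaps(lines, min_gap_s=1.0):
    """Return (gap_seconds, previous_line, next_line) tuples for pauses of at
    least min_gap_s between consecutive lines of the same thread, largest first."""
    last_seen = {}  # thread name -> (timestamp, line)
    gaps = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without the expected prefix
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        thread = match.group(2)
        if thread in last_seen:
            prev_ts, prev_line = last_seen[thread]
            gap = (ts - prev_ts).total_seconds()
            if gap >= min_gap_s:
                gaps.append((gap, prev_line.strip(), line.strip()))
        last_seen[thread] = (ts, line)
    return sorted(gaps, key=lambda g: g[0], reverse=True)


# Abbreviated sample taken from the log excerpt above.
sample = [
    "2024-10-18 09:01:40,474 [ThreadPoolTaskExecutor-1] INFO  MetadataChangeLogProcessor:119 - Invoking MCL hook UpdateIndicesHook",
    "2024-10-18 09:01:40,479 [ThreadPoolTaskExecutor-1] INFO  ESBulkProcessor:82 - Added request id: EtcUX9vACyZAw/dPG+Inzw==",
    "2024-10-18 09:02:04,472 [ThreadPoolTaskExecutor-1] INFO  MetadataChangeLogProcessor:119 - Invoking MCL hook IncidentsSummaryHook",
]

for gap, before, after in find_gaps(sample):
    print(f"{gap:.3f}s pause\n  after:  {before}\n  before: {after}")
```

Running the same function over a full GMS log file (for example `find_gaps(open("gms.log"))`, where the file name is a placeholder) would list the slowest steps first. A pause that starts right after the Elasticsearch request is queued suggests the time is being spent elsewhere in that hook rather than in the bulk request itself, which reports single-digit milliseconds in these logs.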

deepgarg-visa commented 1 month ago

Hi @pilipyukaaa, which version of DataHub are you using?

Neo4j is certainly a bottleneck here. The following PRs improve Neo4j query performance; check that your version includes these changes:

https://github.com/datahub-project/datahub/pull/10598/files
https://github.com/datahub-project/datahub/pull/10593

Also, create indexes for entities in Neo4j if they have not been created already. By default, they are not created.
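As a rough illustration of what creating those indexes could look like, the sketch below only generates Cypher statements; it does not connect to Neo4j. It rests on several assumptions that are not confirmed in this thread: that graph nodes are labelled with the entity type (for example `dataset`) and carry a `urn` property, that the Neo4j version accepts the `CREATE INDEX ... IF NOT EXISTS` syntax, and that the entity list and index names (`<label>_urn_idx`) are placeholders to adapt. Check the labels in your own database before running anything.

```python
# Illustrative subset of entity types; an assumption, not DataHub's full entity registry.
ENTITY_TYPES = ["dataset", "dataJob", "dataFlow", "chart", "dashboard"]


def index_statements(entity_types):
    """Build one Cypher CREATE INDEX statement per entity label, keyed on the urn property.

    Assumes nodes are labelled by entity type and store their identifier in `urn`.
    The index names are hypothetical.
    """
    return [
        f"CREATE INDEX {label}_urn_idx IF NOT EXISTS FOR (n:{label}) ON (n.urn)"
        for label in entity_types
    ]


# Print the statements so they can be reviewed and pasted into a Cypher shell.
for statement in index_statements(ENTITY_TYPES):
    print(statement + ";")
```

Generating the statements rather than executing them keeps the step reviewable; `SHOW INDEXES` in a Cypher shell lists what already exists on recent Neo4j versions.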

Daniellundin048 commented 1 month ago

I'm not interested.

pilipyukaaa commented 1 month ago

Hello @deepgarg-visa, I am using DataHub version 0.13.3.

pilipyukaaa commented 1 month ago

I updated DataHub to version 0.14.1, and performance is still not good:

2024-10-21 13:24:50,425 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook IngestionSchedulerHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,425 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook EntityChangeEventGeneratorHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,425 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook SiblingAssociationHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,425 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:121 - Urn urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD) with aspect datasetKey received by Sibling Hook.
2024-10-21 13:24:50,433 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:256 - Associating urn:li:dataset:(urn:li:dataPlatform:dbt,_dds.dist_5_dds_CRM_issues,PROD) and urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD) as siblings.
2024-10-21 13:24:50,438 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:97 - Successfully completed MCL hooks for consumer: generic-mae-consumer-job-client urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,439 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:69 - Invoking MCL hooks for consumer: generic-mae-consumer-job-client urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD), aspect name: siblings, entity type: dataset, change type: RESTATE
2024-10-21 13:24:50,439 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook FormAssignmentHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,439 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook UpdateIndicesHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,440 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: BXh3SoWBZKt7JlYWQUbs+w==, operation type: UPDATE, index: system_metadata_service_v1
2024-10-21 13:24:50,441 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Aclickhouse%2C_dds.dist_5_dds_crm_issues%2CPROD%29, operation type: UPDATE, index: datasetindex_v2
2024-10-21 13:24:50,473 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook IncidentsSummaryHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook IngestionSchedulerHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook EntityChangeEventGeneratorHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook SiblingAssociationHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:97 - Successfully completed MCL hooks for consumer: generic-mae-consumer-job-client urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:69 - Invoking MCL hooks for consumer: generic-mae-consumer-job-client urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD), aspect name: upstreamLineage, entity type: dataset, change type: RESTATE
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook FormAssignmentHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook UpdateIndicesHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,478 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: YxtQxbiPZDnFzc4S31sl0A==, operation type: UPDATE, index: system_metadata_service_v1
2024-10-21 13:24:50,478 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Aclickhouse%2C_dds.dist_5_dds_crm_issues%2CPROD%29, operation type: UPDATE, index: datasetindex_v2
2024-10-21 13:24:51,778 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:61 - Successfully fed bulk request 198. Number of events: 10 Took time ms: 10