datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

Can datahub run in an offline environment? #7775

Closed: ldshuai closed this issue 1 year ago

ldshuai commented 1 year ago

Description

I downloaded all the necessary Docker images (https://datahubproject.io/docs/docker/) and imported them into my private offline environment. Following the quickstart guide (https://datahubproject.io/docs/quickstart), I brought up all these images and saw the output ✔ DataHub is now running. But when I ingested data from Postgres via the UI, something went wrong. The Last Status was always Pending, and I got the following error in the datahub-gms container:

ERROR c.datahub.telemetry.TrackingService:105 - Failed to send event to Mixpanel
java.net.UnknownHostException: track.datahubproject.io
            at java...
            ...
            at com.mixpanel.mixpanelapi.MixpanelAPI.sendData(MixpanelAPI.java:134)
            at com.mixpanel.mixpanelapi.MixpanelAPI.sendMessage(MixpanelAPI.java:172)
            at com.mixpanel.mixpanelapi.MixpanelAPI.deliver(MixpanelAPI.java:103)
            at com.mixpanel.mixpanelapi.MixpanelAPI.deliver(MixpanelAPI.java:83)
            at com.mixpanel.mixpanelapi.MixpanelAPI.sendMessage(MixpanelAPI.java:71)
            at com.datahub.telemetry.TrackingService.emitAnalyticsEvent(TrackingService.java:103)
            at com.datahub.auth.authentication.AuthServiceController.lambda$track$4(AuthServiceController.java:340)
            ...

Steps

datahub CLI version: 0.10.0.6

datahub startup command:

datahub docker quickstart -f docker-compose-without-neo4j-m1.quickstart.yml

compose file: docker-compose-without-neo4j-m1.quickstart.yml in datahub-master/docker/quickstart/

networks:
  default:
    name: datahub_network
services:
  broker:
    container_name: broker
    depends_on:
    - zookeeper
    environment:
    - KAFKA_BROKER_ID=1
    - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
    - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
    - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
    - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
    - KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0
    - KAFKA_HEAP_OPTS=-Xms256m -Xmx256m
    - KAFKA_CONFLUENT_SUPPORT_METRICS_ENABLE=false
    hostname: broker
    image: local-harbor/datahub/cp-kafka:6.0.13
    ports:
    - ${DATAHUB_MAPPED_KAFKA_BROKER_PORT:-9092}:9092
  datahub-frontend-react:
    container_name: datahub-frontend-react
    depends_on:
    - datahub-gms
    environment:
    - DATAHUB_GMS_HOST=datahub-gms
    - DATAHUB_GMS_PORT=8080
    - DATAHUB_SECRET=YouKnowNothing
    - DATAHUB_APP_VERSION=1.0
    - DATAHUB_PLAY_MEM_BUFFER_SIZE=10MB
    - JAVA_OPTS=-Xms512m -Xmx512m -Dhttp.port=9002 -Dconfig.file=datahub-frontend/conf/application.conf
      -Djava.security.auth.login.config=datahub-frontend/conf/jaas.conf -Dlogback.configurationFile=datahub-frontend/conf/logback.xml
      -Dlogback.debug=false -Dpidfile.path=/dev/null
    - KAFKA_BOOTSTRAP_SERVER=broker:29092
    - DATAHUB_TRACKING_TOPIC=DataHubUsageEvent_v1
    - ELASTIC_CLIENT_HOST=elasticsearch
    - ELASTIC_CLIENT_PORT=9200
    hostname: datahub-frontend-react
    image: local-harbor/datahub/datahub-frontend-react:v0.10.1
    ports:
    - ${DATAHUB_MAPPED_FRONTEND_PORT:-9002}:9002
    volumes:
    - ${HOME}/.datahub/plugins:/etc/datahub/plugins
  datahub-gms:
    container_name: datahub-gms
    depends_on:
    - mysql
    environment:
    - DATAHUB_SERVER_TYPE=${DATAHUB_SERVER_TYPE:-quickstart}
    - DATAHUB_TELEMETRY_ENABLED=${DATAHUB_TELEMETRY_ENABLED:-true}
    - DATAHUB_UPGRADE_HISTORY_KAFKA_CONSUMER_GROUP_ID=generic-duhe-consumer-job-client-gms
    - EBEAN_DATASOURCE_DRIVER=com.mysql.jdbc.Driver
    - EBEAN_DATASOURCE_HOST=mysql:3306
    - EBEAN_DATASOURCE_PASSWORD=datahub
    - EBEAN_DATASOURCE_URL=jdbc:mysql://mysql:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8
    - EBEAN_DATASOURCE_USERNAME=datahub
    - ELASTICSEARCH_HOST=elasticsearch
    - ELASTICSEARCH_INDEX_BUILDER_MAPPINGS_REINDEX=true
    - ELASTICSEARCH_INDEX_BUILDER_SETTINGS_REINDEX=true
    - ELASTICSEARCH_PORT=9200
    - ENTITY_REGISTRY_CONFIG_PATH=/datahub/datahub-gms/resources/entity-registry.yml
    - ENTITY_SERVICE_ENABLE_RETENTION=true
    - ES_BULK_REFRESH_POLICY=WAIT_UNTIL
    - GRAPH_SERVICE_DIFF_MODE_ENABLED=true
    - GRAPH_SERVICE_IMPL=elasticsearch
    - JAVA_OPTS=-Xms1g -Xmx1g
    - KAFKA_BOOTSTRAP_SERVER=broker:29092
    - KAFKA_SCHEMAREGISTRY_URL=http://schema-registry:8081
    - MAE_CONSUMER_ENABLED=true
    - MCE_CONSUMER_ENABLED=true
    - PE_CONSUMER_ENABLED=true
    - UI_INGESTION_ENABLED=true
    hostname: datahub-gms
    image: local-harbor/datahub/datahub-gms:v0.10.1
    ports:
    - ${DATAHUB_MAPPED_GMS_PORT:-8080}:8080
    volumes:
    - ${HOME}/.datahub/plugins:/etc/datahub/plugins
  datahub-upgrade:
    labels:
      datahub_setup_job: true
    command:
    - -u
    - SystemUpdate
    container_name: datahub-upgrade
    environment:
    - EBEAN_DATASOURCE_USERNAME=datahub
    - EBEAN_DATASOURCE_PASSWORD=datahub
    - EBEAN_DATASOURCE_HOST=mysql:3306
    - EBEAN_DATASOURCE_URL=jdbc:mysql://mysql:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8
    - EBEAN_DATASOURCE_DRIVER=com.mysql.jdbc.Driver
    - KAFKA_BOOTSTRAP_SERVER=broker:29092
    - KAFKA_SCHEMAREGISTRY_URL=http://schema-registry:8081
    - ELASTICSEARCH_HOST=elasticsearch
    - ELASTICSEARCH_PORT=9200
    - ELASTICSEARCH_INDEX_BUILDER_MAPPINGS_REINDEX=true
    - ELASTICSEARCH_INDEX_BUILDER_SETTINGS_REINDEX=true
    - ELASTICSEARCH_BUILD_INDICES_CLONE_INDICES=false
    - GRAPH_SERVICE_IMPL=elasticsearch
    - DATAHUB_GMS_HOST=datahub-gms
    - DATAHUB_GMS_PORT=8080
    - ENTITY_REGISTRY_CONFIG_PATH=/datahub/datahub-gms/resources/entity-registry.yml
    hostname: datahub-upgrade
    image: local-harbor/datahub/datahub-upgrade:v0.10.1
    labels:
      datahub_setup_job: true
  elasticsearch:
    container_name: elasticsearch
    environment:
    - discovery.type=single-node
    - xpack.security.enabled=false
    - ES_JAVA_OPTS=-Xms256m -Xmx512m -Dlog4j2.formatMsgNoLookups=true
    healthcheck:
      retries: 4
      start_period: 2m
      test:
      - CMD-SHELL
      - curl -sS --fail 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=0s'
        || exit 1
    hostname: elasticsearch
    image: local-harbor/datahub/elasticsearch:7.17.6
    mem_limit: 1g
    ports:
    - ${DATAHUB_MAPPED_ELASTIC_PORT:-9200}:9200
    volumes:
    - esdata:/usr/share/elasticsearch/data
  elasticsearch-setup:
    labels:
      datahub_setup_job: true
    container_name: elasticsearch-setup
    depends_on:
    - elasticsearch
    environment:
    - ELASTICSEARCH_HOST=elasticsearch
    - ELASTICSEARCH_PORT=9200
    - ELASTICSEARCH_PROTOCOL=http
    hostname: elasticsearch-setup
    image: local-harbor/datahub/datahub-elasticsearch-setup:v0.10.1
    labels:
      datahub_setup_job: true
  kafka-setup:
    labels:
      datahub_setup_job: true
    container_name: kafka-setup
    depends_on:
    - broker
    - schema-registry
    environment:
    - DATAHUB_PRECREATE_TOPICS=${DATAHUB_PRECREATE_TOPICS:-false}
    - KAFKA_BOOTSTRAP_SERVER=broker:29092
    - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
    hostname: kafka-setup
    image: local-harbor/datahub/datahub-kafka-setup:v0.10.1
    labels:
      datahub_setup_job: true
  mysql:
    command: --character-set-server=utf8mb4 --collation-server=utf8mb4_bin --default-authentication-plugin=mysql_native_password
    container_name: mysql
    environment:
    - MYSQL_DATABASE=datahub
    - MYSQL_USER=datahub
    - MYSQL_PASSWORD=datahub
    - MYSQL_ROOT_PASSWORD=datahub
    hostname: mysql
    image: local-harbor/datahub/mariadb:10.5.8
    ports:
    - ${DATAHUB_MAPPED_MYSQL_PORT:-3306}:3306
    volumes:
    - ../mysql/init.sql:/docker-entrypoint-initdb.d/init.sql
    - mysqldata:/var/lib/mysql
  mysql-setup:
    labels:
      datahub_setup_job: true
    container_name: mysql-setup
    depends_on:
    - mysql
    environment:
    - MYSQL_HOST=mysql
    - MYSQL_PORT=3306
    - MYSQL_USERNAME=datahub
    - MYSQL_PASSWORD=datahub
    - DATAHUB_DB_NAME=datahub
    hostname: mysql-setup
    image: local-harbor/datahub/datahub-mysql-setup:v0.10.1
    labels:
      datahub_setup_job: true
  schema-registry:
    container_name: schema-registry
    depends_on:
    - broker
    environment:
    - SCHEMA_REGISTRY_HOST_NAME=schemaregistry
    - SCHEMA_REGISTRY_KAFKASTORE_SECURITY_PROTOCOL=PLAINTEXT
    - SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS=broker:29092
    hostname: schema-registry
    image: local-harbor/datahub/cp-schema-registry:6.0.13
    ports:
    - ${DATAHUB_MAPPED_SCHEMA_REGISTRY_PORT:-8081}:8081
  zookeeper:
    container_name: zookeeper
    environment:
    - ZOOKEEPER_CLIENT_PORT=2181
    - ZOOKEEPER_TICK_TIME=2000
    hostname: zookeeper
    image: local-harbor/datahub/cp-zookeeper:6.0.13
    ports:
    - ${DATAHUB_MAPPED_ZK_PORT:-2181}:2181
    volumes:
    - zkdata:/var/lib/zookeeper
version: '2.3'
volumes:
  esdata: null
  mysqldata: null
  zkdata: null

Ingestion Source Configuration Recipe:

source:
    type: postgres
    config: 
        host_port: 'xxx:5432' 
        database: xxx
        username: postgres
        include_tables: true
        include_views: true
        profiling: 
            enabled: true
            profile_table_level_only: true
        stateful_ingestion: 
            enabled: true
        password: '${postgres_secret}'
        schema_pattern: 
            allow: 
                - public
        table_pattern: 
            allow: 
                - t_xxx
source:
    type: mysql
    config: 
        host_port: 'xxx:3306' 
        database: xxx
        username: root
        include_tables: true
        include_views: true
        profiling: 
            enabled: true
            profile_table_level_only: true
        stateful_ingestion: 
            enabled: true
        password: '${mysql_secret}'
        schema_pattern: 
            allow: 
                - public
        table_pattern: 
            allow: 
                - t_xxx
xiphl commented 1 year ago

refer to https://datahubproject.io/docs/deploy/telemetry/
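
For reference, a minimal sketch of what that page describes, assuming the quickstart compose file posted above: the datahub-gms service reads DATAHUB_TELEMETRY_ENABLED (defaulting to true), so server-side telemetry can be switched off by changing that entry, for example:

  datahub-gms:
    environment:
    # defaults to true in the quickstart file; false stops GMS from sending events
    # to track.datahubproject.io (the Mixpanel endpoint in the error above)
    - DATAHUB_TELEMETRY_ENABLED=false

Since the quickstart file interpolates the variable with a default (${DATAHUB_TELEMETRY_ENABLED:-true}), exporting DATAHUB_TELEMETRY_ENABLED=false in the shell before running datahub docker quickstart should have the same effect without editing the file. The CLI side can likewise be turned off with the datahub telemetry disable subcommand (check against your CLI version).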

ldshuai commented 1 year ago

Thanks for your reply. I have changed DATAHUB_TELEMETRY_ENABLED to false in the compose file and restarted DataHub, but ingestion via the UI still doesn't work (although without the UnknownHostException). My ingestion source configuration recipe:

source:
    type: postgres
    config: 
        host_port: 'xxx:5432' 
        database: xxx
        username: postgres
        include_tables: true
        include_views: false
        profiling: 
            enabled: true
            profile_table_level_only: true
        stateful_ingestion: 
            enabled: false
        password: '${postgres_secret}'
        schema_pattern: 
            allow: 
                - public

But it worked well with the same configuration above when I ingested data from the CLI. CLI command:

datahub ingest -c postgresql_to_datahub.dbub.yaml

postgresql_to_datahub.dbub.yaml file content:

source:
    type: "postgres"
    config: 
        host_port: xxx:5432
        database: xxx
        username: postgres
        password: postgresql
        include_tables: true
        include_views: false
        profiling: 
            enabled: true
            profile_table_level_only: true
        schema_pattern: 
            allow: 
                -  'public'
sink: 
    type: "datahub-rest"
    config: 
        server: "http://localhost:8080"

Am I making any mistakes?

xiphl commented 1 year ago

I can't help much on UI ingestion as I've never used it - I've been using the CLI so far.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 30 days since being marked as stale.