apache / doris-operator

Doris kubernetes operator
Apache License 2.0

[Bug] doris fe not ready after reboot fe/be #290

Open ming12713 opened 1 week ago

ming12713 commented 1 week ago

Version

2.11

What's Wrong?

2024-11-12 08:05:14,341 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 3917247, label: vtc_source_nome__KC_ods_vtc_source_nome__KC_2__KC_loshu_ods_vtc_source_nome__KC_0__KC_495084__KC_1730487306367, db id: 11154, table id list: 508351, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1730487306375, commit time: 1730487307994, finish time: -1, reason: 
2024-11-12 08:05:14,341 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 3917245, label: vtc_source_nome__KC_ods_vtc_source_nome__KC_1__KC_loshu_ods_vtc_source_nome__KC_0__KC_494027__KC_1730487305225, db id: 11154, table id list: 508351, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1730487305236, commit time: 1730487308002, finish time: -1, reason: 
2024-11-12 08:05:14,341 INFO (stateListener|83) [OlapTable.updateVisibleVersionAndTime():2591] updateVisibleVersionAndTime, tableName: ods_vtc_source_nome, visibleVersion, 344672, visibleVersionTime: 1730487308007
2024-11-12 08:05:14,341 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a VISIBLE transaction TransactionState. transaction id: 3917247, label: vtc_source_nome__KC_ods_vtc_source_nome__KC_2__KC_loshu_ods_vtc_source_nome__KC_0__KC_495084__KC_1730487306367, db id: 11154, table id list: 508351, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: VISIBLE, error replicas num: 0, replica ids: , prepare time: 1730487306375, commit time: 1730487307994, finish time: 1730487308007, reason: 
2024-11-12 08:05:14,342 INFO (stateListener|83) [OlapTable.updateVisibleVersionAndTime():2591] updateVisibleVersionAndTime, tableName: ods_vtc_source_nome, visibleVersion, 344673, visibleVersionTime: 1730487308018
2024-11-12 08:05:14,342 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a VISIBLE transaction TransactionState. transaction id: 3917245, label: vtc_source_nome__KC_ods_vtc_source_nome__KC_1__KC_loshu_ods_vtc_source_nome__KC_0__KC_494027__KC_1730487305225, db id: 11154, table id list: 508351, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: VISIBLE, error replicas num: 0, replica ids: , prepare time: 1730487305236, commit time: 1730487308002, finish time: 1730487308018, reason: 
2024-11-12 08:05:14,342 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 3917244, label: nome_raw_data__KC_ods_vtc_nome_raw_data__KC_2__KC_loshu_ods_vtc_nome_raw_data__KC_0__KC_1611187__KC_1730487305168, db id: 11154, table id list: 74966, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1730487305177, commit time: 1730487308558, finish time: -1, reason: 
2024-11-12 08:05:14,342 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 3917246, label: nome_raw_data__KC_ods_vtc_nome_raw_data__KC_1__KC_loshu_ods_vtc_nome_raw_data__KC_0__KC_1612525__KC_1730487305321, db id: 11154, table id list: 74966, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1730487305400, commit time: 1730487308568, finish time: -1, reason: 
/opt/apache-doris/fe/bin/start_fe.sh: line 265:   162 Killed                  ${LIMIT:+${LIMIT}} "${JAVA}" ${final_java_opt:+${final_java_opt}} -XX:-OmitStackTraceInFastThrow -XX:OnOutOfMemoryError="kill -9 %p" ${coverage_opt:+${coverage_opt}} org.apache.doris.DorisFE ${HELPER:+${HELPER}} ${OPT_VERSION:+${OPT_VERSION}} "${METADATA_FAILURE_RECOVERY}" "$@" < /dev/null

Doris is installed through the operator with 1 FE node and 1 BE node. After both the FE and the BE were restarted, the FE fails to start normally and logs the output above. The BE IP 10.42.1.19 in the log is the previous BE pod IP, not the Service IP; the BE pod now has the IP 10.42.1.6. The FE is configured to use Service-based discovery.
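To see how many replayed transactions still reference the old pod IP, the log can be scanned for the `coordinator: BE: <ip>` field. The following is a minimal diagnostic sketch, not part of Doris or the operator: the function name `stale_coordinators` and the shortened sample lines are illustrative, and a stale IP in a replayed entry does not by itself prove that it is what stops the FE from starting.

```python
import re
from collections import Counter

# Matches the "coordinator: BE: <ip>" field seen in the replay log lines above.
COORDINATOR_RE = re.compile(r"coordinator: BE: (\d{1,3}(?:\.\d{1,3}){3})")


def stale_coordinators(log_lines, current_be_ips):
    """Count replayed transactions whose coordinator IP is not a current BE IP."""
    current = set(current_be_ips)
    counts = Counter()
    for line in log_lines:
        match = COORDINATOR_RE.search(line)
        if match and match.group(1) not in current:
            counts[match.group(1)] += 1
    return dict(counts)


# Shortened stand-ins for the fe.log lines quoted in this issue.
sample = [
    "... callback id: -1, coordinator: BE: 10.42.1.19, transaction status: COMMITTED, ...",
    "... callback id: -1, coordinator: BE: 10.42.1.19, transaction status: VISIBLE, ...",
    "... updateVisibleVersionAndTime, tableName: ods_vtc_source_nome, ...",
]

# 10.42.1.6 is the BE pod IP after the restart, as reported above.
print(stale_coordinators(sample, ["10.42.1.6"]))  # {'10.42.1.19': 2}
```

In practice the lines would come from the FE log file instead of the `sample` list, and the current BE IPs from the pod listing.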

Pod network CIDR: 10.42.1.x/16

Service network CIDR: 10.43.48.x

What You Expected?

The FE should become ready again after the FE and BE pods are restarted.

How to Reproduce?

No response

Anything Else?

No response

ming12713 commented 1 week ago

In my case, Kafka writes to Doris through the connector in sink mode, and the connector keeps writing data while Doris is restarting. The replayed log entries record the coordinator BE IP. Is it possible that the connector uses Stream Load to write data? My guess is that this transaction data has been persisted to the FE metadata through BDB but has not yet been synchronized to the BE. If the BE is restarted at that moment, the FE may end up with a coordinator BE IP that it can no longer reach, which causes cluster issues. Is my understanding correct?

intelligentfu8 commented 1 week ago

What is your DorisCluster spec? Please share the YAML. In Kubernetes, if the pod IP is not static across restarts, please set enable_fqdn_mode = true so that the nodes communicate by FQDN. The connector's sink mode does use Stream Load to insert data.
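A quick way to confirm whether the setting is present in the `fe.conf` text mounted from the ConfigMap is to parse it as `key = value` lines. This is a rough sketch under assumptions: the helper names and the sample contents are hypothetical, and treating a missing key as disabled is an assumption rather than something stated in this thread.

```python
def parse_conf(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def fqdn_mode_enabled(conf_text):
    """Return True only if enable_fqdn_mode is explicitly set to true."""
    # Assumption: a missing key is treated as disabled.
    return parse_conf(conf_text).get("enable_fqdn_mode", "false").lower() == "true"


# Hypothetical fe.conf contents, for illustration only.
with_fqdn = """
# FE settings
http_port = 8030
enable_fqdn_mode = true
"""
without_fqdn = "http_port = 8030\n"

print(fqdn_mode_enabled(with_fqdn))     # True
print(fqdn_mode_enabled(without_fqdn))  # False
```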

ming12713 commented 1 week ago

@intelligentfu8 I looked at the Stream Load mechanism. The FE selects a BE (Backend) as the Coordinator node in a round-robin manner; that BE is responsible for scheduling the import job, and the FE returns an HTTP redirect to the client. The redirect uses the BE pod IP instead of the Service address, which might be the cause. https://doris.apache.org/docs/data-operate/import/import-way/stream-load-manual/
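Whether a given redirect points at a pod IP or at a DNS name can be checked from its `Location` header. The sketch below only classifies the URL and does not contact a cluster; the function name `redirect_target`, the database and table names, and the example hostname are all hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def redirect_target(location):
    """Return (host, port, kind) for a redirect URL; kind is 'ip' or 'hostname'."""
    parts = urlsplit(location)
    host = parts.hostname
    try:
        ipaddress.ip_address(host)
        kind = "ip"
    except ValueError:
        kind = "hostname"
    return host, parts.port, kind


# A redirect to a literal pod IP, as described in this comment.
by_ip = "http://10.42.1.19:8040/api/example_db/example_tbl/_stream_load"
# A redirect to a DNS name (hypothetical hostname).
by_name = "http://be-0.example.svc.cluster.local:8040/api/example_db/example_tbl/_stream_load"

print(redirect_target(by_ip))    # ('10.42.1.19', 8040, 'ip')
print(redirect_target(by_name))  # ('be-0.example.svc.cluster.local', 8040, 'hostname')
```

A client that caches or retries an `ip` target after the BE pod is rescheduled would keep hitting an address that no longer belongs to the BE.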

doriscluster.yaml

apiVersion: v1
items:
- apiVersion: doris.selectdb.com/v1
  kind: DorisCluster
  metadata:
    labels:
      app.kubernetes.io/instance: doriscluster
      app.kubernetes.io/name: doriscluster
      app.kubernetes.io/part-of: doris-operator
    name: doriscluster
    namespace: doris
    resourceVersion: "18187746"
    uid: 9b4d358b-ac8c-491c-8701-6a7ce61f4bdb
  spec:
    beSpec:
      annotations:
        selectdb/dorisclsuter.component: be
      envVars:
      - name: HOME
        value: /opt/selectdb
      image: selectdb/doris.be-ubuntu:2.1.1
      limits:
        cpu: 24
        memory: 64Gi
      nodeSelector:
        kubernetes.io/hostname: loshu-kube-ds01
      persistentVolumes:
      - mountPath: /opt/apache-doris/be/storage
        name: doris-be
      replicas: 1
      requests:
        cpu: 2
        memory: 8Gi
      service:
        servicePorts:
        - nodePort: 32422
          targetPort: 9060
        - nodePort: 30652
          targetPort: 8040
        - nodePort: 30891
          targetPort: 9050
        - nodePort: 31420
          targetPort: 8060
        type: NodePort
      systemInitialization:
        command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=2000000
    feSpec:
      annotations:
        selectdb/dorisclsuter.component: fe
      configMapInfo:
        configMapName: fe-configmap
        resolveKey: fe.conf
      envVars:
      - name: HOME
        value: /opt/selectdb
      image: selectdb/doris.fe-ubuntu:2.1.1
      limits:
        cpu: 8
        memory: 32Gi
      nodeSelector:
        kubernetes.io/hostname: loshu-kube-ds
      persistentVolumes:
      - mountPath: /opt/apache-doris/fe/doris-meta
        name: doris-fe
      replicas: 1
      requests:
        cpu: 2
        memory: 4Gi
      service:
        servicePorts:
        - nodePort: 30148
          targetPort: 8030
        - nodePort: 30252
          targetPort: 9020
        - nodePort: 31341
          targetPort: 9030
        type: NodePort
      systemInitialization:
        command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=2000000

intelligentfu8 commented 4 days ago

Yes, you are right. The SelectDB community is improving Stream Load, though. They have already fixed this issue for Arrow Flight in a PR, and PRs for Flink and Spark are coming. Please refer to the issue for more details.

ming12713 commented 4 days ago

Nice, thanks!