OpenNebula / one

The open source Cloud & Edge Computing Platform bringing real freedom to your Enterprise Cloud 🚀
http://opennebula.io
Apache License 2.0

Failed to replicate log record at index: XXXX on follower #6570

Closed. Franco-Sparrow closed this issue 2 months ago.

Franco-Sparrow commented 2 months ago

Description
Heavy images are not successfully imported into the image datastore, because the leader changes mid-import. The storage backend is Ceph. All orchestrators are in sync, and the issue is triggered every time a big image is imported into the datastore. The image being imported stays in the LOCKED state forever and never succeeds, because of the leader change.

To Reproduce
Download a big image from the marketplace, or import a big one into the image datastore. It throws the errors below and changes the leader, without ever successfully importing the image into the datastore.
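
For example, a reproduction sketch (the appliance name and datastore ID are placeholders; any sufficiently large image triggers it):

# Export a large appliance from a marketplace into image datastore 1
onemarketapp export "Big Windows Appliance" big-windows -d 1

# The image stays LOCKED while oned tries to replicate the log:
oneimage list
tail -f /var/log/one/oned.log | grep -E 'Failed to replicate|RCM'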

Follower becoming leader:

Tue Apr 23 00:02:03 2024 [Z0][DBM][I]: Purging obsolete LogDB records: 0 records purged. Log state: 0,6694 - 0,6694
Tue Apr 23 00:02:03 2024 [Z0][DBM][I]: Purging obsolete federated LogDB records: 0 records purged. Federated log size: 0.
Tue Apr 23 00:02:03 2024 [Z0][ReM][D]: Req:7104 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked
Tue Apr 23 00:02:03 2024 [Z0][ReM][D]: Req:7104 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Tue Apr 23 00:02:10 2024 [Z0][ReM][D]: Req:2240 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked
Tue Apr 23 00:02:10 2024 [Z0][ReM][D]: Req:2240 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Tue Apr 23 00:02:11 2024 [Z0][ReM][D]: Req:2144 UID:0 IP:127.0.0.1 one.hostpool.info invoked , -2, -1, -1
Tue Apr 23 00:02:11 2024 [Z0][ReM][D]: Req:2144 UID:0 one.hostpool.info result SUCCESS, "<HOST_POOL><HOST><ID..."
Tue Apr 23 00:02:11 2024 [Z0][ReM][D]: Req:6960 UID:0 IP:127.0.0.1 one.datastorepool.info invoked , -2, -1, -1
Tue Apr 23 00:02:11 2024 [Z0][ReM][D]: Req:6960 UID:0 one.datastorepool.info result SUCCESS, "<DATASTORE_POOL><DAT..."
Tue Apr 23 00:02:11 2024 [Z0][ReM][D]: Req:1184 UID:0 IP:127.0.0.1 one.vmpool.info invoked , -2, -1, -1, -1, ""
Tue Apr 23 00:02:11 2024 [Z0][ReM][D]: Req:1184 UID:0 one.vmpool.info result SUCCESS, "<VM_POOL><VM><ID>4</..."
Tue Apr 23 00:02:17 2024 [Z0][ReM][D]: Req:3312 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked
Tue Apr 23 00:02:17 2024 [Z0][ReM][D]: Req:3312 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Tue Apr 23 00:02:21 2024 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Tue Apr 23 00:02:21 2024 [Z0][RCM][I]: Vote not granted from follower 0: [one.zone.voterequest] Candidate's term is outdated
Tue Apr 23 00:02:21 2024 [Z0][RCM][I]: Follower 0 is in term 14 current term is 13. Turning into follower
Tue Apr 23 00:02:21 2024 [Z0][RCM][I]: oned is set to follower mode
Tue Apr 23 00:02:21 2024 [Z0][RRM][E]: Failed to get heartbeat from leader. Starting election proccess
Tue Apr 23 00:02:21 2024 [Z0][RCM][I]: Got vote from follower 0. Total votes: 1
Tue Apr 23 00:02:21 2024 [Z0][RCM][I]: Got majority of votes
Tue Apr 23 00:02:21 2024 [Z0][RCM][I]: Becoming leader of the zone. Last log record: 6696 last applied record: 6696
Tue Apr 23 00:02:21 2024 [Z0][RCM][I]: oned is now the leader of the zone
Tue Apr 23 00:02:22 2024 [Z0][ReM][D]: Req:6800 UID:0 IP:172.17.59.7 one.zone.voterequest invoked , 15, 0, 6696, 12
Tue Apr 23 00:02:22 2024 [Z0][ReM][E]: Req:6800 UID:0 one.zone.voterequest result FAILURE [one.zone.voterequest] Already voted for another candidate
Tue Apr 23 00:02:24 2024 [Z0][ReM][D]: Req:5200 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked
Tue Apr 23 00:02:24 2024 [Z0][ReM][D]: Req:5200 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Tue Apr 23 00:02:24 2024 [Z0][ReM][D]: Req:3872 UID:0 IP:127.0.0.1 one.vmpool.infoextended invoked , -2, -1, -1, -1
Tue Apr 23 00:02:24 2024 [Z0][ReM][D]: Req:3872 UID:0 one.vmpool.infoextended result SUCCESS, "<VM_POOL><VM><ID>4</..."
Tue Apr 23 00:02:26 2024 [Z0][ReM][D]: Req:8192 UID:0 IP:127.0.0.1 one.hostpool.info invoked , -2, -1, -1
Tue Apr 23 00:02:26 2024 [Z0][ReM][D]: Req:8192 UID:0 one.hostpool.info result SUCCESS, "<HOST_POOL><HOST><ID..."
Tue Apr 23 00:02:26 2024 [Z0][ReM][D]: Req:5136 UID:0 IP:127.0.0.1 one.datastorepool.info invoked , -2, -1, -1
Tue Apr 23 00:02:26 2024 [Z0][ReM][D]: Req:5136 UID:0 one.datastorepool.info result SUCCESS, "<DATASTORE_POOL><DAT..."
Tue Apr 23 00:02:26 2024 [Z0][ReM][D]: Req:112 UID:0 IP:127.0.0.1 one.vmpool.info invoked , -2, -1, -1, -1, ""
Tue Apr 23 00:02:26 2024 [Z0][ReM][D]: Req:112 UID:0 one.vmpool.info result SUCCESS, "<VM_POOL><VM><ID>4</..."
Tue Apr 23 00:02:29 2024 [Z0][InM][D]: VM_STATE update from host: 1. VM id: 4, state: RUNNING
Tue Apr 23 00:02:31 2024 [Z0][ReM][D]: Req:9520 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked
Tue Apr 23 00:02:31 2024 [Z0][ReM][D]: Req:9520 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Tue Apr 23 00:02:31 2024 [Z0][ReM][D]: Req:2560 UID:0 IP:127.0.0.1 one.vmpool.infoextended invoked , -2, -1, -1, -1
Tue Apr 23 00:02:31 2024 [Z0][ReM][D]: Req:2560 UID:0 one.vmpool.infoextended result SUCCESS, "<VM_POOL><VM><ID>4</..."
Tue Apr 23 00:02:38 2024 [Z0][ReM][D]: Req:4048 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked
Tue Apr 23 00:02:38 2024 [Z0][ReM][D]: Req:4048 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>2</..."
Tue Apr 23 00:02:38 2024 [Z0][ReM][D]: Req:9936 UID:0 IP:127.0.0.1 one.vmpool.infoextended invoked , -2, -1, -1, -1
Tue Apr 23 00:02:38 2024 [Z0][ReM][D]: Req:9936 UID:0 one.vmpool.infoextended result SUCCESS, "<VM_POOL><VM><ID>4</..."
Tue Apr 23 00:02:41 2024 [Z0][ReM][D]: Req:7392 UID:0 IP:127.0.0.1 one.hostpool.info invoked , -2, -1, -1
Tue Apr 23 00:02:41 2024 [Z0][ReM][D]: Req:7392 UID:0 one.hostpool.info result SUCCESS, "<HOST_POOL><HOST><ID..."
Tue Apr 23 00:02:41 2024 [Z0][ReM][D]: Req:2672 UID:0 IP:127.0.0.1 one.datastorepool.info invoked , -2, -1, -1
Tue Apr 23 00:02:41 2024 [Z0][ReM][D]: Req:2672 UID:0 one.datastorepool.info result SUCCESS, "<DATASTORE_POOL><DAT..."
Tue Apr 23 00:02:41 2024 [Z0][ReM][D]: Req:3648 UID:0 IP:127.0.0.1 one.vmpool.info invoked , -2, -1, -1, -1, ""
Tue Apr 23 00:02:41 2024 [Z0][ReM][D]: Req:3648 UID:0 one.vmpool.info result SUCCESS, "<VM_POOL><VM><ID>4</..."
Tue Apr 23 00:02:42 2024 [Z0][HKM][I]: Hook: RAFT_LEADER_HOOK, successfully executed.
Hook stdout: ARPING 172.17.59.10 from 172.17.59.10 eth0
Sent 1 probes (1 broadcast(s))
Received 0 response(s)
[... the same ARPING probe output repeats 9 more times, 0 responses each ...]
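
The RAFT_LEADER_HOOK above announces the zone's floating IP from the new leader via arping. A quick way to confirm which frontend currently holds the VIP (assuming 172.17.59.10 from the hook output):

# On each frontend, check whether the floating IP is bound to eth0
ip -4 addr show dev eth0 | grep 172.17.59.10

# And check the zone's raft view from the servers
onezone show 0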

Leader becoming follower:

# [...]
Tue Apr 23 00:02:11 2024 [Z0][RCM][D]: Failed to replicate log record at index: 6692 on follower: 1, error: Error replicating log entry 6692 on follower 1: RPC call timed out and aborted
Tue Apr 23 00:02:15 2024 [Z0][RCM][D]: Failed to replicate log record at index: 6692 on follower: 1, error: Error replicating log entry 6692 on follower 1: RPC call timed out and aborted
Tue Apr 23 00:02:15 2024 [Z0][ReM][D]: Req:7696 UID:0 IP:127.0.0.1 one.zone.raftstatus invoked                                                                                            
Tue Apr 23 00:02:15 2024 [Z0][ReM][D]: Req:7696 UID:0 one.zone.raftstatus result SUCCESS, "<RAFT><SERVER_ID>0</..."                                                                       
Tue Apr 23 00:02:15 2024 [Z0][ReM][D]: Req:9840 UID:0 IP:127.0.0.1 one.vmpool.infoextended invoked , -2, -1, -1, -1                                                                       
Tue Apr 23 00:02:15 2024 [Z0][ReM][D]: Req:9840 UID:0 one.vmpool.infoextended result SUCCESS, "<VM_POOL><VM><ID>4</..."                                                                   
Tue Apr 23 00:02:16 2024 [Z0][RCM][I]: Follower 1 term (13) is higher than current (12)                                                                                                   
Tue Apr 23 00:02:16 2024 [Z0][RCM][I]: oned is set to follower mode                                                                                                                       
Tue Apr 23 00:02:16 2024 [Z0][RCM][I]: Replication thread stopped                                                                                                                         
Tue Apr 23 00:02:17 2024 [Z0][HKM][I]: Hook: RAFT_FOLLOWER_HOOK, successfully executed.                                                                                                   

Image datastore info:

onedatastore show 1
DATASTORE 1 INFORMATION
ID             : 1
NAME           : images-hdd
USER           : oneadmin
GROUP          : oneadmin
CLUSTERS       : 0,100
TYPE           : IMAGE
DS_MAD         : ceph
TM_MAD         : ceph
BASE PATH      : /var/lib/one//datastores/1
DISK_TYPE      : RBD
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 3T
FREE:          : 3T
USED:          : 76.9G
LIMIT:         : 2.6T

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="mixed"
BRIDGE_LIST="CLuster-LA-DEV-node1 CLuster-LA-DEV-node2 CLuster-LA-DEV-node3"
CEPH_HOST="CLuster-LA-DEV-node1 CLuster-LA-DEV-node2 CLuster-LA-DEV-node3"
CEPH_SECRET="18be8d6b-ae42-4c88-8d3c-878bb84292c8"
CEPH_USER="libvirt"
CLONE_TARGET="SELF"
CLONE_TARGET_SHARED="SELF"
CLONE_TARGET_SSH="SYSTEM"
COMPATIBLE_SYS_DS="0"
DATASTORE_CAPACITY_CHECK="yes"
DISK_TYPE="RBD"
DISK_TYPE_SHARED="RBD"
DISK_TYPE_SSH="FILE"
DRIVER="raw"
DS_MAD="ceph"
LIMIT_MB="2733969.9625"
LN_TARGET="NONE"
LN_TARGET_SHARED="NONE"
LN_TARGET_SSH="SYSTEM"
NAME="images-hdd"
POOL_NAME="hcx_pool1"
TM_MAD="ceph"
TM_MAD_SYSTEM="ssh,shared"
TYPE="IMAGE_DS"

IMAGES
0
3
4
5
10
12
13

Expected behavior
The "heavy" image is successfully imported into the datastore.


Franco-Sparrow commented 2 months ago

Updating the datastore info. I had mixed in the output of another cluster that was not using Ceph.

DATASTORE 1 INFORMATION
ID             : 1
NAME           : images-hdd
USER           : oneadmin
GROUP          : oneadmin
CLUSTERS       : 0,100
TYPE           : IMAGE
DS_MAD         : ceph
TM_MAD         : ceph
BASE PATH      : /var/lib/one//datastores/1
DISK_TYPE      : RBD
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 3T
FREE:          : 3T
USED:          : 76.9G
LIMIT:         : 2.6T

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="mixed"
BRIDGE_LIST="CLuster-LA-DEV-node1 CLuster-LA-DEV-node2 CLuster-LA-DEV-node3"
CEPH_HOST="CLuster-LA-DEV-node1 CLuster-LA-DEV-node2 CLuster-LA-DEV-node3"
CEPH_SECRET="18be8d6b-ae42-4c88-8d3c-878bb84292c8"
CEPH_USER="libvirt"
CLONE_TARGET="SELF"
CLONE_TARGET_SHARED="SELF"
CLONE_TARGET_SSH="SYSTEM"
COMPATIBLE_SYS_DS="0"
DATASTORE_CAPACITY_CHECK="yes"
DISK_TYPE="RBD"
DISK_TYPE_SHARED="RBD"
DISK_TYPE_SSH="FILE"
DRIVER="raw"
DS_MAD="ceph"
LIMIT_MB="2733969.9625"
LN_TARGET="NONE"
LN_TARGET_SHARED="NONE"
LN_TARGET_SSH="SYSTEM"
NAME="images-hdd"
POOL_NAME="hcx_pool1"
TM_MAD="ceph"
TM_MAD_SYSTEM="ssh,shared"
TYPE="IMAGE_DS"

IMAGES
0
3
4
5
10
12
13

Thanks in advance; looking forward to your response :)

Franco-Sparrow commented 2 months ago

The problem was narrowed down to the creation of the image in the datastore. The frontend's memory cache fills up until no free memory is left. This affects MySQL, which starts logging Slow query entries in oned.log for every write to the DB, until the error Failed to replicate log record at index: XXXX on follower shows up and the leader changes. The free memory/buffers keep decreasing.
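
A quick way to watch this happen during an import (standard tools, nothing OpenNebula-specific):

# Watch free memory and page cache drain while the image is copied
watch -n 1 free -m

# Follow the slow-query messages logged while the DB is starved
tail -f /var/log/one/oned.log | grep -i 'slow query'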

The frontend is an LXC container inside the compute node. This is an HCI deployment, one frontend per node (3 nodes and 3 frontends). The container's RAM is limited to 8 GiB and the cache is shared with the node. We can't give all of the node's RAM to the orchestrator, and OpenNebula needs a lot of cache memory when a big image is imported into the image datastore.

Also note that LXC 5.x dropped swap support for containers.
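
For reference, a minimal sketch of the kind of limit applied to the frontend container (assuming a cgroup v2 host and LXC 5.x; the container name is hypothetical):

# /var/lib/lxc/frontend1/config (excerpt)
lxc.cgroup2.memory.max = 8G
# No swap limit is possible on LXC 5.x, so page-cache pressure
# hits the 8 GiB ceiling directly.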

Franco-Sparrow commented 2 months ago

The issue is also present in 6.8.1. May I have a follow-up on this, please?

If I can't import big images into the datastore, how would it be possible to download customized Windows appliances of more than 15 GB from a private marketplace? The image creation procedure floods the frontend network, puts the orchestrator in an error state, triggers a new leader election, and leaves the new image in a LOCKED state forever.

Franco-Sparrow commented 2 months ago

The problem was the I/O on the compute node's OS disk, which also hosts the frontend LXC container.

Hi @rsmontero

My team and I would suggest the following change to the cp script of the Ceph datastore driver, for the case where the storage backend is more aggressive in write IOPS than the read IOPS of your OS disk (Ceph vs. a 10K HDD software RAID):

RECOMMENDATION:

# [...]
    multiline_exec_and_log "set -e -o pipefail; $DUMP | $SSH $DST_HOST ionice -c2 -n7 $DD of=$TMP_DST bs=${DD_BLOCK_SIZE:-64k}" \
                 "Error dumping $SRC to $DST_HOST:$TMP_DST"
# [...]
        if [ "\$FORMAT" != "raw" ] && [ "\$FORMAT" != "luks" ]; then
            ionice -c2 -n7 $QEMU_IMG convert -O raw $TMP_DST $TMP_DST.raw
            #mv $TMP_DST.raw $TMP_DST
            ionice -c2 -n7 $RBD import $FORMAT_OPT $TMP_DST.raw $RBD_SOURCE
            $RM -f $TMP_DST.raw
        else
            ionice -c2 -n7 $RBD import $FORMAT_OPT $TMP_DST $RBD_SOURCE
        fi

        # remove original
        $RM -f $TMP_DST
# [...]
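
The idea behind the change (our reading, not official driver documentation): ionice -c2 -n7 runs the dd, qemu-img convert and rbd import steps in the best-effort I/O scheduling class at the lowest priority, so the bulk copy cannot starve oned and MySQL on the same OS disk. This can be verified on a running import (the pgrep pattern is illustrative):

# Inspect the I/O class of the running import process
ionice -p "$(pgrep -f 'rbd import' | head -1)"
# expected: best-effort: prio 7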

OLD CODE:

# [...]
    multiline_exec_and_log "set -e -o pipefail; $DUMP | $SSH $DST_HOST $DD of=$TMP_DST bs=${DD_BLOCK_SIZE:-64k}" \
                 "Error dumping $SRC to $DST_HOST:$TMP_DST"
# [...]
    if [ "\$FORMAT" != "raw" ] && [ "\$FORMAT" != "luks" ]; then
        $QEMU_IMG convert -O raw $TMP_DST $TMP_DST.raw
        mv $TMP_DST.raw $TMP_DST
    fi

    $RBD import $FORMAT_OPT $TMP_DST $RBD_SOURCE

    # remove original
    $RM -f $TMP_DST
# [...]
rsmontero commented 2 months ago

Thanks for the update @Franco-Sparrow

We plan to review the TM drivers in 6.10.0 and 7.0.0. I added a note here #5907 to include the proposed change.