davidkuster / firecracker-containerd-experiment

FirecrackerVM and firecracker-containerd experiments

Notes from attempt on AWS metal instance #2

Open davidkuster opened 1 month ago

davidkuster commented 1 month ago
  1. After booting the metal instance, starting firecracker-containerd with sudo firecracker-containerd --config /etc/firecracker-containerd/config.toml did not work.

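I tried several things, but I think what got it working was re-running the thin-pool setup below, presumably because the loop devices and the device mapper pool do not persist across a reboot: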

# Setup device mapper thin pool
sudo mkdir -p /var/lib/firecracker-containerd/snapshotter/devmapper
cd /var/lib/firecracker-containerd/snapshotter/devmapper
DIR=/var/lib/firecracker-containerd/snapshotter/devmapper
POOL=fc-dev-thinpool

if [[ ! -f "${DIR}/data" ]]; then
    sudo touch "${DIR}/data"
    sudo truncate -s 100G "${DIR}/data"
fi

if [[ ! -f "${DIR}/metadata" ]]; then
    sudo touch "${DIR}/metadata"
    sudo truncate -s 2G "${DIR}/metadata"
fi

DATADEV="$(sudo losetup --output NAME --noheadings --associated ${DIR}/data)"
if [[ -z "${DATADEV}" ]]; then
    DATADEV="$(sudo losetup --find --show ${DIR}/data)"
fi

METADEV="$(sudo losetup --output NAME --noheadings --associated ${DIR}/metadata)"
if [[ -z "${METADEV}" ]]; then
    METADEV="$(sudo losetup --find --show ${DIR}/metadata)"
fi

SECTORSIZE=512
DATASIZE="$(sudo blockdev --getsize64 -q ${DATADEV})"
LENGTH_SECTORS=$(bc <<< "${DATASIZE}/${SECTORSIZE}")
DATA_BLOCK_SIZE=128
LOW_WATER_MARK=32768
THINP_TABLE="0 ${LENGTH_SECTORS} thin-pool ${METADEV} ${DATADEV} ${DATA_BLOCK_SIZE} ${LOW_WATER_MARK} 1 skip_block_zeroing"
echo "${THINP_TABLE}"

if ! sudo dmsetup reload "${POOL}" --table "${THINP_TABLE}"; then
    sudo dmsetup create "${POOL}" --table "${THINP_TABLE}"
fi
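As a sanity check on the arithmetic in the script above, here is a minimal sketch that builds the same thin-pool table line from a byte count without touching any devices. The helper name thinp_table and the /dev/loop0 and /dev/loop1 paths are hypothetical; the constants are copied from the script.

```shell
#!/usr/bin/env bash
# thinp_table is a hypothetical helper: print the dm thin-pool table line
# for a data device of the given size in bytes. Constants mirror the script.
thinp_table() {
    local data_bytes="$1" metadev="$2" datadev="$3"
    local sector_size=512        # dmsetup table lengths are in 512-byte sectors
    local data_block_size=128    # in sectors, i.e. 64 KiB per pool block
    local low_water_mark=32768   # in pool blocks
    local length_sectors=$(( data_bytes / sector_size ))
    echo "0 ${length_sectors} thin-pool ${metadev} ${datadev} ${data_block_size} ${low_water_mark} 1 skip_block_zeroing"
}

# 100G data file, as created by `truncate -s 100G` (G = 1024^3 bytes).
# The loop device paths here are placeholders, not values from a real run.
thinp_table $(( 100 * 1024 * 1024 * 1024 )) /dev/loop1 /dev/loop0
```

After the real script has run, sudo dmsetup status fc-dev-thinpool is one way to confirm the pool actually exists.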
  1. I added the following to /etc/containerd/firecracker-runtime.json for more logging.

    "debug": true,
    "log_levels": ["debug"],
  2. Running sudo modprobe vhost_vsock might have changed the logs, but I’m not sure.

  3. With the extra debug logging on, this line repeats about every 100 ms until the timeout.

    DEBU[2024-07-03T17:40:51.569705568Z]                                               attempt=193 error="temporary vsock dial failure: vsock ack message failure: failed to read \"OK <port>\" within 1s: EOF" runtime=aws.firecracker vmID=260aa652-83a4-4d4f-8bae-9155cd344b09
    DEBU[2024-07-03T17:40:51.669361850Z]                                               attempt=194 error="temporary vsock dial failure: vsock ack message failure: failed to read \"OK <port>\" within 1s: EOF" runtime=aws.firecracker vmID=260aa652-83a4-4d4f-8bae-9155cd344b09
    DEBU[2024-07-03T17:40:51.770084597Z]                                               attempt=195 error="temporary vsock dial failure: vsock ack message failure: failed to read \"OK <port>\" within 1s: EOF" runtime=aws.firecracker vmID=260aa652-83a4-4d4f-8bae-9155cd344b09
    DEBU[2024-07-03T17:40:51.869790128Z]                                               attempt=196 error="temporary vsock dial failure: vsock ack message failure: failed to read \"OK <port>\" within 1s: EOF" runtime=aws.firecracker vmID=260aa652-83a4-4d4f-8bae-9155cd344b09
    DEBU[2024-07-03T17:40:51.969491562Z]                                               attempt=197 error="temporary vsock dial failure: vsock ack message failure: failed to read \"OK <port>\" within 1s: EOF" runtime=aws.firecracker vmID=260aa652-83a4-4d4f-8bae-9155cd344b09

    I’m trying to track down that error, but there is almost nothing about it online. The only other report I could find is https://forums.freebsd.org/threads/freebsd-way-into-the-clouds.92168/ and they gave up on Firecracker.
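For the modprobe step above, one way to remove the "not sure" is to check whether the module is actually loaded before and after. This is only a sketch: module_loaded is a hypothetical helper, and whether vhost_vsock matters for this failure at all is still an open question.

```shell
#!/usr/bin/env bash
# module_loaded is a hypothetical helper: succeeds if the named module
# appears in a /proc/modules-style listing (module name is the first field).
module_loaded() {
    local name="$1" listing="${2:-/proc/modules}"
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$listing"
}

if module_loaded vhost_vsock; then
    echo "vhost_vsock is loaded"
else
    echo "vhost_vsock is not loaded"
fi
```

A driver compiled into the kernel does not show up in /proc/modules, so "not loaded" here only means it is not present as a loadable module.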

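Latest theory: the agent inside the VM (firecracker-agent.service) crashes during boot, so the runtime never gets anything to connect to over vsock. The guest console output shows the service exiting: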

DEBU[2024-07-03T18:10:04.230881068Z] [    1.690077] systemd[1]: firecracker-agent.service: Main process exited, code=exited, status=1/FAILURE  jailer=noop runtime=aws.firecracker vmID=7c332110-0e9f-46bb-8e1b-439ff60333ef vmm_stream=stdout
DEBU[2024-07-03T18:10:04.231852395Z] [    1.691186] systemd[1]: firecracker-agent.service: Failed with result 'exit-code'.  jailer=noop runtime=aws.firecracker vmID=7c332110-0e9f-46bb-8e1b-439ff60333ef vmm_stream=stdout
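To pull lines like these out of a long debug log, a small filter helps. This is a sketch that assumes the log format shown above (guest console lines tagged vmm_stream=stdout, systemd's "Failed with result" wording); guest_unit_failures and the log file name are hypothetical.

```shell
#!/usr/bin/env bash
# guest_unit_failures is a hypothetical helper: reads firecracker-containerd
# debug output on stdin and prints "<unit> <result>" for each systemd unit
# the guest console reported as failed, de-duplicated.
guest_unit_failures() {
    grep 'vmm_stream=stdout' \
        | sed -n "s/.*systemd\[[0-9]*]: \([^ ]*\.service\): Failed with result '\([^']*\)'.*/\1 \2/p" \
        | sort -u
}

# Usage, assuming the daemon output was saved to a file (name is a placeholder):
#   guest_unit_failures < firecracker-containerd-debug.log
```

Run against the two lines above, it reduces them to the unit name and the failure result, which makes it easier to see whether anything other than the agent is failing in the guest.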
davidkuster commented 1 month ago

Wondering if this could be related: https://github.com/firecracker-microvm/firecracker-containerd/issues/325

davidkuster commented 1 month ago
$ cat /proc/sys/kernel/random/entropy_avail
$ sudo apt install haveged
$ sudo systemctl start haveged
$ sudo systemctl status haveged.service

References:
https://docs.vultr.com/add-entropy-with-haveged-to-improve-cloud-server-randomness
https://gist.github.com/arcenet/24015f4e34a00bbaa8f3bcc64c1745e6
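The entropy check above can be wrapped so it gives a yes/no answer. A minimal sketch: entropy_ok is a hypothetical helper, and the default threshold of 256 bits is an assumed cutoff, not a documented requirement of firecracker-containerd.

```shell
#!/usr/bin/env bash
# entropy_ok is a hypothetical helper: succeeds when the kernel's reported
# entropy (in bits) is at or above a threshold. The 256 default is an assumption.
entropy_ok() {
    local threshold="${1:-256}" source="${2:-/proc/sys/kernel/random/entropy_avail}"
    local avail
    avail="$(cat "$source")" || return 2
    echo "entropy_avail=${avail} threshold=${threshold}"
    [ "$avail" -ge "$threshold" ]
}

if entropy_ok; then
    echo "entropy looks sufficient"
else
    echo "entropy is low or unreadable; haveged may help"
fi
```

Note that this reads the pool of whichever kernel it runs on. Run on the host, it says nothing about the guest, which has its own kernel and its own entropy pool.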

Asked in the Slack community.