AWS wrote a white paper about Lambda explaining how it streams container images for better startup times. Let's play with this idea: containers (even tiny VMs) are always fun to manipulate.
On-demand Container Loading in AWS Lambda
Build the make_ext4fs image
make make_ext4fs
Build the tool image
make docker-tool
Fetch some images
make img NAME=gcr.io/distroless/python3-debian12
make img NAME=gcr.io/distroless/base-debian12
# *.img files are stored in out/
ls out
# images are full of holes
ls -lsh out/*.img
# holes are here
filefrag -v out/gcr.io_distroless_base-debian12.img
Disk images are well handled by file
$ file out/gcr.io_distroless_python3-debian12.img
out/gcr.io_distroless_python3-debian12.img: Linux rev 1.0 ext4 filesystem data, UUID=d1fa2f31-4aeb-8354-9262-b4d19504856c, volume name "stream" (extents) (large files)
Build tools (with golang)
make
Cut images into small chunks
./bin/chunk out/*.img
# recipes are stored next to the img file
ls out
# chunks are stored in smr/
ls smr
Run the server
# the first arg is a recipe
./bin/server out/gcr.io_distroless_python3-debian12.img.recipe
Qemu can see it
$ qemu-img info nbd://localhost:10809/smr
image: nbd://localhost:10809/smr
file format: raw
virtual size: 1 GiB (1073741824 bytes)
disk size: unavailable
Child node '/file':
filename: nbd://localhost:10809/smr
protocol type: nbd
file length: 1 GiB (1073741824 bytes)
disk size: unavailable
Mount the image (from a Linux host)
# split your tmux with ctrl-% and watch the kernel yelling at nbd
tail -f /var/log/kern.log
# nbd module should be loaded
sudo modprobe nbd
sudo nbd-client -N smr localhost 10809 /dev/nbd1
sudo mkdir /mnt/smr
sudo mount -o ro -t ext4 /dev/nbd1 /mnt/smr
ls /mnt/smr
Mount from a VM on a Mac
It works with brew install lima (the image is minimalist, without kernel logging, which makes debugging a bit harder) or brew install multipass (a good old fat Ubuntu image). Don't use localhost; use the host IP.
Some fixtures
make img NAME=gcr.io/distroless/base-debian12
./bin/chunk out/gcr.io_distroless_base-debian12.img
Test (and even fuzzing)
make test
make fuzz
Compare chunked and plain images
./bin/debug out/gcr.io_distroless_base-debian12.img
In the real world, chunks should be stored in a replicated/centralized place: an S3 clone, or even a registry.
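As a rough sketch of what that could look like, here is a minimal Go interface for a pluggable chunk store; the names and methods are assumptions made for illustration, not the project's actual API.

// chunkstore.go: hypothetical interface for a remote chunk store.
package store

import "context"

// ChunkStore fetches immutable, content-addressed chunks by their hash.
// An S3-like backend or an OCI registry could both sit behind it.
type ChunkStore interface {
	Get(ctx context.Context, hash string) ([]byte, error)
	Put(ctx context.Context, hash string, data []byte) error
}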
In the paper, each container has its own sidekick service handling its "lazy disk", but chunks are available to all sidekicks on a server.
Each server has a shared local cache, with coordinated downloading: a chunk should be downloaded only once, even if it is needed to start multiple containers (see the sketch after the diagram).
Cached just handles the eviction of old chunks and the locking for downloading a remote chunk.
+-------------+
| Chunk store |
+------+------+
|
+--------------|-----------+
| +--------+ | |
| | Cached | | |
| +---+----+ | |
| | | |
| +---+--------+--+ |
| | Storaged | |
| +---------------+ |
| |
+--------------------------+
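The "download once" behaviour can be illustrated in-process with golang.org/x/sync/singleflight, which collapses concurrent requests for the same key into a single call; this is only an analogy for what Cached coordinates across containers, not the project's code.

// singleflight_demo.go: three "containers" ask for the same chunk,
// the download runs at most once while the calls overlap.
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchChunk stands in for a download from the chunk store.
func fetchChunk(key string) ([]byte, error) {
	fmt.Println("downloading", key) // typically printed once when callers overlap
	return []byte("bytes-of-" + key), nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v, _, shared := group.Do("sha256:abcd", func() (interface{}, error) {
				return fetchChunk("sha256:abcd")
			})
			fmt.Println(len(v.([]byte)), "bytes, shared with other callers:", shared)
		}()
	}
	wg.Wait()
}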
Cached manages the chunk lifecycle with LRU-K, just like a plain old spinning disk.
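For the eviction rule itself, a toy LRU-K (with K=2) could look like the snippet below; this is the textbook rule (evict the chunk whose K-th most recent access is the oldest), not the actual Cached implementation.

// lruk.go: toy LRU-2 victim selection.
package main

import (
	"fmt"
	"time"
)

// history holds the last two access times of a chunk, most recent first.
type history [2]time.Time

// victim picks the chunk whose 2nd most recent access is the oldest;
// chunks accessed only once have a zero time there and are evicted first.
func victim(chunks map[string]history) string {
	var worst string
	var oldest time.Time
	first := true
	for key, h := range chunks {
		if first || h[1].Before(oldest) {
			worst, oldest, first = key, h[1], false
		}
	}
	return worst
}

func main() {
	now := time.Now()
	chunks := map[string]history{
		"chunk-a": {now, now.Add(-1 * time.Minute)},
		"chunk-b": {now.Add(-5 * time.Second), now.Add(-10 * time.Minute)},
	}
	fmt.Println("evict:", victim(chunks)) // chunk-b: its 2nd access is older
}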
Cached exposes a few RPCs via a minimalistic protocol (yamux over a UNIX socket), without serialization (binary in, binary out); message sizes are announced, so no buffering is needed.
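A client call on such a protocol could look roughly like this; the socket path, the big-endian uint32 size prefix and the use of hashicorp/yamux are assumptions for the sketch, not the project's exact wire format.

// rpc_client_sketch.go: one length-prefixed request/response over yamux.
package main

import (
	"encoding/binary"
	"io"
	"log"
	"net"

	"github.com/hashicorp/yamux"
)

func main() {
	conn, err := net.Dial("unix", "/tmp/cached.sock") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	session, err := yamux.Client(conn, nil)
	if err != nil {
		log.Fatal(err)
	}
	stream, err := session.Open() // one RPC = one yamux stream
	if err != nil {
		log.Fatal(err)
	}
	defer stream.Close()

	// Binary in: announce the size, then send the raw bytes, no serialization.
	req := []byte("some-chunk-key")
	if err := binary.Write(stream, binary.BigEndian, uint32(len(req))); err != nil {
		log.Fatal(err)
	}
	if _, err := stream.Write(req); err != nil {
		log.Fatal(err)
	}

	// Binary out: read the announced size, then exactly that many bytes.
	var size uint32
	if err := binary.Read(stream, binary.BigEndian, &size); err != nil {
		log.Fatal(err)
	}
	resp := make([]byte, size)
	if _, err := io.ReadFull(stream, resp); err != nil {
		log.Fatal(err)
	}
	log.Printf("got %d bytes", len(resp))
}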
Locking is simple: the first user to ask for a key gets a "true" answer, tries to fetch the file to the download location, then releases the lock with another RPC call. Other users asking for the same key get no immediate answer; they just wait for the lock to be released, then get a "false" answer.
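Those lock semantics can be sketched like this (type and method names are invented for the example): the first caller gets true and must download then release; every waiter gets false once the release happens.

// locktable.go: minimal sketch of the download-lock semantics.
package main

import "sync"

type lockTable struct {
	mu    sync.Mutex
	inUse map[string]chan struct{} // closed when the holder releases
}

func newLockTable() *lockTable {
	return &lockTable{inUse: map[string]chan struct{}{}}
}

// Acquire returns true if the caller should download the chunk itself.
// Otherwise it waits for the current holder to finish and returns false,
// meaning the chunk should now be in the local cache.
func (l *lockTable) Acquire(key string) bool {
	l.mu.Lock()
	done, held := l.inUse[key]
	if !held {
		l.inUse[key] = make(chan struct{})
		l.mu.Unlock()
		return true
	}
	l.mu.Unlock()
	<-done // wait for the first caller's Release
	return false
}

// Release is the "other RPC call": it unblocks every waiter.
func (l *lockTable) Release(key string) {
	l.mu.Lock()
	if done, ok := l.inUse[key]; ok {
		close(done)
		delete(l.inUse, key)
	}
	l.mu.Unlock()
}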
The paper says that Lambda uses one more level of cache, some key/value service.