firecracker-microvm / firecracker

Secure and fast microVMs for serverless computing.
http://firecracker-microvm.io
Apache License 2.0

[Bug] Unexpected high latency on first invocation of python's os.urandom #3549

Closed wangtianxia-sjtu closed 1 year ago

wangtianxia-sjtu commented 1 year ago

Describe the bug

When I start a Firecracker virtual machine and run my Python code, the first call to os.urandom is very slow (about 2 seconds). Subsequent calls return quickly.

To Reproduce

  1. Start Firecracker with the following vm_config.json:
    {
      "boot-source": {
        "kernel_image_path": "vmlinux-5.10-x86_64.bin",
        "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        "initrd_path": null
      },
      "drives": [
        {
          "drive_id": "rootfs",
          "path_on_host": "bionic.rootfs-with-python36.ext4",
          "is_root_device": true,
          "partuuid": null,
          "is_read_only": false,
          "cache_type": "Unsafe",
          "io_engine": "Sync",
          "rate_limiter": null
        }
      ],
      "machine-config": {
        "vcpu_count": 2,
        "mem_size_mib": 256,
        "smt": false,
        "track_dirty_pages": false
      },
      "balloon": null,
      "network-interfaces": [],
      "vsock": null,
      "logger": null,
      "metrics": null,
      "mmds-config": null
    }

    Start Firecracker:

    firecracker --no-api --config-file ./vm_config.json
  2. Run Python 3.6 inside the guest:
Welcome to Ubuntu 18.04.6 LTS (GNU/Linux 5.10.0 x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.
root@ubuntu-fc-uvm:~# python3 --version
Python 3.6.9
root@ubuntu-fc-uvm:~# time python3 -c "import os; os.urandom(5)"

real    0m2.249s
user    0m0.016s
sys     0m1.207s
root@ubuntu-fc-uvm:~# time python3 -c "import os; os.urandom(5)"

real    0m0.016s
user    0m0.016s
sys     0m0.000s
root@ubuntu-fc-uvm:~# time python3 -c "import os; os.urandom(5)"

real    0m0.017s
user    0m0.013s
sys     0m0.005s
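
The same measurement can be taken from inside a single Python process, which separates the os.urandom latency from interpreter start-up time. This is a minimal sketch, not part of the original report; the function name time_urandom_calls and its parameters are made up for illustration.

```python
import os
import time


def time_urandom_calls(num_bytes=5, calls=3):
    """Time successive os.urandom calls and return the durations in seconds."""
    durations = []
    for _ in range(calls):
        start = time.perf_counter()
        os.urandom(num_bytes)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    for index, seconds in enumerate(time_urandom_calls(), start=1):
        print(f"call {index}: {seconds * 1000:.3f} ms")
```

If the behaviour reported above holds, only the first call made right after boot should stand out. On a system whose entropy pool is already initialized, all calls should be fast.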

Expected behaviour

The first invocation of os.urandom should not take this long.

Additional context

import numpy uses a random number from os.urandom, so importing it also incurs this delay.

wangtianxia-sjtu commented 1 year ago

Python's os.urandom uses the getrandom syscall to obtain random bytes. I have written a simple C program to reproduce the problem.

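#include <stdio.h>
#include <stdlib.h>
#include <sys/random.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s num_bytes\n", argv[0]);
        return 1;
    }

    /* Number of random bytes to request; must be positive. */
    int num_bytes = atoi(argv[1]);
    if (num_bytes <= 0) {
        fprintf(stderr, "num_bytes must be a positive integer\n");
        return 1;
    }

    unsigned char *buf = malloc(num_bytes);
    if (buf == NULL) {
        perror("Failed to allocate memory");
        return 1;
    }

    /* flags = 0: block until the kernel's urandom source is initialized. */
    if (getrandom(buf, num_bytes, 0) != num_bytes) {
        perror("getrandom failed");
        free(buf);
        return 1;
    }

    /* Print the bytes as lowercase hex. */
    printf("Random bytes: ");
    for (int i = 0; i < num_bytes; i++) {
        printf("%02x", buf[i]);
    }
    printf("\n");

    free(buf);

    return 0;
}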

Compile the program to a.out and run:

./a.out 5

to generate 5 random bytes.

The first execution of the command in the Firecracker virtual machine blocks for about 2 seconds. Subsequent executions take only about 1 millisecond.

bchalios commented 1 year ago

Thanks for reaching out @wangtianxia-sjtu.

I suspect that what you see is expected behaviour. Quoting man 2 getrandom:

If the urandom source has not yet been initialized, then getrandom() will block, unless GRND_NONBLOCK is specified in flags.

In order to validate my hypothesis, could you:

  1. Check the contents of /proc/sys/kernel/random/entropy_avail immediately after booting?
  2. Pass the GRND_NONBLOCK flag to getrandom and see if it still blocks on the first call?
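
Both checks can also be run from Python inside the guest. The following is a minimal sketch, assuming Linux and Python 3.6 or later, where os.getrandom and os.GRND_NONBLOCK are available. The helper names read_entropy_avail and urandom_would_block are hypothetical and only for illustration.

```python
import os

ENTROPY_AVAIL_PATH = "/proc/sys/kernel/random/entropy_avail"


def read_entropy_avail(path=ENTROPY_AVAIL_PATH):
    """Return the kernel's entropy estimate, or None if the file is unreadable."""
    try:
        with open(path) as entropy_file:
            return int(entropy_file.read().strip())
    except (OSError, ValueError):
        return None


def urandom_would_block(num_bytes=1):
    """Return True if getrandom() would block, i.e. the source is not initialized yet."""
    try:
        # With GRND_NONBLOCK the call fails with EAGAIN instead of blocking;
        # Python surfaces that error as BlockingIOError.
        os.getrandom(num_bytes, os.GRND_NONBLOCK)
    except BlockingIOError:
        return True
    return False


if __name__ == "__main__":
    print("entropy_avail:", read_entropy_avail())
    print("getrandom would block:", urandom_would_block())
```

Run immediately after boot, a True result from urandom_would_block would be consistent with the hypothesis that the guest has not yet initialized its random source.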

Some context: At the moment Firecracker does not emulate any entropy device, e.g. virtio-rng. As a result, the guest OS sometimes has not collected enough entropy during boot to initialize its PRNGs.

Take a look as well at past related issues:

wangtianxia-sjtu commented 1 year ago

  1. Check /proc/sys/kernel/random/entropy_avail

    # Boot logs omitted
    root@ubuntu-fc-uvm:~# cat /proc/sys/kernel/random/entropy_avail
    30
    root@ubuntu-fc-uvm:~# time ./a.out 5 # generate 5 random bytes first time
    Random bytes: f4f23f8915

    real    0m2.241s
    user    0m0.000s
    sys     0m1.217s
    root@ubuntu-fc-uvm:~# cat /proc/sys/kernel/random/entropy_avail
    2
    root@ubuntu-fc-uvm:~# time ./a.out 5 # generate 5 random bytes second time
    Random bytes: 028ff90463

    real    0m0.001s
    user    0m0.000s
    sys     0m0.001s

  2. Add GRND_NONBLOCK flag to getrandom call. The call will never block but will always return an error.

    root@ubuntu-fc-uvm:~# time ./a.out 5
    getrandom failed: Resource temporarily unavailable

    real    0m0.002s
    user    0m0.002s
    sys     0m0.000s
    root@ubuntu-fc-uvm:~# time ./a.out 5
    getrandom failed: Resource temporarily unavailable

    real    0m0.002s
    user    0m0.001s
    sys     0m0.000s

PS: Is there a workaround for this? During import numpy, Python depends on a random number from os.urandom, so this issue causes a long latency.

bchalios commented 1 year ago

Ok, so that indeed is the problem.

At the moment, on x86 you can tell the guest kernel to trust the host's RDRAND: https://github.com/firecracker-microvm/firecracker/issues/663#issuecomment-486174174

Another solution could be to start the rngd daemon early in the boot process: https://github.com/firecracker-microvm/firecracker/issues/663#issuecomment-481849971

wangtianxia-sjtu commented 1 year ago

Thanks, adding random.trust_cpu=on to the boot parameters will work.
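
For anyone applying the same workaround to a config file like the vm_config.json above, the boot arguments can be patched programmatically. This is a minimal sketch, assuming the config layout shown earlier in this thread ("boot-source" with a "boot_args" string). The helper name add_boot_arg is hypothetical and not part of Firecracker.

```python
import json


def add_boot_arg(config, arg="random.trust_cpu=on"):
    """Append a kernel boot argument to a Firecracker-style config dict if missing."""
    boot_source = config.setdefault("boot-source", {})
    # boot_args may be absent or null; treat both as an empty argument list.
    args = (boot_source.get("boot_args") or "").split()
    if arg not in args:
        args.append(arg)
    boot_source["boot_args"] = " ".join(args)
    return config


if __name__ == "__main__":
    config = {
        "boot-source": {
            "kernel_image_path": "vmlinux-5.10-x86_64.bin",
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        }
    }
    print(json.dumps(add_boot_arg(config), indent=2))
```

With the boot arguments from the original report, boot_args becomes "console=ttyS0 reboot=k panic=1 pci=off random.trust_cpu=on". Calling the helper a second time leaves the string unchanged.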