firecracker-microvm / firecracker

Secure and fast microVMs for serverless computing.
http://firecracker-microvm.io
Apache License 2.0
25.09k stars 1.76k forks source link

[UX] Pass configuration as file #923

Closed blitz closed 4 years ago

blitz commented 5 years ago

At the moment, Firecracker is configured by sending API requests via a AF_LOCAL socket that Firecracker creates. This means that a user of Firecracker has to:

This is unfortunate, because it requires quite a bit of code and a lot of synchronous and stateful interaction just to get a basic VM going. As an alternative, I would propose handing in the static configuration of a VM (vCPU configuration, disks, network devices, ...) as a JSON file that is provided to Firecracker when it is started (via a file or via stdin or if it's small via the command line). Firecracker then requires no additional interaction from starting the process to running guest code.

This relates to #343.

blitz commented 5 years ago

Or to put it differently:

“Simple things should be simple, complex things should be possible.” ― Alan Kay

rn commented 5 years ago

I like the idea of using a JSON file as its format can be largely the same as the REST API. I understand why the REST API is there, but for development it seems that everyone has some scripts around to start firecracker with curl invocations. I probably have three variations.

blitz commented 5 years ago

I understand why the REST API is there

I would be really interested in the rationale for why the initial VM configuration is designed this way. From my current understanding, I think this adds lots of complexity on both the user and Firecracker's side of things.

blitz commented 5 years ago

I did some crude measurements and from starting firecracker to actually running the VM (i.e. getting a response to the InstanceStart request), it takes more than 30ms. 2ms are waiting for the API socket to appear and the rest is for the 5 requests via the AF_LOCAL socket. I'd assume that most of this would go away when just using a config file.

wkozaczuk commented 5 years ago

Hi,

I have just submitted an RFC pull request that provides a rough proposal implementation that works. I have experimented with it a bit and on my 5-years old MacBook I am able to start and complete new VM within 20ms. However for some reason exiting new executable process takes additional 30-40ms. Please see output of strace:

2019-05-30T09:37:08.955724737 [anonymous-instance:ERROR:vmm/src/lib.rs:2209] init_logger() failed!
2019-05-30T09:37:08.955781053 [anonymous-instance:INFO:vmm/src/lib.rs:1388] VMM received instance start command
2019-05-30T09:37:08.955887905 [anonymous-instance:INFO:vmm/src/vstate.rs:140] Guest memory starts at 7fc083b8f000
OSv v0.53.0-29-ge37d8edb
2019-05-30T09:37:08.970544264 [anonymous-instance:INFO:vmm/src/lib.rs:2051] Guest-boot-time =  14725 us 14 ms,  13236 CPU us 13 CPU ms
Booted up in 7.49 ms
Hello from C code
2019-05-30T09:37:08.973855390 [anonymous-instance:INFO:vmm/src/lib.rs:1471] Vmm is stopping.
2019-05-30T09:37:08.974020041 [anonymous-instance:ERROR:vmm/src/lib.rs:1488] Failed to log metrics while stopping: Logger was not initialized.
2019-05-30T09:37:08.974056780 [anonymous-instance:INFO:vmm/src/lib.rs:1493] Vmm is REALLY about to stop.
INFO, sa_restorer=0x53fc77}, NULL, 8) = 0
09:37:08.951057 rt_sigaction(SIGBUS, {sa_handler=0x517460, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_SIGINFO, sa_restorer=0x53fc77}, NULL, 8) = 0
09:37:08.951109 sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
09:37:08.951161 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc087b90000
09:37:08.951213 sigaltstack({ss_sp=0x7fc087b90000, ss_flags=0, ss_size=8192}, NULL) = 0
09:37:08.951263 brk(NULL)               = 0xa02000
09:37:08.951310 brk(0xa03000)           = 0xa03000
09:37:08.951421 open("/etc/localtime", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 3
09:37:08.951491 fstat(3, {st_mode=S_IFREG|0644, st_size=3536, ...}) = 0
09:37:08.951549 mmap(NULL, 3536, PROT_READ, MAP_SHARED, 3, 0) = 0x7fc087b8f000
09:37:08.951602 close(3)                = 0
09:37:08.951716 write(2, "2019-05-30T09:37:08.951409599 [a"..., 972019-05-30T09:37:08.951409599 [anonymous-instance:ERROR:src/bin/firecracker_cmd.rs:41] Main START) = 97
09:37:08.951791 write(2, "\n", 1
)       = 1
09:37:08.951851 rt_sigaction(SIGSYS, {sa_handler=0x424000, sa_mask=~[RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x53fc77}, NULL, 8) = 0
09:37:08.951908 rt_sigaction(SIGBUS, {sa_handler=0x424310, sa_mask=~[RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x53fc77}, NULL, 8) = 0
09:37:08.951963 rt_sigaction(SIGSEGV, {sa_handler=0x424310, sa_mask=~[RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x53fc77}, NULL, 8) = 0
09:37:08.952023 brk(0xa04000)           = 0xa04000
09:37:08.952090 getrandom("\xfa\x84\x22\x4b\xb2\xfc\x62\x72\x79\x0f\xe1\x02\x2c\xcb\x20\x57", 16, GRND_NONBLOCK) = 16
09:37:08.952334 brk(0xa05000)           = 0xa05000
09:37:08.952533 eventfd2(0, EFD_NONBLOCK) = 3
09:37:08.952646 epoll_create1(EPOLL_CLOEXEC) = 4
09:37:08.952727 epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=1, u64=1}}) = 0
09:37:08.952808 timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC|TFD_NONBLOCK) = 5
09:37:08.952871 epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=2, u64=2}}) = 0
09:37:08.952936 open("/dev/kvm", O_RDWR|O_CLOEXEC) = 6
09:37:08.953016 fcntl(6, F_SETFD, FD_CLOEXEC) = 0
09:37:08.953074 ioctl(6, KVM_GET_API_VERSION, 0) = 12
09:37:08.953132 ioctl(6, KVM_CHECK_EXTENSION, KVM_CAP_IRQCHIP) = 1
09:37:08.953192 ioctl(6, KVM_CHECK_EXTENSION, KVM_CAP_IOEVENTFD) = 1
09:37:08.953249 ioctl(6, KVM_CHECK_EXTENSION, KVM_CAP_IRQFD) = 1
09:37:08.953304 ioctl(6, KVM_CHECK_EXTENSION, KVM_CAP_SET_TSS_ADDR) = 1
09:37:08.953359 ioctl(6, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY) = 1
09:37:08.953414 ioctl(6, KVM_CHECK_EXTENSION, KVM_CAP_NR_MEMSLOTS) = 509
09:37:08.953470 ioctl(6, KVM_CREATE_VM, 0) = 7
09:37:08.954319 ioctl(6, KVM_GET_VCPU_MMAP_SIZE, 0) = 12288
09:37:08.954429 brk(0xa06000)           = 0xa06000
09:37:08.954528 ioctl(6, KVM_GET_SUPPORTED_CPUID, {nent=34, entries=[...]}) = 0
09:37:08.954690 eventfd2(0, EFD_NONBLOCK) = 8
09:37:08.954754 eventfd2(0, EFD_NONBLOCK) = 9
09:37:08.954812 eventfd2(0, EFD_NONBLOCK) = 10
09:37:08.954871 dup(8)                  = 11
09:37:08.954930 eventfd2(0, EFD_NONBLOCK) = 12
09:37:08.954987 dup(10)                 = 13
09:37:08.955043 brk(0xa09000)           = 0xa09000
09:37:08.955118 open("/home/wkozaczuk/projects/osv/build/release/loader-stripped.elf", O_RDONLY|O_CLOEXEC) = 14
09:37:08.955214 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
09:37:08.955277 fcntl(14, F_GETFD)      = 0x1 (flags FD_CLOEXEC)
09:37:08.955339 brk(0xa1a000)           = 0xa1a000
09:37:08.955421 stat("/home/wkozaczuk/projects/osv/build/release/usr.raw", {st_mode=S_IFREG|0644, st_size=6457344, ...}) = 0
09:37:08.955508 open("/tmp/fc.log", O_RDWR|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
09:37:08.955740 write(1, "2019-05-30T09:37:08.955724737 [a"..., 99) = 99
09:37:08.955793 write(1, "2019-05-30T09:37:08.955781053 [a"..., 112) = 112
09:37:08.955824 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=3034748}) = 0
09:37:08.955853 mmap(NULL, 67108864, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc083b8f000
09:37:08.955899 write(1, "2019-05-30T09:37:08.955887905 [a"..., 114) = 114
09:37:08.955941 ioctl(7, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0, memory_size=67108864, userspace_addr=0x7fc083b8f000}) = 0
09:37:08.956024 ioctl(7, KVM_SET_TSS_ADDR, 0xfffbd000) = 0
09:37:08.956062 ioctl(7, KVM_CREATE_IRQCHIP, 0) = 0
09:37:08.956116 ioctl(7, KVM_CREATE_PIT2, 0x7ffcefc95670) = 0
09:37:08.956220 open("/home/wkozaczuk/projects/osv/build/release/usr.raw", O_RDWR|O_CLOEXEC) = 15
09:37:08.956267 fcntl(15, F_SETFD, FD_CLOEXEC) = 0
09:37:08.956307 lseek(15, 0, SEEK_END)  = 6457344
09:37:08.956342 eventfd2(0, EFD_NONBLOCK) = 16
09:37:08.956376 eventfd2(0, EFD_NONBLOCK) = 17
09:37:08.956412 ioctl(7, KVM_IOEVENTFD, 0x7ffcefc94fc0) = 0
09:37:08.956451 ioctl(7, KVM_IRQFD, 0x7ffcefc95040) = 0
09:37:08.956503 dup(9)                  = 18
09:37:08.956716 dup(8)                  = 19
09:37:08.956775 dup(9)                  = 20
09:37:08.956854 ioctl(0, TIOCGWINSZ, {ws_row=35, ws_col=110, ws_xpixel=0, ws_ypixel=0}) = 0
09:37:08.957107 ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
09:37:08.957172 ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost -isig -icanon -echo ...}) = 0
09:37:08.957213 ioctl(7, KVM_IRQFD, 0x7ffcefc95670) = 0
09:37:08.957268 ioctl(7, KVM_IRQFD, 0x7ffcefc95670) = 0
09:37:08.957318 ioctl(7, KVM_IRQFD, 0x7ffcefc95670) = 0
09:37:08.957343 lseek(14, 0, SEEK_SET)  = 0
09:37:08.957365 read(14, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0*\241;\0\0\0\0\0"..., 64) = 64
09:37:08.957394 lseek(14, 64, SEEK_SET) = 64
09:37:08.957418 read(14, "\1\0\0\0\7\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0\0\0\0\0\0 \0\0\0\0\0"..., 280) = 280
09:37:08.957441 lseek(14, 0, SEEK_SET)  = 0
09:37:08.957463 read(14, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0*\241;\0\0\0\0\0"..., 6967428) = 6967428
09:37:08.961384 lseek(14, 6967424, SEEK_SET) = 6967424
09:37:08.961430 read(14, "\1\0\0\0", 4) = 4
09:37:08.961472 ioctl(7, KVM_CREATE_VCPU, 0) = 21
09:37:08.961833 mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_SHARED, 21, 0) = 0x7fc083b8b000
09:37:08.961874 ioctl(21, KVM_SET_CPUID2, {nent=34, entries=[...]}) = 0
09:37:08.961907 ioctl(21, KVM_SET_MSRS, 0xa19200) = 10
09:37:08.961933 ioctl(21, KVM_SET_REGS, {rax=0, ..., rsp=0x8ff0, rbp=0x8ff0, ..., rip=0x3ba12a, rflags=0x2}) = 0
09:37:08.961960 ioctl(21, KVM_SET_FPU, 0x7ffcefc95670) = 0
9:37:08.961983 ioctl(21, KVM_GET_SREGS, {cs={base=0xffff0000, limit=65535, selector=61440, type=11, present=1, dpl=0, db=0, s=1, l=0, g=0, avl=0}, ...}) = 0
09:37:08.962033 ioctl(21, KVM_SET_SREGS, {cs={base=0, limit=1048575, selector=8, type=11, present=1, dpl=0, db=0, s=1, l=1, g=1, avl=0}, ...}) = 0
09:37:08.962067 ioctl(21, KVM_GET_LAPIC, 0x7ffcefc95670) = 0
09:37:08.962097 ioctl(21, KVM_SET_LAPIC, 0x7ffcefc96670) = 0
09:37:08.962132 dup(12)                 = 22
09:37:08.962155 epoll_ctl(4, EPOLL_CTL_ADD, 22, {EPOLLIN, {u32=6, u64=6}}) = 0
09:37:08.962181 epoll_ctl(4, EPOLL_CTL_ADD, 0, {EPOLLIN, {u32=0, u64=0}}) = 0
09:37:08.962207 dup(12)                 = 23
09:37:08.962236 rt_sigprocmask(SIG_UNBLOCK, [RT_1 RT_2], NULL, 8) = 0
09:37:08.962261 mmap(NULL, 2105344, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc083989000
09:37:08.962285 mprotect(0x7fc08398a000, 2101248, PROT_READ|PROT_WRITE) = 0
09:37:08.962328 clone(child_stack=0x7fc083b8a9d8, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000, parent_tidptr=0x7fc083b8ab38, tls=0x7fc083b8ab00, child_tidptr=0x7fc083b8ab40) = 6160
09:37:08.962464 brk(0xa1b000)           = 0xa1b000
09:37:08.962499 futex(0x7773a0, FUTEX_WAKE_PRIVATE, 1) = 1
09:37:08.962536 futex(0x7773a0, FUTEX_WAIT_PRIVATE, 1, NULL) = -1 EAGAIN (Resource temporarily unavailable)
09:37:08.962568 brk(0xa1e000)           = 0xa1e000
09:37:08.962596 futex(0x7773a0, FUTEX_WAKE_PRIVATE, 1) = 1
09:37:08.962718 prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0
09:37:08.962750 prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len=393, filter=0xa1ddc0}) = 0
09:37:08.962926 futex(0x7fc083b8a574, FUTEX_WAKE_PRIVATE, 1) = 1
09:37:08.962955 futex(0xa19124, FUTEX_WAKE_PRIVATE, 1) = 1
09:37:08.962985 timerfd_settime(5, 0, {it_interval={tv_sec=60, tv_nsec=0}, it_value={tv_sec=60, tv_nsec=0}}, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}) = 0
09:37:08.963025 epoll_pwait(4, [{EPOLLIN, {u32=3, u64=3}}], 100, -1, NULL, 8) = 1
09:37:08.969749 read(16, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.969801 lseek(15, 0, SEEK_SET)  = 0
09:37:08.969834 read(15, "\352^|\0\0\0\0\0\0\0\0\0\0\0\0\0\177\0\20\0@\0\0\0\0\200\200\0\0\0\0\0"..., 512) = 512
09:37:08.969875 write(24, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.969907 epoll_pwait(4, [{EPOLLIN, {u32=3, u64=3}}], 100, -1, NULL, 8) = 1
09:37:08.970058 read(16, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.970085 lseek(15, 6291456, SEEK_SET) = 6291456
09:37:08.970109 read(15, "\255\276\255\336\0\0\0\0\1\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0B\1\0\0\0\0\0\0"..., 512) = 512
09:37:08.970136 write(24, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.970162 epoll_pwait(4, [{EPOLLIN, {u32=3, u64=3}}], 100, -1, NULL, 8) = 1
09:37:08.970205 read(16, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.970232 lseek(15, 6456320, SEEK_SET) = 6456320
09:37:08.970256 read(15, "\7\0\0\0\0\0\0\0\6\0mnttab\10\0\0\0\0\0\0\0\5\0fstab\t"..., 1024) = 1024
09:37:08.970282 write(24, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.970308 epoll_pwait(4, [{EPOLLIN, {u32=3, u64=3}}], 100, -1, NULL, 8) = 1
09:37:08.970364 read(16, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.970391 lseek(15, 6305792, SEEK_SET) = 6305792
09:37:08.970415 read(15, "/dev/vblk0.1 /          rofs    "..., 512) = 512
09:37:08.970442 write(24, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.970468 epoll_pwait(4, [{EPOLLIN, {u32=3, u64=3}}], 100, -1, NULL, 8) = 1
09:37:08.970900 read(16, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.970957 lseek(15, 6291968, SEEK_SET) = 6291968
09:37:08.970990 read(15, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\20\0\0\0\0\0\0"..., 13824) = 13824
09:37:08.971060 write(24, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.971107 epoll_pwait(4, [{EPOLLIN, {u32=3, u64=3}}], 100, -1, NULL, 8) = 1
09:37:08.971200 read(16, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.971234 lseek(15, 6425088, SEEK_SET) = 6425088
09:37:08.971257 read(15, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\20\0\0\0\0\0\0"..., 16896) = 16896
09:37:08.971321 write(24, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.971353 epoll_pwait(4, [{EPOLLIN, {u32=6, u64=6}}], 100, -1, NULL, 8) = 1
09:37:08.973828 read(22, "\1\0\0\0\0\0\0\0", 8) = 8
09:37:08.973872 write(1, "2019-05-30T09:37:08.973855390 [a"..., 93) = 93
09:37:08.973906 epoll_ctl(4, EPOLL_CTL_DEL, 0, 0x7ffcefc99430) = 0
09:37:08.973934 ioctl(0, TIOCGWINSZ, {ws_row=35, ws_col=110, ws_xpixel=0, ws_ypixel=0}) = 0
09:37:08.973964 ioctl(0, TCGETS, {B38400 opost -isig -icanon -echo ...}) = 0
09:37:08.973994 ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = 0
09:37:08.974031 write(1, "2019-05-30T09:37:08.974020041 [a"..., 143) = 143
09:37:08.974079 write(1, "2019-05-30T09:37:08.974056780 [a"..., 105) = 105
09:37:08.974119 exit_group(0)           = ?
09:37:09.011016 +++ exited with 0 +++

As you can see it started at 09:37:08.951057 and exit_group() was called at 09:37:08.974119 (23 milliseconds in this case) but per last line 09:37:09.011016 +++ exited with 0 +++ the process did not die for another 35 milliseconds.

What could be a cause of it? As far as I understand exit_group() terminates all threads and in this case there is one vCPU thread apart from the main one. But I do not understand why it would take so long for this thread to terminate. Any ideas? Where is a flaw in my implementation?

blitz commented 5 years ago

@wkozaczuk Very nice! One comment: strace is usually not a good way to measure time, because the system call tracing it uses is very inefficient. So likely the performance is better than what you measured above! For measurements like these, it's better to have the application itself generate timestamps and log them.

the process did not die for another 35 milliseconds. What could be a cause of it?

Wild guess: Maybe it only counts as fully destroyed once the parent process fetches the exit status via wait().

wkozaczuk commented 5 years ago

Since my last comments I have made more improvements to my "command-line" prototype of firecracker with aim to improve startup and shutdown time. The shutdown part is really where I think the crux of the problem lies. I ended up creating another branch (for those interested - https://github.com/wkozaczuk/firecracker/tree/sockless_optimize, please see the diff between this and older sockless branch). I mostly followed @blitz's advice to make vcpu thread (child thread) end orderly and main parent thread join it instead of calling exit() on guest shutdown request.

In general I managed to shave off around 10ms but there is still 20ms wasted somehow after the process terminates as reported per time.

Here are the average times from the 25-times run of firecracker launching simple OSv hello world example with single vCPU back-to-back:

# 25 timestamp from the start to the point main thread joins with vCPU thread after guest exits
2019-06-17T13:49:12.640563968 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  21322 us 21 ms,  21223 CPU us 21 CPU ms
2019-06-17T13:49:12.682214105 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  18836 us 18 ms,  18594 CPU us 18 CPU ms
2019-06-17T13:49:12.730072139 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  18434 us 18 ms,  18361 CPU us 18 CPU ms
2019-06-17T13:49:12.775873532 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  20468 us 20 ms,  20379 CPU us 20 CPU ms
2019-06-17T13:49:12.820857486 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  21327 us 21 ms,  21293 CPU us 21 CPU ms
2019-06-17T13:49:12.866860434 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  23101 us 23 ms,  22889 CPU us 22 CPU ms
2019-06-17T13:49:12.909018552 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  17343 us 17 ms,  17037 CPU us 17 CPU ms
2019-06-17T13:49:12.949402089 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  18104 us 18 ms,  17914 CPU us 17 CPU ms
2019-06-17T13:49:12.994497962 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  19124 us 19 ms,  18996 CPU us 18 CPU ms
2019-06-17T13:49:13.047007523 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  18947 us 18 ms,  19130 CPU us 19 CPU ms
2019-06-17T13:49:13.092817255 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  17294 us 17 ms,  17246 CPU us 17 CPU ms
2019-06-17T13:49:13.135817648 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  16739 us 16 ms,  16446 CPU us 16 CPU ms
2019-06-17T13:49:13.177466580 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  17724 us 17 ms,  17634 CPU us 17 CPU ms
2019-06-17T13:49:13.225039463 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  17527 us 17 ms,  17728 CPU us 17 CPU ms
2019-06-17T13:49:13.266592089 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  19303 us 19 ms,  19168 CPU us 19 CPU ms
2019-06-17T13:49:13.318211946 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  23030 us 23 ms,  23048 CPU us 23 CPU ms
2019-06-17T13:49:13.363688065 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  20529 us 20 ms,  20286 CPU us 20 CPU ms
2019-06-17T13:49:13.406490452 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  19238 us 19 ms,  19122 CPU us 19 CPU ms
2019-06-17T13:49:13.457479297 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  17869 us 17 ms,  17884 CPU us 17 CPU ms
2019-06-17T13:49:13.502832220 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  19518 us 19 ms,  19333 CPU us 19 CPU ms
2019-06-17T13:49:13.557177521 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  21434 us 21 ms,  21575 CPU us 21 CPU ms
2019-06-17T13:49:13.597015436 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  17657 us 17 ms,  17285 CPU us 17 CPU ms
2019-06-17T13:49:13.649213336 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  21849 us 21 ms,  21942 CPU us 21 CPU ms
2019-06-17T13:49:13.696826433 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  21320 us 21 ms,  21375 CPU us 21 CPU ms
#  
# Complete-time average = 18.28 ms (total 457 ms)
#
# 25 timestamp from the start to the last statement in the main thread before the process gracefully terminates
2019-06-17T13:49:12.648992634 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  30299 us 30 ms
2019-06-17T13:49:12.692673905 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  29789 us 29 ms
2019-06-17T13:49:12.740738937 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  29698 us 29 ms
2019-06-17T13:49:12.784871116 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  30045 us 30 ms
2019-06-17T13:49:12.828920584 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  29935 us 29 ms
2019-06-17T13:49:12.877089259 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  33938 us 33 ms
2019-06-17T13:49:12.916101376 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  25037 us 25 ms
2019-06-17T13:49:12.960673183 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  29906 us 29 ms
2019-06-17T13:49:13.009732058 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  34972 us 34 ms
2019-06-17T13:49:13.056692283 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  29389 us 29 ms
2019-06-17T13:49:13.100850529 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  25972 us 25 ms
2019-06-17T13:49:13.144763049 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  26165 us 26 ms
2019-06-17T13:49:13.185151758 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  26083 us 26 ms
2019-06-17T13:49:13.232242773 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  25286 us 25 ms
2019-06-17T13:49:13.276543838 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  29784 us 29 ms
2019-06-17T13:49:13.328460388 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  33780 us 33 ms
2019-06-17T13:49:13.372791351 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  30171 us 30 ms
2019-06-17T13:49:13.417434986 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  30694 us 30 ms
2019-06-17T13:49:13.468387611 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  29453 us 29 ms
2019-06-17T13:49:13.513398517 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  30631 us 30 ms
2019-06-17T13:49:13.564446818 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  29488 us 29 ms
2019-06-17T13:49:13.609289232 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  30427 us 30 ms
2019-06-17T13:49:13.657701352 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  30879 us 30 ms
2019-06-17T13:49:13.704409707 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  29505 us 29 ms
#  
# Main-complete-time average = 27.92 ms (total 692 ms)
#
real    0m1.153s
user    0m0.140s
sys 0m0.489s
# The average run per time is 46.12 ms

As you can see on average it took:

So there are 2 questions outstanding:

  1. What is the time spent on between the point vCPU loop terminates and the point the main thread completes?
  2. What is the time spent on between the point the main thread completes and the point Linux considers the process really terminated per time utility?

Regarding 1) it might be useful to look at this fragment of strace output:

.....
2019-06-17T14:43:16.884706949 [anonymous-instance:ERROR:vmm/src/vstate.rs:441] run_emulation() loop broke!
14:43:16.884730 read(22, "\1\0\0\0\0\0\0\0", 8) = 8
14:43:16.884775 write(1, "2019-06-17T14:43:16.884758950 [a"..., 1062019-06-17T14:43:16.884758950 [anonymous-instance:ERROR:vmm/src/lib.rs:1559] Stopping from EpollDispatch!
) = 106
14:43:16.884829 write(1, "2019-06-17T14:43:16.884807772 [a"..., 1162019-06-17T14:43:16.884807772 [anonymous-instance:ERROR:vmm/src/lib.rs:2258] Gracefully terminated VMM control loop
) = 116
14:43:16.884913 write(1, "2019-06-17T14:43:16.884897515 [a"..., 932019-06-17T14:43:16.884897515 [anonymous-instance:INFO:vmm/src/lib.rs:1491] Vmm is stopping.
) = 93
14:43:16.884948 epoll_ctl(4, EPOLL_CTL_DEL, 0, 0x7ffd3ff6b1c0) = 0
14:43:16.884980 ioctl(0, TIOCGWINSZ, {ws_row=46, ws_col=225, ws_xpixel=1575, ws_ypixel=736}) = 0
14:43:16.885009 ioctl(0, TCGETS, {B9600 opost -isig -icanon -echo ...}) = 0
14:43:16.885037 ioctl(0, SNDCTL_TMR_START or TCSETS, {B9600 opost isig icanon echo ...}) = 0
14:43:16.885093 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=18464886}) = 0
14:43:16.885133 write(1, "2019-06-17T14:43:16.885120335 [a"..., 1332019-06-17T14:43:16.885120335 [anonymous-instance:INFO:vmm/src/lib.rs:2092] Complete-time =  17088 us 17 ms,  15458 CPU us 15 CPU ms
) = 133
14:43:16.885179 munmap(0x7f98e4d20000, 2105344) = 0
14:43:16.885234 write(1, "2019-06-17T14:43:16.885223404 [a"..., 1142019-06-17T14:43:16.885223404 [anonymous-instance:INFO:vmm/src/lib.rs:1516] Vmm is REALLY about to stop .. again.
) = 114
14:43:16.885269 close(6)                = 0
14:43:16.885302 close(14)               = 0
14:43:16.885343 close(22)               = 0
14:43:16.885368 close(7)                = 0
14:43:16.893616 close(17)               = 0
14:43:16.893658 close(20)               = 0
14:43:16.893690 close(18)               = 0
14:43:16.893720 close(19)               = 0
14:43:16.893762 close(11)               = 0
14:43:16.893795 close(12)               = 0
14:43:16.893826 close(13)               = 0
14:43:16.893857 close(8)                = 0
14:43:16.893887 close(9)                = 0
14:43:16.893950 close(10)               = 0
14:43:16.893983 close(4)                = 0
14:43:16.894034 munmap(0x7f98e4f26000, 67108864) = 0
14:43:16.896906 close(15)               = 0
14:43:16.896944 close(24)               = 0
14:43:16.896977 close(16)               = 0
14:43:16.897005 close(25)               = 0
14:43:16.897036 close(3)                = 0
14:43:16.897080 close(5)                = 0
14:43:16.897150 write(1, "2019-06-17T14:43:16.897106867 [a"..., 1242019-06-17T14:43:16.897106867 [anonymous-instance:INFO:src/bin/firecracker_cmd.rs:216] Main-complete-time =  33134 us 33 ms
) = 124
14:43:16.897210 sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=8192}, NULL) = 0
14:43:16.897259 munmap(0x7f98e8f27000, 8192) = 0
14:43:16.897317 exit_group(0)           = ?
14:43:16.914219 +++ exited with 0 +++

real    0m0.059s
user    0m0.004s
sys 0m0.032s

These I think are "new" (comparing to the output I reported 2 weeks ago) system call that come of firecracker_cmd orderly closing all (?) file descriptors. I am new to Rust nor I understand firecracker code well, but I think it comes from one one of the drop() method (destructors) possibly triggered by destruction of the Vmm object? EpollContext? The bottom line is that (at least in this example) it takes ~ 10 ms to close around 15 file descriptors (one of them for /dev/kvm ?).

As far as the 2) question goes I have no idea what happens there. There must be some other resources Linux needs to close when handling implicit exit(). Any other ideas of what these could be?

Some wild guess - is it because of seccomp?

andreeaflorescu commented 5 years ago

A few weeks ago we had a design discussion about passing a configuration file instead of using API calls to configure the initial state of the microVM.

Passing the configuration

For passing the initial configuration we have a few options as follows:

  1. Command line parameters. At the moment we don't consider this to be the right way of implementing an alternative to the API server. When we first started Firecracker we used command line parameters for configuring resources such as network, boot source, drives and others. While the API server was under development we choose to keep both configuration options available. As time passed we had a few bugs due to the fact that the API server and the command line parameters were getting out of sync. As it was hard to maintain them both with only the API server being used in production, we decided to drop the command line parameters and today we still stand by this choice. For prototyping with Firecracker, people can use firectl, a command line tool which allows you to configure and run microVMs without interacting with the API server.
  2. Configuration file. The configuration file should be in the JSON format so that it can be easily deserialized into existing VMM structures using serde. In the meeting we identified some disadvantages to using files. For the usecases where the files need to be created on the fly, it adds overhead for writing the files on the customer side. There will also be overhead on the Firecracker side when reading the configuration file.
  3. Configuration via STDIN. This is currently the preferred mechanism for providing the initial configuration because it removes the overhead of creating files. If Firecracker users want to use files instead of STDIN, this can be easily achieved by redirecting the file contents to STDIN. The configuration should be in JSON format and map to existing VMM structures.

Running Firecracker without the API

Running Firecracker without the API should be possible, but the default will still be having the API server. We have two options here:

  1. Conditional compilation (with Rust features) for compiling out the API server. This should be a default Firecracker feature that can be disable by using: cargo build --release --no-default-features. The advantages of this solution are faster build time (as you don't need to download all the hyper and tokio dependencies) and smaller binary. It might make the code hard to read and add multiple cfgs if we have multiple usages of the api_server. I believe this isn't the case and this would be my favourite option. With this solution we don't need to pass a --no-api flag to Firecracker because we can use the cfg macro and the API code will be dropped hence not executed.
  2. No changes. We can compile the API server code, just not use it. We would also not initialize the API server (start a thread for it). The advantages/disadvantages are reversed from the conditional compilation option.

Some other requirements

Deserializing the configuration

For deserializing the configuration my proposal would be to define a VmmConfig structure that derives Serde deserialize. The fields of this struct are vmm_config structures which also should implement serde deserialize. For this to work, the Firecracker code should be refactored such that the ..Config structures are static in the sense that they do not contain live objects like files, eventfds or other similar objects. We should also discuss about having separate API and VMM structures where API structures are static and use serde serialize/deserialize and the VMM structures can have live objects and probably shouldn't have the standard serialize/deserialize functionality derived from serde.

use serde::Deserialize;

#[derive(Deserialize)]
#[serde(deny_unknown_fields)]
struct VmmConfig {
    boot_source: BootSourceConfig,
    block_devices: BlockDeviceConfigs,
    network_devices: NetwordDeviceConfigs,
    ...
}

The current configuration files do not derive Serde deserialize at the moment, so this also needs to be changed.

CC: @firecracker-microvm/compute-capsule please let me know if you have any feedback/other comments.

alexandruag commented 5 years ago

Hi! I like the STDIN configuration best. We can add a new parameter which represents that configuration data should be read from stdin. I guess we'll have to extend this over to the jailer as well.

Regarding Running Firecracker without the API I'd go with the second option for now. It looks difficult to cleanly separate the API server from the Vmm struct via conditional compilation until we do some pretty significant refactoring.

I like the bit about having separate API and VMM structures (at least where they wouldn't overlap perfectly), and it's something we should pursue. However, for the time being, is there anything wrong with using a composition of the current API structures?

andreeaflorescu commented 5 years ago

Hi! I like the STDIN configuration best. We can add a new parameter which represents that configuration data should be read from stdin. I guess we'll have to extend this over to the jailer as well.

Do you mean that we should only read the configuration from stdin when that flag is present? Or that we should support both file and stdin? I don't want to have them both honestly as it complicates the code for no good reason. I think we should stick to one (either file or stdin).

Regarding Running Firecracker without the API I'd go with the second option for now. It looks difficult to cleanly separate the API server from the Vmm struct via conditional compilation until we do some pretty significant refactoring.

Why do you say that? vmm does not depend on api_server.

I like the bit about having separate API and VMM structures (at least where they wouldn't overlap perfectly), and it's something we should pursue. However, for the time being, is there anything wrong with using a composition of the current API structures?

I think the problem is adding more technical dept, which I would very much like to avoid. To workaround this issue I think we can prioritize the aforementioned refactoring and block this issue till that is done.

wkozaczuk commented 5 years ago

@andreeaflorescu Thanks for the very nice write-up. It really helps.

I am definitely in favor of "Configuration via STDIN" - most elegant and flexible. As far as "Running Firecracker without API", I am in favor of "Conditional compilation", mostly because of the smaller executable and fewer dependencies.

Also, it feels to me that somehow "Passing the configuration" and "Running Firecracker without API" deserve to be separate issues (though very related), no?

Also, I was wondering if the potential implementation should be part of a single PR or many smaller ones? If many, in what order/way we should divide it to allow for best coding and reviewing experience?

andreeaflorescu commented 5 years ago

@wkozaczuk I agree with you and running Firecracker without the API should be a separate issue. We can create a few more issues related to this one and "convert" this one to a story (aka we can track the progress of all other issues in this one).

If we have multiple PRs is far more easier to review and the chance of getting them merged fast is higher. I will think about a split that actually makes sense. But before that we should decide if we want to include the refactoring related to separate structures for VMM and API. I am inclined to say yes as I am afraid that otherwise the code will be a bit spaghetti and parts of it will be thrown away later.

CC: @firecracker-microvm/compute-capsule please take a look at the suggestions.

acatangiu commented 5 years ago

Configuration via STDIN (using same structs/api as when configuring through unix-socket) is my preferred choice as well.

Regarding the api-server when running with stdin input I am leaning towards option 2 of simply not initializing/starting the api thread. Conditional compilation brings in extra complexity on the testing side and less versatility for customers. I don't see the argument of smaller attack surface as particularly strong in this case.

As for separating resources structs into descriptor structs and functional structs, it would be an improvement, but I don't believe it necessary. Would be great if the person tackling this story would also tackle this particular topic, but I wouldn't make it a requirement.

alxiord commented 5 years ago
andreeaflorescu commented 5 years ago

It looks like we have an agreement regarding file vs stdin and we want to move forward with stdin. As this is not necessarily related to the other issues, I propose tracking them in separate issues and resolving them in separate PRs.

For this purpose I've opened two issues:

I think we should postpone the decision of using conditional compilation vs soft disable for now. Our API server is currently rather complex and brings in a lot of dependencies (last time I checked there were ~70 transient dependencies). It would be interesting to know how does the complexity of the api_server translates into memory and build time overhead. But, since we are trying to implement our custom api_server and thus removing hyper and tokio, I would say to take a decision after we start working on #615.

petreeftime commented 5 years ago

My concern about the STDIN variant is that parsing JSON out of a stream is generally error-prone: if the document is badly formatted you might not be able to tell easily (eg. missing " or }).

Also, from a usability standpoint:

Is there are any reason not to support both files and STDIN? This shouldn't be very complex to implement.

alexandruag commented 5 years ago

Hi everyone! In light of @petreeftime's comment, and thinking a bit more about the different implications of STDIN vs other ways of passing parameters, how about actually using a single command line parameter to pass the entire JSON content (for example ./firecracker --startup-config 'ACTUAL_JSON_BODY'). It seems easy to support having the actual JSON config in a file with this approach as well.

We mentioned using command line parameters for configuration passing, but I think we were still reminiscing about the old way of having different parameters for different VM configuration options (such as vCPU number, memory size, etc.), and not simply using the value of a single parameter to hold the entire JSON config. Also, one problem of the STDIN approach that I totally forgot about initially is that the Firecracker process created by the jailer has STDIN bound to /dev/null whenever the --daemonize flag is present, and we'd have to go around that somehow.

dianpopa commented 4 years ago

Solved by #1212.