OSv hangs when booting on XEN

cloudius-systems / osv

OSv, a new operating system for the cloud.

osv.io

Other

4.1k stars 602 forks source link

OSv hangs when booting on XEN #345

Closed tgrabiec closed 2 months ago

tgrabiec commented 10 years ago

When I start the guest with workaround for #344 applied, it hangs after printing the version line:

$ sudo ./scripts/run.py -nv -c 1 -e '/tools/netserver.so -4 -f -D'
Parsing config from /tmp/tmpB_XNBb
OSv v0.09-171-g8ea03fd

The main thread is blocked by blkfront setup:

gdb$ bt
#0  sched::thread::switch_to (this=0xffff80003eec3030, this@entry=0xffff80003e5f8030) at /home/tgrabiec/src/osv/arch/x64/arch-switch.hh:73
#1  0x00000000003b6328 in sched::cpu::reschedule_from_interrupt (this=0xffff80003f4f7030) at /home/tgrabiec/src/osv/core/sched.cc:270
#2  0x00000000003b6b23 in sched::cpu::schedule () at /home/tgrabiec/src/osv/core/sched.cc:193
#3  0x00000000003b6c16 in sched::thread::wait (this=this@entry=0xffff80003e0ba040) at /home/tgrabiec/src/osv/core/sched.cc:819
#4  0x000000000039a407 in do_wait_until<sched::noninterruptible, sched::thread::dummy_lock, waiter::wait(sched::timer*) const::__lambda4> (pred=..., mtx=<synthetic pointer>) at /home/tgrabiec/src/osv/include/osv/sched.hh:880
#5  wait_until<waiter::wait(sched::timer*) const::__lambda4> (pred=...) at /home/tgrabiec/src/osv/include/osv/sched.hh:891
#6  wait (tmr=0x2000001ffd10, this=0x2000001ffcc0) at /home/tgrabiec/src/osv/include/osv/wait_record.hh:47
#7  condvar::wait (this=this@entry=0xffffa0003e65f968, user_mutex=user_mutex@entry=0xffffa0003e65f920, tmr=tmr@entry=0x2000001ffd10) at /home/tgrabiec/src/osv/core/condvar.cc:43
#8  0x000000000035dc89 in wait<long, std::ratio<1l, 1000l> > (duration=..., user_mutex=0xffffa0003e65f920, this=0xffffa0003e65f968) at /home/tgrabiec/src/osv/include/osv/condvar.h:151
#9  xenfront::xenbus::wait_for_devices (this=this@entry=0xffffa0003e65f800) at /home/tgrabiec/src/osv/drivers/xenfront-xenbus.cc:87
#10 0x000000000035df7a in xenfront::xenbus::xenbus (this=0xffffa0003e65f800, pci_dev=...) at /home/tgrabiec/src/osv/drivers/xenfront-xenbus.cc:79
#11 0x000000000035e12c in xenfront::xenbus::probe (dev=<optimized out>) at /home/tgrabiec/src/osv/drivers/xenfront-xenbus.cc:131
#12 0x0000000000348b3d in operator() (__args#0=0xffffa0003e64f700, this=0x2000001ffe20) at /home/tgrabiec/src/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/functional:2464
#13 operator() (dev=0xffffa0003e64f700, __closure=0xffffa0003e666760) at /home/tgrabiec/src/osv/drivers/driver.cc:40
#14 std::_Function_handler<void(hw::hw_device*), hw::driver_manager::load_all()::__lambda0>::_M_invoke(const std::_Any_data &, hw::hw_device *) (__functor=..., __args#0=0xffffa0003e64f700) at /home/tgrabiec/src/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/functional:2071
#15 0x0000000000346da4 in operator() (__args#0=<optimized out>, this=0x2000001ffea0) at /home/tgrabiec/src/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/functional:2464
#16 hw::device_manager::for_each_device(std::function<void (hw::hw_device*)>) (this=this@entry=0xffffa0003e6649c0, func=...) at /home/tgrabiec/src/osv/drivers/device.cc:67
#17 0x00000000003488bf in hw::driver_manager::load_all (this=this@entry=0xffffa0003e664e40) at /home/tgrabiec/src/osv/drivers/driver.cc:45
#18 0x0000000000367a5d in arch_init_drivers () at /home/tgrabiec/src/osv/arch/x64/arch-setup.cc:264
#19 0x0000000000210a99 in do_main_thread (_commands=0xffff80003f346f80) at /home/tgrabiec/src/osv/loader.cc:317
#20 0x0000000000417f36 in operator() (__closure=0xffffa0003f46ae00) at /home/tgrabiec/src/osv/libc/pthread.cc:75
#21 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::__lambda2>::_M_invoke(const std::_Any_data &) (__functor=...) at /home/tgrabiec/src/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/functional:2071
#22 0x00000000003b725b in main (this=0xffff80003e0ba040) at /home/tgrabiec/src/osv/core/sched.cc:813
#23 sched::thread_main_c (t=0xffff80003e0ba040) at /home/tgrabiec/src/osv/arch/x64/arch-switch.hh:137
#24 0x000000000036c506 in thread_main () at /home/tgrabiec/src/osv/arch/x64/entry.S:113

wkozaczuk commented 5 years ago

I wonder if that is still the case. I have never run OSv on Xen but given we had recently a paper measuring performance of many microservices apps - https://biblio.ugent.be/publication/8582433/file/8582438.pdf - OSv can properly run on Xen. Can someone verify it?

yfliuu commented 4 years ago

@wkozaczuk Hi! This problem happens to me too. I was using Xen 4.9 on Ubuntu 18.04. OSv was blocked on the blkfront_connect function (more precisely, at the first call to bread in vfs_bio.cc. Seems that it was put to sleep and never wake up). My guess is that the xenplatform_pci driver was somehow outdated because when I disable the xenplatform_pci driver (by commenting out the xenplatform_pci line in arch_init_drivers:arch-setup.cc) everything works fine. By the way the OSv version number is: OSv v0.55.0-3-gecba01aa. Any suggestions?

wkozaczuk commented 4 years ago

Hi @yfliuu,

Unfortunately, I do not have much experience with XEN. Besides trying to run it locally on my laptop with Dom0 being Ubuntu (like you are), I used to play with OSv and managed to deploy to EC2 xen instances fine without any problems at least a year ago or so (it would be nice to verify if it still works).

But I guess this type of XEN setup is different with potentially slightly different para-virtual drivers. I think this is what I saw when compared boot messages between AWS xen and local xen. But I think you might have just nailed a problem: I wonder if we have a bug where we mis-detect a device - xenplatform_pci and think it exists when in reality it does not. Or we have a bug in xenplatform_pci. I am guessing AWS xen does not have pci and that is why it all works. Most likely a bug in xenplatform_pci

It would be nice to figure it out. I am not sure when I will have time to look at it. We welcome patches:-)

BTW the author this paper not that long time ago tested OSv quite extensively on XEN - https://biblio.ugent.be/publication/8582433/file/8582438. He seems to have been running his tests on XenServer 7.5. But probably different setup than mine and yours.

wkozaczuk commented 4 years ago

There is this part extracting XEN PCI support - https://github.com/cloudius-systems/osv/commit/2d9f0e49d785505702df7db6a62e91159e820460. But I doubt it is a one that broke anything as it is dated 3 years after this issue was created.

kevinross commented 3 months ago

reproduced this in my environment:

using tree from commit 0288cbe3da5ed119550e6bdba7aba30c1d63ce8d
hypervisor host: xen 4.17.3 via XCP-NG 8.3 (latest beta)
build host: fedora 39 (I had to make a few changes to cflags to bypass some warnings-as-errors, I can include the diff if wanted. I think this is just due to default flags embedded in gcc et al but I could be wrong)
disabling the xenplatform-pci driver init entirely gets OSv to boot up until it tries to mount an FS, then fails since there aren't any (both rofs and zfs failed here in slightly different but effectively equivalent ways: no block devices at all. Haven't tried ramfs though)
doing conf_drivers_xen=0 gets my application booted... but fails to discover any network devices, so I have no way to access it (unsurprising given I don't see any non-pv drivers in the repo, particularly the common intel+realtek ones we can find provided by most hypervisors for guests that don't have PV drivers, so there isn't a fallback ide/scsi driver like for blockdevs -> no usable network ifaces)

build command: ./scripts/build JAVA_VERSION=21 image=openjdk-zulu-9-and-above,myapp fs=zfs fs_size_mb=512 conf_drivers_xen=0 conf_drivers_hyperv=0 conf_drivers_ena=0 conf_drivers_vmxnet3=0 conf_drivers_pvscsi=0

for every confdrivers* arg after xen above, except pvscsi, I also tried a build with each enabled, same result (no networking)

myapp is a copy of the spring example app's build-related files but with my own uberjar subbed in. It does boot all the way to where my app needs networking, at which point it exits because it needs networking.

I'm happy to help test changes (in fact, I'm going to rebase on top of the changes that were recently pushed 20 mins ago and see if that helps) but I'm quite far from OS dev in terms of knowledge base so won't be much help there.

edit: the new blkio changes didn't resolve the issue

wkozaczuk commented 3 months ago

Unfortunately, I have almost not XEN experience. Also I do not have a machine with XEN host handy.

Can you run simplest app - 'Hello World' in C (image=native-example) and ROfS (fs=rofs) with one 1 cpcu and once it hangs, connect with gdb (see https://github.com/cloudius-systems/osv/wiki/Debugging-OSv) and then run:

connect
osv syms
osv thread apply all where

and send me an output.

Also there is number of XEN-related patches from Spirent (see https://github.com/cloudius-systems/osv/issues/1197) that have never been merged upstream which possibly fix the issue you are encountering.

kevinross commented 3 months ago

Unfortunately, I have almost not XEN experience. Also I do not have a machine with XEN host handy.

Can you run simplest app - 'Hello World' in C (image=native-example) and ROfS (fs=rofs) with one 1 cpcu and once it hangs, connect with gdb (see https://github.com/cloudius-systems/osv/wiki/Debugging-OSv) and then run:
connect
osv syms
osv thread apply all where
and send me an output.

Output is here. Used image=native-example fs=rofs mode=debug, I can repeat using the release build if you'd like.

Also there is number of XEN-related patches from Spirent (see #1197) that have never been merged upstream which possibly fix the issue you are encountering.

I'll give these a try later this week, thanks!

kevinross commented 3 months ago

After a (somewhat rough) merge of those xen-related patches, output is here. Same result, except now I have no output in the console.

wkozaczuk commented 2 months ago

These two threads seem to be key:

thread 0x400001438040 xenwatch

#0  sched::thread::switch_to (this=this@entry=0x400001012040) at arch/x64/arch-switch.hh:125
#1  0x00000000403632b7 in sched::cpu::reschedule_from_interrupt (this=0x400000ecf040, called_from_yield=called_from_yield@entry=false,
    preempt_after=preempt_after@entry=std::chrono::duration = { 5000000ns }) at core/sched.cc:408
#2  0x000000004036332d in sched::cpu::schedule () at include/osv/sched.hh:1512
#3  0x000000004036390f in sched::thread::wait (this=this@entry=0x400001438040) at core/sched.cc:1421
#4  0x0000000040389b1d in sched::thread::do_wait_for<lockfree::mutex, sched::wait_object<waitqueue> > (mtx=...) at include/osv/sched.hh:1384
#5  0x0000000040389bb2 in sched::thread::wait_for<waitqueue&> (mtx=...) at include/osv/sched.hh:1400
#6  0x00000000403896a9 in waitqueue::wait (this=this@entry=0x6000013d7ba0, mtx=...) at core/waitqueue.cc:56
#7  0x00000000403aa3a8 in bio_wait (bio=0x6000013d7b00) at fs/vfs/kern_physio.cc:43
#8  0x00000000403aabbc in rw_buf (bp=bp@entry=0x40a0ab60 <buf_table>, rw=rw@entry=0) at fs/vfs/vfs_bio.cc:133
#9  0x00000000403ab1ce in bread (dev=<optimized out>, blkno=<optimized out>, bpp=0x400001441e08) at fs/vfs/vfs_bio.cc:237
#10 0x00000000403b20bc in maybe_update_partition_table (dev=dev@entry=0x6000013b8400) at fs/devfs/device.cc:127
#11 0x00000000403b257d in read_partition_table (dev=0x6000013b8400) at fs/devfs/device.cc:204
#12 0x00000000402d9ba8 in xenfront::xenfront_driver::finished (this=this@entry=0x6000013b8400) at drivers/xenfront.cc:105
#13 0x00000000402da683 in xenbus_set_state (_dev=0x6000013b8400, state=XenbusStateConnected) at drivers/xenfront-xenbus.cc:188
#14 0x000000004027cf4e in xbd_connect (sc=sc@entry=0x600001195c00) at bsd/sys/dev/xen/blkfront/blkfront.cc:1677
#15 0x000000004027d04a in xbd_backend_changed (dev=<optimized out>, backend_state=<optimized out>) at bsd/sys/dev/xen/blkfront/blkfront.cc:1856
#16 0x00000000402d9b2e in xenfront::xenfront_driver::otherend_changed (this=<optimized out>, state=<optimized out>) at drivers/xenfront.cc:30
#17 0x00000000402da3f8 in XENBUSB_OTHEREND_CHANGED (_bus=<optimized out>, _dev=<optimized out>, newstate=<optimized out>)
    at drivers/xenfront-xenbus.cc:207
#18 0x0000000040275031 in xenbusb_otherend_watch_cb (watch=<optimized out>, vec=<optimized out>, vec_size=<optimized out>)
    at bsd/sys/xen/xenbus/xenbusb.cc:152
#19 0x000000004027339d in xenwatch_thread (unused=<optimized out>) at bsd/sys/xen/xenstore/xenstore.cc:1068
#20 0x000000004022f193 in operator() (__closure=<optimized out>) at bsd/porting/kthread.cc:52
#21 std::__invoke_impl<void, kproc_create(void (*)(void*), void*, proc**, int, int, char const*, ...)::<lambda()>&> (__f=...)
    at /usr/include/c++/13/bits/invoke.h:61
#22 std::__invoke_r<void, kproc_create(void (*)(void*), void*, proc**, int, int, char const*, ...)::<lambda()>&>(struct {...} &) (__fn=...)
    at /usr/include/c++/13/bits/invoke.h:150
#23 0x000000004022f19e in std::_Function_handler<void(), kproc_create(void (*)(void*), void*, proc**, int, int, char const*, ...)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/13/bits/std_function.h:86
#24 0x0000000040207eca in std::function<void ()>::operator()() const (this=this@entry=0x400001438070) at /usr/include/c++/13/bits/std_function.h:591
#25 0x0000000040361e09 in sched::thread::main (this=this@entry=0x400001438040) at core/sched.cc:1415
#26 0x0000000040365204 in sched::thread_main_c (t=0x400001438040) at arch/x64/arch-switch.hh:377
#27 0x00000000402dfbf2 in thread_main () at arch/x64/entry.S:161

thread 0x400001443040 xenstore_rcv

#0  sched::thread::switch_to (this=this@entry=0x400001012040) at arch/x64/arch-switch.hh:125
#1  0x00000000403632b7 in sched::cpu::reschedule_from_interrupt (this=0x400000ecf040, called_from_yield=called_from_yield@entry=false,
    preempt_after=preempt_after@entry=std::chrono::duration = { 5000000ns }) at core/sched.cc:408
#2  0x000000004036332d in sched::cpu::schedule () at include/osv/sched.hh:1512
#3  0x000000004036390f in sched::thread::wait (this=this@entry=0x400001443040) at core/sched.cc:1421
#4  0x000000004022dd93 in sched::thread::do_wait_until<sched::interruptible, sched::thread::dummy_lock, synch_port::_msleep(void*, mtx*, int, char const*, int)::<lambda()> >(sched::thread::dummy_lock &, struct {...}) (mtx=..., pred=...) at include/osv/sched.hh:1246
#5  0x000000004022ea33 in sched::thread::wait_until_interruptible<synch_port::_msleep(void*, mtx*, int, char const*, int)::<lambda()> > (pred=...)
    at include/osv/sched.hh:1276
#6  synch_port::_msleep (this=0x60000105ad00, chan=<optimized out>, chan@entry=0x4000feffc000, mtx=mtx@entry=0x409c4168 <xs+8>,
    priority=priority@entry=768, wmesg=wmesg@entry=0x40628cc2 "xbread", timo_hz=<optimized out>, timo_hz@entry=0) at bsd/porting/synch.cc:102
#7  0x000000004022eb95 in _msleep (chan=0x4000feffc000, mtx=0x409c4168 <xs+8>, priority=768, wmesg=0x40628cc2 "xbread", timo=0)
    at bsd/porting/synch.cc:176
#8  0x0000000040272d52 in xs_read_store (tdata=tdata@entry=0x60000168c0d0, len=len@entry=16) at bsd/sys/xen/xenstore/xenstore.cc:591
#9  0x000000004027319c in xs_process_msg (type=type@entry=0x40000144cfbc) at bsd/sys/xen/xenstore/xenstore.cc:656
#10 0x00000000402732e6 in xs_rcv_thread (arg=<optimized out>) at bsd/sys/xen/xenstore/xenstore.cc:713
#11 0x000000004022f193 in operator() (__closure=<optimized out>) at bsd/porting/kthread.cc:52
#12 std::__invoke_impl<void, kproc_create(void (*)(void*), void*, proc**, int, int, char const*, ...)::<lambda()>&> (__f=...)
    at /usr/include/c++/13/bits/invoke.h:61
#13 std::__invoke_r<void, kproc_create(void (*)(void*), void*, proc**, int, int, char const*, ...)::<lambda()>&>(struct {...} &) (__fn=...)
    at /usr/include/c++/13/bits/invoke.h:150
#14 0x000000004022f19e in std::_Function_handler<void(), kproc_create(void (*)(void*), void*, proc**, int, int, char const*, ...)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/13/bits/std_function.h:86
#15 0x0000000040207eca in std::function<void ()>::operator()() const (this=this@entry=0x400001443070) at /usr/include/c++/13/bits/std_function.h:591
#16 0x0000000040361e09 in sched::thread::main (this=this@entry=0x400001443040) at core/sched.cc:1415
#17 0x0000000040365204 in sched::thread_main_c (t=0x400001443040) at arch/x64/arch-switch.hh:377
#18 0x00000000402dfbf2 in thread_main () at arch/x64/entry.S:161

It looks like the thread xenwatch is stuck waiting for a read from a block device to finish and xenwatch is waiting to gen an event (probably via an interrupt) and not getting it.

I am afraid I need to setup a xen machine to reproduce this issue.

Meanwhile, can you also send a dump of detailed qemu command (ps -ef | grep qemu) as OSv hangs. Could you also run it with verbose output (add -V to run.py) and send all the output you see?

wkozaczuk commented 2 months ago

Also, as I mentioned I am not an XEN expert, and the last time I used it was 4-5 years ago with Ubuntu. I presume you are running Xen on Fedora 39 as a host so I guess I can follow these setup instructions - https://docs.fedoraproject.org/en-US/quick-docs/virtualization-getting-started/#_xen and https://wiki.xenproject.org/wiki/Fedora_Host_Installation?

kevinross commented 2 months ago

@wkozaczuk

also send a dump of detailed qemu command ( ps -ef | grep qemu ) as OSv hangs. Could you also run it with verbose output

When I get back home tonight!

kevinross commented 2 months ago

@wkozaczuk I'm not actually using fedora for the host, only for the build. My host is XCP-NG 8.3, using the latest patches, and with xen 4.17.4, Linux 4.19.19 (I'm not 100% on the 19.19, I am 100% on it being 19.x and pretty sure that x is 19).

My build system is fedora 39, latest packages installed, nothing special about it other than that it's not dom0.

kevinross commented 2 months ago

Logs from the serial console (over qemu's telnet). Seems like the console driver no longer works though (I had thought it was my hack job pulling in those xen patches that had broken the console, apparently not). I'll see if I can find the commit that broke the console this weekend (it worked at one point IIRC).

EDIT: pulled latest main, no change

wkozaczuk commented 2 months ago

I have configured a machine with Fedora 39 and installed Xen on it. That way I can run Xen with fedora on Dom0:

sudo xl info
host                   : localhost.localdomain
release                : 6.5.6-300.fc39.x86_64
version                : #1 SMP PREEMPT_DYNAMIC Fri Oct  6 19:57:21 UTC 2023
machine                : x86_64
nr_cpus                : 8
max_cpu_id             : 7
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 2
cpu_mhz                : 3491.913
hw_caps                : bfebfbff:76faf3ff:2c100800:00000021:00000001:00002fbb:00000000:00000100
virt_caps              : pv hvm hvm_directio pv_directio hap shadow iommu_hap_pt_share gnttab-v1 gnttab-v2
total_memory           : 32520
free_memory            : 2061
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 17
xen_extra              : .4
xen_version            : 4.17.4
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_scheduler          : credit2
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : 
xen_commandline        : placeholder
cc_compiler            : gcc (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7)
cc_compile_by          : mockbuild
cc_compile_domain      : 
cc_compile_date        : Tue Apr  9 22:59:09 UTC 2024
build_id               : f1d670378a5c378c8d215912cce29c946e68f086
xend_config_format     : 4

sudo xl list
Name                                        ID   Mem VCPUs  State   Time(s)
Domain-0                                     0 30097     8     r-----     633.0

So I can build OSv on the same Fedora version. However, when I try to run OSv I do not see any output (probably Vnc is misconfigured) but also xl utility seems to get stuck trying to create new domain and new qemu process until I kill it. I am not able to connect with gdb nor can I connect with telnet (which port are you using?). In my case, OSv probably crashes earlier and as a result the xl utility keeps trying to restart it.

Here are the xl and qemu commands I grabbed with ps when they were running (just to compare to what you can see) and content of the xl config file:

sudo ./scripts/run.py -V -c 1 -p xen
arg: builder='hvm'
arg: xen_platform_pci=1
arg: acpi=1
arg: apic=1
arg: boot='c'
arg: vncdisplay=1
arg: memory=2048
arg: vcpus=1
arg: maxcpus=1
arg: name='osv-16528'
arg: disk=['/home/wkozaczuk/projects/osv/build/last/usr.img,qcow2,hda,rw']
arg: serial='pty'
arg: paused=0
arg: on_crash='preserve'
Parsing config from /tmp/tmpo454v6vw
Before closing xenfile

sudo xl list
Name                                        ID   Mem VCPUs  State   Time(s)
Domain-0                                     0 30097     8     r-----     659.2
osv-16528                                  200  2048     1     -b----       0.0

ps -ef | grep qemu
root       16639       1 12 14:08 ?        00:00:00 /usr/bin/qemu-system-i386 -xen-domid 201 -no-shutdown
-chardev socket,id=libxl-cmd,path=/var/run/xen/qmp-libxl-201,server=on,wait=off 
-mon chardev=libxl-cmd,mode=control -chardev socket,id=libxenstat-cmd,path=/var/run/xen/qmp-libxenstat-201,server=on,wait=off 
-mon chardev=libxenstat-cmd,mode=control -nodefaults 
-no-user-config -name osv-16528 -vnc 127.0.0.1:1,to=99 
-display none -serial pty -device cirrus-vga,vgamem_mb=8 
-boot order=c -net none -machine xenfv,suppress-vmdesc=on -m 2040 
-drive file=/home/wkozaczuk/projects/osv/build/last/usr.img,if=ide,index=0,media=disk,format=qcow2,cache=writeback

Also, the changes I have made to run.py:

diff --git a/scripts/run.py b/scripts/run.py
index 2dfa532c..33fff946 100755
--- a/scripts/run.py
+++ b/scripts/run.py
@@ -387,10 +387,11 @@ def start_osv_xen(options):
         "vcpus=%s" % (options.vcpus),
         "maxcpus=%s" % (options.vcpus),
         "name='osv-%d'" % (os.getpid()),
-        "disk=['/dev/loop%s,raw,hda,rw']" % os.getpid(),
+        "disk=['%s,qcow2,hda,rw']" % (options.image_file),
         "serial='pty'",
         "paused=0",
         "on_crash='preserve'"
+       # "device_model_override='/usr/bin/qemu-system-x86_64'"
     ]

     if options.networking:
@@ -406,13 +407,15 @@ def start_osv_xen(options):
     xenfile = tempfile.NamedTemporaryFile(mode="w")
     xenfile.writelines("%s\n" % item for item in args)
     xenfile.flush()
+    for item in args:
+        print("arg: %s" % item)

     try:
         # Save the current settings of the stty
         stty_save()

         #create a loop device backed by image file
-        subprocess.call(["losetup", "/dev/loop%s" % os.getpid(), options.image_file])
+        #subprocess.call(["losetup", "/dev/loop%s" % os.getpid(), options.image_file])
         # Launch qemu
         cmdline = ["xl", "create"]
         if not options.detach:
@@ -425,9 +428,10 @@ def start_osv_xen(options):
     except:
         pass
     finally:
+        print("Before closing xenfile")
         xenfile.close()
         xenfile.close()
         #delete loop device
-        subprocess.call(["losetup", "-d", "/dev/loop%s" % os.getpid()])
+        #subprocess.call(["losetup", "-d", "/dev/loop%s" % os.getpid()])
         cleanups()

 def start_osv_vmware(options):

I wonder what is different besides you using XCP-NG.

wkozaczuk commented 2 months ago

Adding -v to the xl command gives extra info:

arg: builder='hvm'
arg: xen_platform_pci=1
arg: acpi=1
arg: apic=1
arg: boot='c'
arg: vncdisplay=1
arg: memory=2048
arg: vcpus=1
arg: maxcpus=1
arg: name='osv-19713'
arg: disk=['/home/wkozaczuk/projects/osv/build/last/usr.img,qcow2,hda,rw']
arg: serial='pty'
arg: paused=0
arg: on_crash='preserve'
Parsing config from /tmp/tmp11pirahl
libxl: detail: libxl_create.c:660:libxl__domain_make: passthrough: disabled
domainbuilder: detail: xc_dom_allocate: cmdline="", features=""
domainbuilder: detail: xc_dom_kernel_file: filename="/usr/libexec/xen/boot/hvmloader"
domainbuilder: detail: xc_dom_malloc_filemap    : 174 kB
domainbuilder: detail: xc_dom_boot_xen_init: ver 4.17, caps xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
domainbuilder: detail: xc_dom_parse_image: called
domainbuilder: detail: xc_dom_find_loader: trying multiboot-binary loader ... 
domainbuilder: detail: loader probe failed
domainbuilder: detail: xc_dom_find_loader: trying HVM-generic loader ... 
domainbuilder: detail: loader probe OK
xc: detail: ELF: phdr: paddr=0x100000 memsz=0x33e20
xc: detail: ELF: memory: 0x100000 -> 0x133e20
domainbuilder: detail: xc_dom_compat_check: supported guest type: xen-3.0-x86_64
domainbuilder: detail: xc_dom_compat_check: supported guest type: xen-3.0-x86_32p
domainbuilder: detail: xc_dom_compat_check: supported guest type: hvm-3.0-x86_32 <= matches
domainbuilder: detail: xc_dom_compat_check: supported guest type: hvm-3.0-x86_32p
domainbuilder: detail: xc_dom_compat_check: supported guest type: hvm-3.0-x86_64
domainbuilder: detail: xc_dom_mem_init: mem 2040 MB, pages 0x7f800 pages, 4k each
domainbuilder: detail: xc_dom_mem_init: 0x7f800 pages
domainbuilder: detail: xc_dom_boot_mem_init: called
domainbuilder: detail: range: start=0x0 end=0x7f800000
xc: detail: PHYSICAL MEMORY ALLOCATION:
xc: detail:   4KB PAGES: 0x0000000000001400
xc: detail:   2MB PAGES: 0x00000000000003f2
xc: detail:   1GB PAGES: 0x0000000000000000
domainbuilder: detail: xc_dom_build_image: called
domainbuilder: detail: xc_dom_pfn_to_ptr_retcount: domU mapping: pfn 0x100+0x34 at 0x7f07c18b5000
domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x100000 -> 0x134000  (pfn 0x100 + 0x34 pages)
xc: detail: ELF: phdr 0 at 0x7f07c1881000 -> 0x7f07c18ab3a0
domainbuilder: detail: xc_dom_pfn_to_ptr_retcount: domU mapping: pfn 0x134+0x40 at 0x7f07c1875000
domainbuilder: detail: xc_dom_alloc_segment:   System Firmware module : 0x134000 -> 0x174000  (pfn 0x134 + 0x40 pages)
domainbuilder: detail: xc_dom_pfn_to_ptr_retcount: domU mapping: pfn 0x174+0x1 at 0x7f07c1efc000
domainbuilder: detail: xc_dom_alloc_segment:   HVM start info : 0x174000 -> 0x175000  (pfn 0x174 + 0x1 pages)
domainbuilder: detail: xc_dom_build_image  : virt_alloc_end : 0x175000
domainbuilder: detail: xc_dom_build_image  : virt_pgtab_end : 0x0
domainbuilder: detail: xc_dom_boot_image: called
domainbuilder: detail: domain builder memory footprint
domainbuilder: detail:    allocated
domainbuilder: detail:       malloc             : 5904 bytes
domainbuilder: detail:       anon mmap          : 0 bytes
domainbuilder: detail:    mapped
domainbuilder: detail:       file mmap          : 174 kB
domainbuilder: detail:       domU mmap          : 468 kB
domainbuilder: detail: vcpu_hvm: called
domainbuilder: detail: xc_dom_set_gnttab_entry: d294 gnt[0] -> d0 0xfefff
domainbuilder: detail: xc_dom_set_gnttab_entry: d294 gnt[1] -> d0 0xfeffc
domainbuilder: detail: xc_dom_release: called
Before closing xenfile

kevinross commented 2 months ago

My telnet will be your xl console, from the param you passed for the serial port, so you should be able to get the console logs from that. I'd set it to pause on boot though, so you have a chance to attach to the console and not miss anything.

As for gdb, this is a bigger difference between XCP-NG and fedora: the gdb remote (gdbsx) came pre-installed on XCP, so I never had to look for the package. You might find it in one of the xen packages, particularly the dev and dbginfo ones?

Once you have gdbsx, it's fairly straightforward: run gdbsx with the appropriate flags (the help has examples) and listening on a free port, then on the build system, use the guide from the wiki for gdb, but use connect $gdbsx_host:$gdbsx_port instead of plain connect. I haven't tested every extra osv adds to gdb though , so YMMV (I was surprised that it worked out-of-the-box with gdbsx in the first place!).

wkozaczuk commented 2 months ago

OK. I have made some progress. It looks like the resulting usr.img on my Fedora 39 is not bootable even with KVM. Before I figure this out, I simply used another usr.img from my other build machine and now I can boot on both KVM and XEN. Specifically, on XEN I see the console output and it hangs at the same point as in your case:

Parsing config from /tmp/tmppo2gh6ab
OSv v0.57.0-171-g3c35189ec
1 CPUs detected
Firmware vendor: Xen
bsd: initializing - done
VFS: mounting ramfs at /
VFS: mounting devfs at /dev
net: initializing - done
vga: Add VGA device instance
Back-end specified ring-pages of 16 limited to front-end limit of 15.
Back-end specified ring-pages of 15 is not a power of 2. Limited to 8.
Back-end specified max_requests of 512 limited to front-end limit of 256.
5MB <Virtual Block Device> at device/vbd/768attaching as vblk0

I am still figuring out the gdbsx - in my case it fails like so (no idea yeat why):

sudo gdbsx -a 2 64 9999
ERROR:_check_hyp:Hyp is NOT enabled for gdbsx
ERROR: failed to attach to domain:2 errno:38

kevinross commented 2 months ago

For that gdbsx error, I suspect that you'll have to either build your own xen kernel, or use XCP. Easier option is probably to rebuild the distro rpm if it doesn't enable this hypercall by default (probably under an "enable gdbsx remote debug server" or something like that in the menuconfig).

wkozaczuk commented 2 months ago

I have made some progress. It seems the issue can be narrowed down to the blkfront driver or how it is configured with run.py. I have found an email exchange from 2016 that describes the exact same problem - https://groups.google.com/g/osv-dev/c/qXK3eIP9NAw/m/1ItfPqLCEgAJ.

I have not gotten to the bottom of it except to confirm that indeed it looks like OSv never receives relevant interrupt right after trying to read 1st block in read_partition_table() and stays stuck. But I have found a workaround based on that email exchange to boot OSv with ramfs image and the netfront part seems to be functioning properly. This seems to indicate that OSv works on XEN in general and the problem lies somewhere in the blkfront driver or its setup.

Here is what works after applying this patch:

diff --git a/drivers/xenfront.cc b/drivers/xenfront.cc
index 3edeed7b1..afd3d68bb 100644
--- a/drivers/xenfront.cc
+++ b/drivers/xenfront.cc
@@ -101,8 +101,8 @@ void xenfront_driver::set_ivars(struct xenbus_device_ivars *ivars)

 void xenfront_driver::finished()
 {
-    if (_irq_type == INTR_TYPE_BIO)
-        read_partition_table(&this->_bsd_dev);
+    //if (_irq_type == INTR_TYPE_BIO)
+    //    read_partition_table(&this->_bsd_dev);
     _bus->remove_pending(this);
 }

and this change to run.py to use the raw image (just convert using qemu-img):

diff --git a/scripts/run.py b/scripts/run.py
index 2dfa532c..6e4f75ce 100755
--- a/scripts/run.py
+++ b/scripts/run.py
@@ -351,7 +351,7 @@ def start_osv_xen(options):
             "boot='c'",
         ]
     else:
-        args = ["kernel='%s/build/%s/loader.elf'" % (os.getcwd(), options.opt_path)]
+        args = ["kernel='%s/loader-stripped.elf'" % (os.getcwd())]

     try:
         memory = int(options.memsize)
@@ -387,10 +387,14 @@ def start_osv_xen(options):
         "vcpus=%s" % (options.vcpus),
         "maxcpus=%s" % (options.vcpus),
         "name='osv-%d'" % (os.getpid()),
-        "disk=['/dev/loop%s,raw,hda,rw']" % os.getpid(),
+        "disk=['%s,raw,hda,rw']" % ('/home/wkozaczuk/projects/osv/usr.raw'),
         "serial='pty'",
         "paused=0",
         "on_crash='preserve'"
     ]

     if options.networking:
@@ -406,15 +410,18 @@ def start_osv_xen(options):
     xenfile = tempfile.NamedTemporaryFile(mode="w")
     xenfile.writelines("%s\n" % item for item in args)
     xenfile.flush()

     try:
         # Save the current settings of the stty
         stty_save()

         #create a loop device backed by image file
-        subprocess.call(["losetup", "/dev/loop%s" % os.getpid(), options.image_file])
+        #subprocess.call(["losetup", "/dev/loop%s" % os.getpid(), options.image_file])
         # Launch qemu
         cmdline = ["xl", "create"]
+        #cmdline = ["xl", "-v", "create"]
         if not options.detach:
             cmdline += ["-c"]
         cmdline += [xenfile.name]
@@ -425,9 +432,10 @@ def start_osv_xen(options):
     except:
         pass
     finally:
+        print("Before closing xenfile")
         xenfile.close()
         #delete loop device
-        subprocess.call(["losetup", "-d", "/dev/loop%s" % os.getpid()])
+        #subprocess.call(["losetup", "-d", "/dev/loop%s" % os.getpid()])
         cleanups()

and build for example lighpd:

./scripts/build fs=ramfs image=lighttpd -j$(nproc)

//Convert to raw image and run
sudo ./scripts/run.py -nv -c 1 -p xen --script vif-bridge -b virbr0
arg: builder='hvm'
arg: xen_platform_pci=1
arg: acpi=1
arg: apic=1
arg: boot='c'
arg: vncdisplay=1
arg: memory=2048
arg: vcpus=1
arg: maxcpus=1
arg: name='osv-28171'
arg: disk=['/home/wkozaczuk/projects/osv/usr.raw,raw,hda,rw']
arg: serial='pty'
arg: paused=0
arg: on_crash='preserve'
arg: vif=['bridge=virbr0,script=vif-bridge']
Parsing config from /tmp/tmp03rfpof1
OSv v0.57.0-171-g3c35189ec
1 CPUs detected
Firmware vendor: Xen
bsd: initializing - done
VFS: mounting ramfs at /
VFS: mounting devfs at /dev
net: initializing - done
vga: Add VGA device instance
eth0: ethernet address: 00:16:3e:58:bc:ea
create_netdev(): DONE
Back-end specified ring-pages of 16 limited to front-end limit of 15.
Back-end specified ring-pages of 15 is not a power of 2. Limited to 8.
Back-end specified max_requests of 512 limited to front-end limit of 256.
backend features: feature-sg feature-gso-tcp4
random: intel drng, rdrand registered as a source.
random: <Software, Yarrow> initialized
VFS: unmounting /dev
failed to pivot root, error = Invalid argument
VFS: mounting devfs at /dev
VFS: mounting procfs at /proc
VFS: mounting sysfs at /sys
[E/17 bsd-log]: eth0: link state changed to DOWN

[E/17 bsd-log]: eth0: link state changed to UP

[I/17 dhcp]: Broadcasting DHCPDISCOVER message with xid: [2112751773]
[I/17 dhcp]: Waiting for IP...
random: device unblocked.
[I/17 dhcp]: Broadcasting DHCPDISCOVER message with xid: [1798979624]
[I/17 dhcp]: Waiting for IP...
[I/17 dhcp]: Broadcasting DHCPDISCOVER message with xid: [1712394107]
[I/17 dhcp]: Waiting for IP...
[I/23 dhcp]: Received DHCPOFFER message from DHCP server: 192.168.122.1 regarding offerred IP address: 192.168.122.113
[I/23 dhcp]: Broadcasting DHCPREQUEST message with xid: [1712394107] to SELECT offered IP: 192.168.122.113
[I/17 dhcp]: Broadcasting DHCPDISCOVER message with xid: [706714133]
[I/17 dhcp]: Waiting for IP...
[I/23 dhcp]: Received DHCPOFFER message from DHCP server: 192.168.122.1 regarding offerred IP address: 192.168.122.113
[I/23 dhcp]: Broadcasting DHCPREQUEST message with xid: [706714133] to SELECT offered IP: 192.168.122.113
[I/23 dhcp]: DHCP received hostname: osv

[I/23 dhcp]: Received DHCPACK message from DHCP server: 192.168.122.1 regarding offerred IP address: 192.168.122.113
[W/23 dhcp]: Got packet with wrong transaction ID (706714133, 1712394107)
[I/23 dhcp]: DHCP received hostname: osv

[I/23 dhcp]: Received DHCPACK message from DHCP server: 192.168.122.1 regarding offerred IP address: 192.168.122.113
[I/23 dhcp]: Server acknowledged IP 192.168.122.113 for interface eth0 with time to lease in seconds: 3600
eth0: 192.168.122.113
[I/23 dhcp]: Configuring eth0: ip 192.168.122.113 subnet mask 255.255.255.0 gateway 192.168.122.1 MTU 1500
[I/23 dhcp]: Set hostname to: osv
Booted up in 11467.46 ms
Cmdline: /lighttpd.so -D -f /lighttpd/lighttpd.conf
2024-07-14 21:55:32: (configfile.c.1620) server.upload-dirs doesn't exist: /var/tmp 
2024-07-14 21:55:32: (server.c.1521) server started (lighttpd/1.4.54)

wkozaczuk commented 2 months ago

I should have also made clear that with a raw image and without the change to xenfront.cc, OSv would crash in read_partition_table() and output indicating it does not see the block device (maybe raw images are rejected quietly by qemu or xen runtime):

OSv v0.57.0-171-g3c35189ec
1 CPUs detected
Firmware vendor: Xen
bsd: initializing - done
VFS: mounting ramfs at /
VFS: mounting devfs at /dev
net: initializing - done
vga: Add VGA device instance
eth0: ethernet address: 00:16:3e:76:fa:fb
Back-end specified ring-pages of 16 limited to front-end limit of 15.
Back-end specified ring-pages of 15 is not a power of 2. Limited to 8.
Back-end specified max_requests of 512 limited to front-end limit of 256.
backend features: feature-sg feature-gso-tcp4
page fault outside application, addr: 0x0000000000000000
[registers]
RIP: 0x00000000403d4195 <???+1077756309>
RFL: 0x0000000000010246  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x0000000000000000  RBX: 0x00000000408b3b80  RCX: 0x0000000000000000  RDX: 0x0000600000c69200
RSI: 0x0000000000000001  RDI: 0x0000600000c88800  RBP: 0x0000400000cf2f30  R8:  0x0000600000c88800
R9:  0xfffffffffffff988  R10: 0x0000000000000000  R11: 0x00004000001f3190  R12: 0x0000600000c88800
R13: 0xfffffffffffff988  R14: 0x0000000000000000  R15: 0x00004000001f3190  RSP: 0x0000400000cf2f10
Aborted

[backtrace]
0x000000004020613f <???+1075863871>
0x000000004031d643 <page_fault+147>
0x000000004031c372 <???+1077003122>
0x00000000403d4619 <bread+73>
0x00000000403dca82 <read_partition_table+34>
0x0000000040315290 <xenfront::xenfront_driver::finished()+48>
0x00000000402ad9c3 <xenwatch_thread(void*)+179>
0x00000000403912bd <thread_main_c+45>
0x000000004031d541 <???+1077007681>

kevinross commented 2 months ago

I can confirm that commenting out those lines works the same way in XCP.

This gives me an idea on how I could "get it working" (as opposed to fixing it properly, ie "does it work at any level of performance"): have it ignore the xen blk device in favour of the IDE one. Pretty sure this could work, although I'm not 100% since I don't remember if I used xl or XCP for the test where I had no network but did have a useable disk. Something to try this weekend perhaps.

wbdkozaczuk commented 2 months ago

I also wanted to point out that the same Xen block device works on old AWS EC2 Xen-based instances. I did test it many times including last time beginning of this year.

gstrauss commented 2 months ago

Cmdline: /lighttpd.so -D -f /lighttpd/lighttpd.conf 2024-07-14 21:55:32: (configfile.c.1620) server.upload-dirs doesn't exist: /var/tmp 2024-07-14 21:55:32: (server.c.1521) server started (lighttpd/1.4.54)

You should fix /var/tmp or specify a different path to lighttpd.conf server.upload-dirs

lightptd 1.4.54 was released May 2019, over 5 years ago. The latest lighttpd release is lighttdp 1.4.76, so you're 22 (!!!) releases behind the latest lighttpd stable release.

wkozaczuk commented 2 months ago

In the last few days, I tried to dig deeper and debug this issue more, but I still fell short of finding the root cause. But I wanted to share all the findings, so maybe someone else may notice something and find the problem or continue research with what I have found.

To debug the problem (I still have not found an easy way to fix the gdbsx problem), I have put numerous printf calls in key places in the xen code (mostly bsd/sys/dev/xen/*, bsd/sys/xen/*, drivers/xenfront*). For those interested, I am linking the patch which can be applied to the latest master commit as of now.

To better understand what is going on, I decided to bypass run.py and directly use the xl utility with pre-defined xl config files (see https://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html) to boot OSv guest on Xen in 2 scenarios:

with block and networking device with raw image (converted from usr.img using qemu-img utility, see the conf file)
with block and networking device with qcow2 image (usr.img, the conf file)

sudo xl create -c /tmp/osv_raw.xl #1st scenario
sudo xl create -c /tmp/osv_img.xl #2nd scenario

In both cases, I bypassed the loop device and the whole losetup step. I am not sure we need this redirection in run.py as well. Also, having both block and networking devices would help me understand any differences how these are handled during the boot process.

In the first scenario (raw image), OSv would probe both devices, but eventually crash in the read_partition_table() function when trying to read 1st block, with a page fault triggered by null driver field of device structure. From what I can tell the block device would fail to initialize and start the closing phase and because of a logical bug OSv would call read_partition_table in xenfront_driver::finished() without checking the state of the device (see https://github.com/cloudius-systems/osv/blob/3c35189ec7b02ff3d8111caab26603f0feb84ff5/drivers/xenfront.cc#L102-L107). For full console output, please see this gist.

In the second scenario (qcow2 image), OSv also would probe both devices, but instead of crashing it would hang also in the read_partition_table() after calling the Xen block device strategy method. In this case, it looks like the block device gets initialized but OSv never receives corresponding interrupt, and the interrupt handler never gets called to complete the block strategy routine. For full console output, please see this gist.

It is also worth mentioning, that a raw or qcow2 image with ramfs filesystem and commenting out the read_partition_table() call in the xenfront_driver::finished(), boots just fine (as mentioned in one of my earlier comments) and Xen networking drivers seems to function perfectly fine. So there is something wrong with the initialization of the block device. What is puzzling, this issue was reported in June 2014, and a similar one later in 2016 on the mailing list (see https://groups.google.com/g/osv-dev/c/qXK3eIP9NAw/m/1ItfPqLCEgAJ), but then in many cases various users did not seem to find any issues with booting OSv on XEN - more detail to follow below.

To make sense of the information in the console logs I linked, I wanted to briefly explain (with my limited understanding) the device bootstrap process on Xen and what some key lines mean. More-less the device discovery process on Xen is initiated in the xenfront-xenbus driver, which eventually calls xenbusb_attach() which then calls xenbusb_enumerate_bus and xenbusb_probe_children. As the same time, the periodic task xenbusb_probe_children_cb is set up to call xenbusb_probe_children() and callback routine xenbusb_devices_changed() to call xenbusb_add_device() and enqueue subsequent call to xenbusb_probe_children_cb() under some conditions which are not clear to me.

I have annotated the linked console output with lines like --- (1), --- (2)`, etc to point to subsequent phases of the device probing process. Many lines in the output look like these:

xs_single: request:1, body:/local/domain/76/device/vbd
XENSTORE: xs_process_msg() received message:768

xs_single: request:2, body:device/vbd/768/backend
XENSTORE: xs_process_msg() received message:/local/domain/0/backend/qdisk/76/768

to capture interactions between OSv and Xen hypervisor to list directory (1st example) or read a node value (2nd example) in the Xen tree.

One can see that in both scenarios (raw and qcow2), the xenbusb_probe_children_cb() gets called 10 times (why 10?), and at the same time the xenbusb_devices_changed() callback gets invoked 14 times. What is interesting, the xenbusb_devices_changed() is only called for the block device (vbd) (not for vif). In addition in all these cases one can notice this in the output:

xenbusb_add_device: xbs_node:device, type:vbd, id:768/ => devpath:device/vbd/768/
xs_exists dir:device/vbd/768/, node:
xs_directory: dir:device/vbd/768/ node:
xs_single: request:1, body:device/vbd/768/
XENSTORE: xs_process_msg() received message:backend

and the trailing slash in the id (see 1st line).

I wonder if we have a bug in this parsing logic. I have tried to force it to remove the trailing / from the id, but this did not help. I wonder if Xen reports a change to the block device by appending some extra children nodes under device/vbd/768/ (like xda) or something and it is important to act on these. Just a speculation. BTW the Xen block interface and protocol seem to be described here - https://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,blkif.h.html.

It is also worth mentioning that many people reported Xen worked at various points in time and it also works on AWS old Xen-based EC2 instances:

https://github.com/cloudius-systems/osv/issues/550 (from Nov 2014)
https://groups.google.com/g/osv-dev/c/5lfjHkZS-pk/m/B1DO9kadAQAJ (from 2017)
https://biblio.ugent.be/publication/8582433/file/8582438 (from 2019 on XenServer 7.5)

Could it be a configuration issue or a bug in how we implement block device discovery which only affects certain versions on Xen runtime?

kevinross commented 2 months ago

I find it interesting that the behaviour is different depending on the disk format, raw vs qcow2. Theoretically, qemu should be presenting the device no differently when using one vs the other, right? ie the "this is a block device, it's capacity is x, sector size y, etc" info presented to the guest should be identical between the two: how could the guest use the fact the underlying storage is qcow2, much less care? It would only matter to qemu, since it needs to know the format if there is one.

Note: for my own tests, I used a loop device backed by a raw image that I had created ahead of time. This almost definitely affects how qemu uses the disk (I mean, it could still use the same POSIX file IO calls as on a file, but it may also try to ioctl on it, dunno) but again shouldn't change how the guest sees it.

wkozaczuk commented 2 months ago

I have added an extra bit of logging in xenbusb_devices_changed() to print the node value before parsing and this clarifies things a bit:

xenbusb_devices_changed: node:[device]
xenbusb_devices_changed: node:[device/vbd/768/ring-ref2]
xenbusb_devices_changed: node:[device/vbd/768/ring-ref3]
xenbusb_devices_changed: node:[device/vbd/768/ring-ref4]
xenbusb_devices_changed: node:[device/vbd/768/ring-ref5]
xenbusb_devices_changed: node:[device/vbd/768/ring-ref6]
xenbusb_devices_changed: node:[device/vbd/768/ring-ref7]
xenbusb_devices_changed: node:[device/vbd/768/num-ring-pages]
xenbusb_devices_changed: node:[device/vbd/768/ring-page-order]
xenbusb_devices_changed: node:[device/vbd/768/max-requests]
xenbusb_devices_changed: node:[device/vbd/768/max-request-segments]
xenbusb_devices_changed: node:[device/vbd/768/max-request-size]
xenbusb_devices_changed: node:[device/vbd/768/event-channel]
xenbusb_devices_changed: node:[device/vbd/768/protocol]
xenbusb_devices_changed: node:[device/vbd/768/state]

What is interesting those seem to correspond to most writes except 1st two are missing (is this a symptom of something wrong?):

xs_write: path:device/vbd/768/state, string:1
xs_write: path:device/vbd/768/ring-ref0, string:768
xs_write: path:device/vbd/768/ring-ref1, string:769
xs_write: path:device/vbd/768/ring-ref2, string:770
xs_write: path:device/vbd/768/ring-ref3, string:771
xs_write: path:device/vbd/768/ring-ref4, string:772
xs_write: path:device/vbd/768/ring-ref5, string:773
xs_write: path:device/vbd/768/ring-ref6, string:774
xs_write: path:device/vbd/768/ring-ref7, string:775
xs_write: path:device/vbd/768/num-ring-pages, string:8
xs_write: path:device/vbd/768/ring-page-order, string:3
xs_write: path:device/vbd/768/max-requests, string:256
xs_write: path:device/vbd/768/max-request-segments, string:11
xs_write: path:device/vbd/768/max-request-size, string:40960
xs_write: path:device/vbd/768/event-channel, string:5
xs_write: path:device/vbd/768/protocol, string:x86_64-abi
xs_write: path:device/vbd/768/state, string:3
xs_write: path:device/vbd/768/state, string:5
xs_write: path:device/vbd/768/state, string:6

Also, OSv never gets similar callbacks the networking device.

Full new console logs are here for both raw and qcow2 images.

wkozaczuk commented 2 months ago

OK, I have found a culprit - a bug in setting up or handling block device with the multi-page ring buffer scheme. In other words, if I force the block device initialization code to use 1 for sc->ring_pages, OSv boots properly with both raw or qcow2 images. For details see this code - for example, if you comment the line 978 (with if), OSv would boot fine.

To be precise, I do not know what and where exactly this bug is, but I have narrowed down when it occurs. Possibly it is somewhere in setup_blkring().

For more info about multi-page buffer scheme see "Request Transport Parameters" section in https://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,blkif.h.html.

BTW I bet in all these cases, when block device work on Xen, xen runtime was configured or reported max-ring-pages as 1 or max-ring-page-order as 0.

kevinross commented 2 months ago

That exact suggestion got my application to boot and function normally, thanks!

BTW I bet in all these cases, when block device work on Xen, xen runtime was configured or reported max-ring-pages as 1 or max-ring-page-order as 0.

I see your bet and double it.

wkozaczuk commented 2 months ago

I have pinpointed the exact cause - non-contiguous ring buffer allocation which matters when the buffer is longer than 1 page.

This patch illustrates it:

diff --git a/bsd/sys/dev/xen/blkfront/blkfront.cc b/bsd/sys/dev/xen/blkfront/blkfront.cc
index 08fcab3ab..4e3cd7219 100644
--- a/bsd/sys/dev/xen/blkfront/blkfront.cc
+++ b/bsd/sys/dev/xen/blkfront/blkfront.cc
@@ -98,6 +98,8 @@ static int blkif_completion(struct xb_command *);
 static void blkif_free(struct xb_softc *);
 static void blkif_queue_cb(void *, bus_dma_segment_t *, int, int);

+extern "C" void* alloc_contiguous_aligned(size_t size, size_t align);
+
 #define GRANT_INVALID_REF 0

 /* Control whether runtime update of vbds is enabled. */
@@ -1125,8 +1127,11 @@ setup_blkring(struct xb_softc *sc)
     int error;
     int i;

-    sring = (blkif_sring_t *)malloc(sc->ring_pages * PAGE_SIZE, M_XENBLOCKFRONT,
-               M_NOWAIT|M_ZERO);
+    //sring = (blkif_sring_t *)malloc(sc->ring_pages * PAGE_SIZE, M_XENBLOCKFRONT,
+    //           M_NOWAIT|M_ZERO);
+    sring = (blkif_sring_t *)alloc_contiguous_aligned(sc->ring_pages * PAGE_SIZE, PAGE_SIZE);
+    memset(sring, 0, sc->ring_pages * PAGE_SIZE);
+
     if (sring == NULL) {
         xenbus_dev_fatal(sc->xb_dev, ENOMEM, "allocating shared ring");
         return (ENOMEM);

I will prepare a pull request to fix this 10-year-old issue.

kevinross commented 2 months ago

that's definitely part of it, I'm not sure it's the whole story though: a second disk attached to the VM causes it to block, waiting for presumably something different this time, see logs and gdb here

wkozaczuk commented 2 months ago

Please send me the xl configuration file so that I can recreate it.

On Sat, Jul 20, 2024 at 13:37 Kevin Ross @.***> wrote:

that's definitely part of it, I'm not sure it's the whole story though: a second disk attached to the VM causes it to block, waiting for presumably something different this time, see logs and gdb here https://gist.github.com/kevinross/42ccec3b93f27aa7e4270d5b993ff1c9

— Reply to this email directly, view it on GitHub https://github.com/cloudius-systems/osv/issues/345#issuecomment-2241239847, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABINEINF5QD3I4GC464Z5QLZNKN57AVCNFSM6AAAAABJXRZI6KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENBRGIZTSOBUG4 . You are receiving this because you were mentioned.Message ID: @.***>

kevinross commented 2 months ago

I don't have one unfortunately, this is done with XCP and the tooling doesn't use xl configs, using it's own config database instead. Best I can do is the qemu command line:

qemu-dm-31 -machine pc-0.10,accel=xen,max-ram-below-4g=4026531840,suppress-vmdesc=on,allow-unassigned=true,trad_compat=True -vnc unix:/var/run/xen/vnc-31,lock-key-sync=off -monitor null -pidfile /var/run/xen/qemu-dm-31.pid -xen-domid 31 -m size=504 -boot order=cdn -usb -device usb-tablet,port=2 -smp 1,maxcpus=1 -serial pty -display none -nodefaults -trace enable=xen_platform_log -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -S -global PIIX4_PM.revision_id=0x1 -global ide-hd.ver=0.10.2 -global piix3-ide-xen.subvendor_id=0x5853 -global piix3-ide-xen.subsystem_id=0x0001 -global piix3-usb-uhci.subvendor_id=0x5853 -global piix3-usb-uhci.subsystem_id=0x0001 -global rtl8139.subvendor_id=0x5853 -global rtl8139.subsystem_id=0x0001 -parallel null -qmp unix:/var/run/xen/qmp-libxl-31,server,nowait -qmp unix:/var/run/xen/qmp-event-31,server,nowait -device xen-platform,addr=3,device-id=0x0001,revision=0x2,class-id=0x0100,subvendor_id=0x5853,subsystem_id=0x0001 -drive file=,if=none,id=ide1-cd1,read-only=on -device ide-cd,drive=ide1-cd1,bus=ide.1,unit=1 -drive file=/dev/sm/backend/3db41e9f-d560-15d1-d20f-8fe37c721f9b/523dc539-0ff4-4b1d-9943-483c290f5c0e,if=none,id=ide0-hd0,auto-read-only=off,format=raw -device ide-hd,drive=ide0-hd0,bus=ide.0,unit=0,bios-chs-trans=forcelba -drive file=/dev/sm/backend/3db41e9f-d560-15d1-d20f-8fe37c721f9b/a02b1701-ef4b-4ac0-94a9-05bfa96950a1,if=none,id=ide0-hd1,auto-read-only=off,format=raw -device ide-hd,drive=ide0-hd1,bus=ide.0,unit=1,bios-chs-trans=forcelba -device rtl8139,netdev=tapnet0,mac=e2:95:52:68:23:28,addr=4 -netdev tap,id=tapnet0,fd=7 -device VGA,vgamem_mb=8,addr=2,romfile=,rombar=1,subvendor_id=0x5853,subsystem_id=0x0001,qemu-extended-regs=false -vnc-clipboard-socket-fd 4 -chardev stdio,id=ovmf -device isa-debugcon,chardev=ovmf,iobase=0x402 -xen-domid-restrict -chroot /var/xen/qemu/root-31 -runas 65566:1022

Note that I have zero control over the disk and network parameters, with the only exception being what model of NIC device to use. The disks are effectively loop devices, passed to qemu as type=raw (as you can see above), using something similar to qemu-nbd to do the I/O on the disk image. No control at all here over the model or bus type, it's all hardcoded in XCP.

wkozaczuk commented 2 months ago

I just got home and tried this xl config which specifies 2 disks (the 2nd is raw):

 cat /tmp/osv_img_plus_raw.xl 
builder='hvm'
xen_platform_pci=1
acpi=1
apic=1
boot='c'
vncdisplay=1
memory=2048
vcpus=1
maxcpus=1
name='osv-5006'
disk=['/home/wkozaczuk/projects/osv/usr.img,qcow2,hda,rw','/home/wkozaczuk/projects/osv/usr.raw,raw,hdb,rw']
serial='pty'
paused=0
on_crash='preserve'
console_autoconnect=1
vif=['bridge=virbr0,script=vif-bridge']

And I can successfully boot OSv with both disks probed:

sudo xl create -c /tmp/osv_img_plus_raw.xl
Parsing config from /tmp/osv_img_plus_raw.xl
OSv v0.57.0-171-g3c35189ec
1 CPUs detected
Firmware vendor: Xen
bsd: initializing - done
VFS: mounting ramfs at /
VFS: mounting devfs at /dev
net: initializing - done
vga: Add VGA device instance
Back-end specified ring-pages of 16 limited to front-end limit of 15.
Back-end specified ring-pages of 15 is not a power of 2. Limited to 8.
Back-end specified max_requests of 512 limited to front-end limit of 256.
eth0: ethernet address: 00:16:3e:77:7c:d9
Back-end specified ring-pages of 16 limited to front-end limit of 15.
Back-end specified ring-pages of 15 is not a power of 2. Limited to 8.
Back-end specified max_requests of 512 limited to front-end limit of 256.
backend features: feature-sg feature-gso-tcp4
5MB <Virtual Block Device> at device/vbd/768 attaching as vblk0
devfs: created device vblk0.1 for a partition at offset:4194304 with size:1716224
512MB <Virtual Block Device> at device/vbd/832 attaching as vblk1
devfs: created device vblk1.1 for a partition at offset:6291456 with size:530579456
random: intel drng, rdrand registered as a source.
random: <Software, Yarrow> initialized
VFS: unmounting /dev
VFS: mounting rofs at /rofs
VFS: mounting devfs at /dev
VFS: mounting procfs at /proc
VFS: mounting sysfs at /sys
VFS: mounting ramfs at /tmp
[E/17 bsd-log]: eth0: link state changed to DOWN

[E/17 bsd-log]: eth0: link state changed to UP

[I/17 dhcp]: Broadcasting DHCPDISCOVER message with xid: [1542625062]
[I/17 dhcp]: Waiting for IP...
random: device unblocked.
[I/17 dhcp]: Broadcasting DHCPDISCOVER message with xid: [2125495514]
[I/17 dhcp]: Waiting for IP...
[I/17 dhcp]: Broadcasting DHCPDISCOVER message with xid: [396007948]
[I/17 dhcp]: Waiting for IP...
[I/23 dhcp]: Received DHCPOFFER message from DHCP server: 192.168.122.1 regarding offerred IP address: 192.168.122.110
[I/23 dhcp]: Broadcasting DHCPREQUEST message with xid: [396007948] to SELECT offered IP: 192.168.122.110
[I/23 dhcp]: DHCP received hostname: osv

[I/23 dhcp]: Received DHCPACK message from DHCP server: 192.168.122.1 regarding offerred IP address: 192.168.122.110
[I/23 dhcp]: Server acknowledged IP 192.168.122.110 for interface eth0 with time to lease in seconds: 3600
eth0: 192.168.122.110
[I/23 dhcp]: Configuring eth0: ip 192.168.122.110 subnet mask 255.255.255.0 gateway 192.168.122.1 MTU 1500
[I/23 dhcp]: Set hostname to: osv
Booted up in 11153.70 ms
Cmdline: /lighttpd.so -D -f /lighttpd/lighttpd.conf
2024-07-20 22:32:35: (configfile.c.1620) server.upload-dirs doesn't exist: /var/tmp 
2024-07-20 22:32:35: (server.c.1521) server started (lighttpd/1.4.54) 
2024-07-20 22:33:03: (server.c.969) [note] graceful shutdown started 
2024-07-20 22:33:03: (server.c.2091) server stopped by UID = 0 PID = 0 
[I/0 dhcp]: Unicasting DHCPRELEASE message with xid: [680800914] from client: 192.168.122.110 to server: 192.168.122.1
VFS: unmounting /dev
VFS: unmounting /proc
VFS: unmounting /
ROFS: spent 4.41 ms reading from disk
ROFS: read 953 512-byte blocks from disk
ROFS: allocated 1085 512-byte blocks of cache memory
ROFS: hit ratio is 99.90%
Powering off.

I wonder if there is some configuration difference on your side.

Just in case my most recent full patch:

diff --git a/bsd/sys/dev/xen/blkfront/blkfront.cc b/bsd/sys/dev/xen/blkfront/blkfront.cc
index 08fcab3ab..1a29b3193 100644
--- a/bsd/sys/dev/xen/blkfront/blkfront.cc
+++ b/bsd/sys/dev/xen/blkfront/blkfront.cc
@@ -33,6 +33,7 @@
 __FBSDID("$FreeBSD$");

 #include <osv/mempool.hh>
+#include <osv/contiguous_alloc.hh>
 #include <bsd/porting/netport.h>
 #include <bsd/porting/synch.h>
 #include <bsd/porting/bus.h>
@@ -234,7 +235,7 @@ xlvbd_add(struct xb_softc *sc, blkif_sector_t sectors,
     sc->xb_unit = unit;

     if (strcmp(name, "xbd"))
-        device_printf(sc->xb_dev, "attaching as %s%d\n", name, unit);
+        device_printf(sc->xb_dev, " attaching as %s%d\n", name, unit);

     sc->xb_disk = disk_alloc();
     sc->xb_disk->d_unit = sc->xb_unit;
@@ -1125,8 +1126,9 @@ setup_blkring(struct xb_softc *sc)
     int error;
     int i;

-    sring = (blkif_sring_t *)malloc(sc->ring_pages * PAGE_SIZE, M_XENBLOCKFRONT,
-               M_NOWAIT|M_ZERO);
+    sring = (blkif_sring_t *)memory::alloc_phys_contiguous_aligned(
+            sc->ring_pages * PAGE_SIZE, PAGE_SIZE);
+    memset(sring, 0, sc->ring_pages * PAGE_SIZE);
     if (sring == NULL) {
         xenbus_dev_fatal(sc->xb_dev, ENOMEM, "allocating shared ring");
         return (ENOMEM);
@@ -1865,7 +1867,7 @@ blkif_free(struct xb_softc *sc)
             }
             sring_page_ptr += PAGE_SIZE;
         }
-        free(sc->ring.sring, M_XENBLOCKFRONT);
+        memory::free_phys_contiguous_aligned(sc->ring.sring);
         sc->ring.sring = NULL;
     }

diff --git a/bsd/sys/dev/xen/netfront/netfront.cc b/bsd/sys/dev/xen/netfront/netfront.cc
index c2b40aca0..244063a5a 100644
--- a/bsd/sys/dev/xen/netfront/netfront.cc
+++ b/bsd/sys/dev/xen/netfront/netfront.cc
@@ -33,6 +33,7 @@ __FBSDID("$FreeBSD$");
 #include <bsd/porting/synch.h>
 #include <bsd/porting/kthread.h>
 #include <bsd/porting/callout.h>
+#include <osv/contiguous_alloc.hh>

 #include <sys/param.h>
 #include <sys/systm.h>
@@ -628,24 +629,26 @@ setup_device(device_t dev, struct netfront_info *info)
     info->tx.sring = NULL;
     info->irq = 0;

-    txs = (netif_tx_sring_t *)malloc(PAGE_SIZE, M_DEVBUF, M_NOWAIT|M_ZERO);
+    txs = (netif_tx_sring_t *)memory::alloc_phys_contiguous_aligned(PAGE_SIZE, PAGE_SIZE);
     if (!txs) {
         error = ENOMEM;
         xenbus_dev_fatal(dev, error, "allocating tx ring page");
         goto fail;
     }
+    memset(txs, 0, PAGE_SIZE);
     SHARED_RING_INIT(txs);
     FRONT_RING_INIT(&info->tx, txs, PAGE_SIZE);
     error = xenbus_grant_ring(dev, virt_to_mfn(txs), &info->tx_ring_ref);
     if (error)
         goto fail;

-    rxs = (netif_rx_sring_t *)malloc(PAGE_SIZE, M_DEVBUF, M_NOWAIT|M_ZERO);
+    rxs = (netif_rx_sring_t *)memory::alloc_phys_contiguous_aligned(PAGE_SIZE, PAGE_SIZE);
     if (!rxs) {
         error = ENOMEM;
         xenbus_dev_fatal(dev, error, "allocating rx ring page");
         goto fail;
     }
+    memset(rxs, 0, PAGE_SIZE);
     SHARED_RING_INIT(rxs);
     FRONT_RING_INIT(&info->rx, rxs, PAGE_SIZE);

@@ -1009,6 +1012,8 @@ push:
         notify_remote_via_irq(sc->irq);
 }

+std::atomic<long> netfront_fastpath;
+
 static void
 xn_rxeof(struct netfront_info *np)
 {
@@ -1118,7 +1123,14 @@ xn_rxeof(struct netfront_info *np)
                  * If LRO fails, pass up to the stack
                  * directly.
                  */
-                (*ifp->if_input)(ifp, m);
+                //(*ifp->if_input)(ifp, m);
+                bool fast_path = ifp->if_classifier.post_packet(m);
+                if (!fast_path) {
+                    (*ifp->if_input)(ifp, m);
+                } else {
+                    netfront_fastpath.fetch_add(1);
+                }
+
             }
 #else
             (*ifp->if_input)(ifp, m);

wkozaczuk commented 2 months ago

I also wonder if the issue possibly relates to the disk size - 32768MB? Can you try smaller disk?

kevinross commented 2 months ago

I think there was some ephemeral problem in my environment because after reverting the ring_pages=1 change and applying the diff above (including the netfront one), I'm up and running with 2 disks (where the size of the second is > 64GB)!

edit to add: it appears there's something wrong with rofs, not sure if it's related:

OSv v0.57.0-173-g46d5eabf
2 CPUs detected
Firmware vendor: Xen
bsd: initializing - done
VFS: mounting ramfs at /
VFS: mounting devfs at /dev
net: initializing - done
vga: Add VGA device instance
eth0: ethernet address: f6:22:4b:80:54:b6
backend features: feature-sg feature-gso-tcp4
22MB <Virtual Block Device> at device/vbd/768 attaching as vblk0
devfs: created device vblk0.1 for a partition at offset:4194304 with size:19537920
65535MB <Virtual Block Device> at device/vbd/832 attaching as vblk1
devfs: created device vblk1.0 for a partition at offset:512 with size:240057409024
Warning: No hardware source of entropy available to your platform,
    CSPRNG will rely on software source of entropy to provide high-quality randomness.
random: <Software, Yarrow> initialized
VFS: unmounting /dev
VFS: mounting rofs at /rofs
Assertion failed: !safemode_or_autounlink || node_algorithms::inited(to_insert) (/usr/include/boost/intrusive/list.hpp: push_front: 291)

[backtrace]
0x0000000040242468 <__assert_fail+24>
0x000000004036bb3c <???+1077328700>
0x00000000403717b5 <void memory::page_range_allocator::insert<true>(memory::page_range&)+773>
0x000000004036e4ea <free+138>
0x00000000403d388d <???+1077753997>
0x00000000403cab57 <sys_mount+695>
0x00000000403c82e8 <mount_rofs_rootfs+56>
0x0000000040235d3d <do_main_thread(void*)+6653>
0x00000000403ef3d5 <???+1077867477>
0x0000000040388457 <thread_main_c+39>
0x0000000040316271 <???+1076978289>

wkozaczuk commented 2 months ago

This stack trace is identical to what I reported here - #1033. I will try to reproduce it in my environment.

wkozaczuk commented 2 months ago

I cannot reproduce it in my environment. Would you mind connecting with gdbx/gdb and collecting stack traces of all threads?

wkozaczuk commented 2 months ago

Also, after you capture that, could you please apply this patch and see if the problem goes away?

wkozaczuk commented 2 months ago

I was able to reproduce your issue with a Java image that has enough files to make the dir entries table exceed 4K (1 page). And I can confirm my patch fixes it.

kevinross commented 2 months ago

Still want me to grab those stack traces?

wkozaczuk commented 2 months ago

Nope. But please confirm if my latest patch fixes the issue with ROFS on your side. Just wanted to be double sure.

kevinross commented 2 months ago

Will do this weekend, thanks for helping to investigate and solve this!

wkozaczuk commented 1 month ago

@kevinross I have added a new Wiki page - https://github.com/cloudius-systems/osv/wiki/Running-OSv-on-Xen - to try to better document how to run OSv on XEN. Feel free to improve it and specifically add some info about running OSv on XCP-NG.

kevinross commented 1 month ago

Apologies for the late follow-up, life got busy.

Re: rofs? This works now!

kevinross commented 1 month ago

And wiki page updated with a short guide to boot OSv in a VM, along with serial console access and GDB debugging.

wkozaczuk commented 1 month ago

Thanks! Have you updated this one - https://github.com/cloudius-systems/osv/wiki/Running-OSv-on-Xen or created a new Wiki?

kevinross commented 1 month ago

Np!

Updated the one you linked