ProjectMitosisOS / mitosis-core

An OS kernel module for fast **remote** fork using advanced datacenter networking (RDMA).
MIT License

Build failure in rust-kernel-module during compilation #4

Open ShuguiW opened 3 months ago

ShuguiW commented 3 months ago

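Hello, when I run `make km`, an error occurs while compiling rust-kernel-module; the error output is below. I located `build.rs` and added print statements to it, and the failure appears to occur at the line `let bindings = builder.generate().expect("Unable to generate bindings");`. The same error is also reported at the same location when compiling rust-kernel-linux-util. I would appreciate any suggestions on the cause of the error and how to fix it.

(base) crow@crow-H310M-T-PRO:~/mitosis-core$ make km
cd mitosis-kms ; python build.py fork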

Caused by:
process didn't exit successfully: /home/crow/mitosis-core/mitosis-kms/fork/../target/debug/build/linux-kernel-module-e361eb8a63f7b057/build-script-build (exit status: 101)
--- stdout
cargo:rerun-if-env-changed=CC
cargo:rerun-if-env-changed=KDIR
cargo:rerun-if-env-changed=c_flags
cargo:rerun-if-changed=src/bindings_helper.h
cargo:rerun-if-changed=src/inline_helper.h
rust-kernel-module/186
rust-kernel-module/190
rust-kernel-module/197
set opaque type:desc_struct
set opaque type:xregs_state
rust-kernel-module/200

--- stderr
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:17:9: error: unknown type name '__kernel_ino_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:18:9: error: unknown type name '__kernel_mode_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:21:9: error: unknown type name '__kernel_off_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:22:9: error: unknown type name '__kernel_pid_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:23:9: error: unknown type name '__kernel_daddr_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:25:9: error: unknown type name '__kernel_suseconds_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:26:9: error: unknown type name '__kernel_timer_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:27:9: error: unknown type name '__kernel_clockid_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:32:9: error: unknown type name '__kernel_uid32_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:33:9: error: unknown type name '__kernel_gid32_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:34:9: error: unknown type name '__kernel_uid16_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:35:9: error: unknown type name '__kernel_gid16_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:41:9: error: unknown type name '__kernel_old_uid_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:42:9: error: unknown type name '__kernel_old_gid_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:46:9: error: unknown type name '__kernel_loff_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:55:9: error: unknown type name '__kernel_size_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:60:9: error: unknown type name '__kernel_ssize_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:65:9: error: unknown type name '__kernel_ptrdiff_t'
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:70:9: error: unknown type name '__kernel_time_t'
fatal error: too many errors emitted, stopping now [-ferror-limit=]
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:17:9: error: unknown type name '__kernel_ino_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:18:9: error: unknown type name '__kernel_mode_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:21:9: error: unknown type name '__kernel_off_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:22:9: error: unknown type name '__kernel_pid_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:23:9: error: unknown type name '__kernel_daddr_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:25:9: error: unknown type name '__kernel_suseconds_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:26:9: error: unknown type name '__kernel_timer_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:27:9: error: unknown type name '__kernel_clockid_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:32:9: error: unknown type name '__kernel_uid32_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:33:9: error: unknown type name '__kernel_gid32_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:34:9: error: unknown type name '__kernel_uid16_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:35:9: error: unknown type name '__kernel_gid16_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:41:9: error: unknown type name '__kernel_old_uid_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:42:9: error: unknown type name '__kernel_old_gid_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:46:9: error: unknown type name '__kernel_loff_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:55:9: error: unknown type name '__kernel_size_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:60:9: error: unknown type name '__kernel_ssize_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:65:9: error: unknown type name '__kernel_ptrdiff_t', err: true
/lib/modules/4.15.0-46-generic/build/./include/linux/types.h:70:9: error: unknown type name '__kernel_time_t', err: true
fatal error: too many errors emitted, stopping now [-ferror-limit=], err: true
thread 'main' panicked at 'Unable to generate bindings: ()', /home/crow/mitosis-core/deps/krcore/rust-kernel-rdma/deps/rust-kernel-module/build.rs:203:39
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
error: build failed
/home/crow/mitosis-core/mitosis-kms/Kbuild:11: recipe for target '/home/crow/mitosis-core/mitosis-kms/target/x86_64-unknown-none-linuxkernel/debug/libfork.a' failed
make[3]: *** [/home/crow/mitosis-core/mitosis-kms/target/x86_64-unknown-none-linuxkernel/debug/libfork.a] Error 101
Makefile:1551: recipe for target '_module_/home/crow/mitosis-core/mitosis-kms' failed
make[2]: *** [_module_/home/crow/mitosis-core/mitosis-kms] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-4.15.0-46-generic'
Makefile:38: recipe for target 'all' failed
make[1]: *** [all] Error 2
make[1]: Leaving directory '/home/crow/mitosis-core/mitosis-kms'
Traceback (most recent call last):
  File "/home/crow/mitosis-core/mitosis-kms/build.py", line 36, in <module>
    main(sys.argv)
  File "/home/crow/mitosis-core/mitosis-kms/build.py", line 28, in main
    run(
  File "/home/crow/mitosis-core/mitosis-kms/build.py", line 15, in run
    subprocess.check_call(list(args), cwd=cwd, env=environ)
  File "/home/crow/miniconda3/lib/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['make', '-C', '/home/crow/mitosis-core/mitosis-kms', 'TEST_NAME=fork', 'TEST_PATH=fork']' returned non-zero exit status 2.
makefile:12: recipe for target 'km' failed
make: *** [km] Error 1

wxdwfc commented 3 months ago

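Hello, I looked at your environment and it seems fine. Could you confirm that the software environment (kernel headers, Rust version, and clang version) was installed as described in the README? If it was, try running `rm mitosis-kms/.cache.mk` and then build again.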

ShuguiW commented 3 months ago

Thank you very much. Following your suggestion, `make km` now builds successfully. I then tried the subsequent remote-fork example and ran into a few small problems that I would like to ask about. Running `make insmod` produced the following error:

(base) crow@crow-H310M-T-PRO:~/mitosis-core$ make insmod
sudo rmmod fork ; sudo insmod mitosis-kms/fork.ko mac_id=0
[sudo] password for crow:
rmmod: ERROR: Module fork is not currently loaded
Segmentation fault (core dumped)
makefile:15: recipe for target 'insmod' failed
make: *** [insmod] Error 139

I don't know where this segmentation fault comes from. Because of it, the file /dev/mitosis-syscalls was not created. I then ran `make rmmod`, which failed as follows:

(base) crow@crow-H310M-T-PRO:~/mitosis-core$ make rmmod
sudo rmmod fork
rmmod: ERROR: Module fork is in use
makefile:18: recipe for target 'rmmod' failed
make: *** [rmmod] Error 1

As expected, connecting to the other machine then failed:

(base) crow@crow-H310M-T-PRO:~/mitosis-core/exp$ ./connector -gid="fe80:0000:0000:0000:1270:fdff:fe39:0e7a" -mac_id=0 -nic_id=0
connect res: -1

Below are the results of `show_gids` and `ibstatus`:

(base) crow@crow-H310M-T-PRO:~/mitosis-core$ show_gids
DEV     PORT  INDEX  GID                                      IPv4          VER  DEV
---     ----  -----  ---                                      ------------  ---  ---
mlx5_0  1     0      fe80:0000:0000:0000:1270:fdff:fe39:0e92                v1   enp1s0f0
mlx5_0  1     1      fe80:0000:0000:0000:1270:fdff:fe39:0e92                v2   enp1s0f0
mlx5_1  1     0      fe80:0000:0000:0000:1270:fdff:fe39:0e93                v1   enp1s0f1
mlx5_1  1     1      fe80:0000:0000:0000:1270:fdff:fe39:0e93                v2   enp1s0f1
n_gids_found=4

(base) crow@crow-H310M-T-PRO:~/mitosis-core$ ibstatus
Infiniband device 'mlx5_0' port 1 status:
    default gid:    fe80:0000:0000:0000:1270:fdff:fe39:0e92
    base lid:       0x0
    sm lid:         0x0
    state:          4: ACTIVE
    phys state:     5: LinkUp
    rate:           100 Gb/sec (4X EDR)
    link_layer:     Ethernet

Infiniband device 'mlx5_1' port 1 status:
    default gid:    fe80:0000:0000:0000:1270:fdff:fe39:0e93
    base lid:       0x0
    sm lid:         0x0
    state:          1: DOWN
    phys state:     3: Disabled
    rate:           40 Gb/sec (4X QDR)
    link_layer:     Ethernet

The connection failure is most likely a consequence of /dev/mitosis-syscalls not having been created earlier. I hope you can offer some suggestions; thank you very much.

In addition, I also tried the micro-benchmark, and its output looks rather odd as well:

(mitosis) crow@crow-H310M-T-PRO:~/mitosis-core/exp_scripts$ make micro-c-prepare
rm -rf out/micro-c-prepare
python toml_generator.py -f templates-run/micro-c/template-run-micro-prepare.toml -o out/micro-c-prepare -d "{ 'pwd':'218','user':'crow', 'hosts':{'builder':['crow-H310M-T-PRO',] , 'parent':['crow-H310M-T-PRO'], 'child':[], },'path':'projects/mos', 'placeholder': {'parent_gid': 'fe80:0000:0000:0000:1270:fdff:fe39:0e92', 'parent_host': 'crow-H310M-T-PRO', 'child_hosts': ''} } "
creating toml output dir out/micro-c-prepare
python evaluation_runner.py --input out/micro-c-prepare --arguments="-k=" --filter="Prepare"
trace 1048576
finish
run-1048576.toml
trace 1073741824
finish
run-1073741824.toml
trace 134217728
finish
run-134217728.toml

wxdwfc commented 3 months ago

Hello, it looks like the kernel module was not loaded successfully. Could you check the error messages in `dmesg`?
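As a rough illustration of what to look for, the sketch below filters `dmesg` text for lines that typically mark a failed module load. The function name `find_failures`, the marker list, and the sample input are assumptions made for this example; they are not part of MITOSIS.

```python
import re

# Matches the "[ 1234.567890] " timestamp prefix that dmesg prints.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")

# Substrings that commonly mark a failure (an assumed, non-exhaustive list).
MARKERS = ("panicked at", "kernel BUG at", "Oops", "Call Trace")


def find_failures(dmesg_text):
    """Return dmesg lines (timestamp stripped) that contain a failure marker."""
    hits = []
    for line in dmesg_text.splitlines():
        body = TIMESTAMP.sub("", line.strip())
        if any(marker in body for marker in MARKERS):
            hits.append(body)
    return hits


if __name__ == "__main__":
    # Hypothetical sample input; in practice, pass the captured output of `dmesg`.
    sample = (
        "[   10.000001] example: module init started\n"
        "[   10.000002] example: panicked at 'should not fail', src/lib.rs:1:1\n"
        "[   10.000003] kernel BUG at src/helpers.c:14!\n"
    )
    for hit in find_failures(sample):
        print(hit)
```

Running it over the full log after a failed `make insmod` narrows a long dump to the few lines worth quoting in a report.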

wxdwfc commented 3 months ago

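Also, I see that you are using RoCE. Because of our machine setup, we have not tested RoCE; we recommend running the experiments on an InfiniBand (IB) network.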
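The link layer of each port can be read from `ibstatus`: RoCE ports report `Ethernet`, while InfiniBand ports report `InfiniBand`. A minimal sketch, assuming the usual `ibstatus` text layout; `link_layers` is a hypothetical helper name:

```python
import re

# Header line that starts each device/port section in ibstatus output.
DEVICE = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")
# The link_layer field inside a section.
LINK_LAYER = re.compile(r"link_layer:\s*(\S+)")


def link_layers(ibstatus_text):
    """Map (device, port) to the link layer string reported by ibstatus."""
    result = {}
    current = None
    for line in ibstatus_text.splitlines():
        header = DEVICE.search(line)
        if header:
            current = (header.group(1), int(header.group(2)))
            continue
        field = LINK_LAYER.search(line)
        if field and current is not None:
            result[current] = field.group(1)
    return result


if __name__ == "__main__":
    sample = (
        "Infiniband device 'mlx5_0' port 1 status:\n"
        "    state:       4: ACTIVE\n"
        "    link_layer:  Ethernet\n"
    )
    for (device, port), layer in link_layers(sample).items():
        note = "RoCE" if layer == "Ethernet" else layer
        print(f"{device} port {port}: {note}")
```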

ShuguiW commented 3 months ago

Hello, below is the `dmesg` error output. Because there is a lot of it, I have also attached a Word file in which the error location is marked in red for easier reading. Following your hint, I tried to switch `link_layer` to IB, but the switch did not succeed; it seems my current environment does not support changing the transport mode. Finally, I tested Ethernet connectivity. I hope to get your test examples running with some configuration changes, or perhaps you could give some hints on code modifications that would allow running over RoCE. I look forward to your feedback; thank you sincerely.

Attachment: dmesg打印信息.docx

[ 1231.613811] perf: interrupt took too long (2516 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
[ 1630.526382] perf: interrupt took too long (3166 > 3145), lowering kernel.perf_event_max_sample_rate to 63000
[ 2191.290286] perf: interrupt took too long (3958 > 3957), lowering kernel.perf_event_max_sample_rate to 50500
[ 4541.191119] perf: interrupt took too long (4951 > 4947), lowering kernel.perf_event_max_sample_rate to 40250
[ 7368.180582] src/lib.rs@29: [INFO ] - Remote fork kernel module assigned ID=0
[ 7368.180585] /home/crow/mitosis-core/mitosis/src/startup.rs@49: [INFO ] - Try to start MITOSIS instance, init global services
[ 7368.180586] /home/crow/mitosis-core/mitosis/src/startup.rs@15: [INFO ] - [check]: use on-demand resume mode.
[ 7368.180587] /home/crow/mitosis-core/mitosis/src/startup.rs@19: [INFO ] - [check]: Parent is using copy-on-write (COW) mode.
[ 7368.180588] /home/crow/mitosis-core/mitosis/src/startup.rs@25: [INFO ] - [check]: Prefetch optimization is enabled, prefetch sz 1.
[ 7368.180589] /home/crow/mitosis-core/mitosis/src/startup.rs@36: [INFO ] - [check]: Not cache remote page table.
[ 7368.180590] /home/crow/mitosis-core/mitosis/src/startup.rs@42: [INFO ] - [check]: Use RDMA's dynamic connected transport for communications.
[ 7368.180591] /home/crow/mitosis-core/mitosis/src/startup.rs@45: [INFO ] - All configuration check passes !
[ 7368.180817] rust-kernel-rdma-base: enabling unsafe global rkey
[ 7368.488917] buf info: panicked at 'should not fail: Creation(-22)', /home/crow/mitosis-core/mitosis/src/rdma_context.rs:50:66
[ 7368.488932] ------------[ cut here ]------------
[ 7368.488934] kernel BUG at src/helpers.c:14!
[ 7368.488944] invalid opcode: 0000 [#1] SMP PTI
[ 7368.488947] Modules linked in: fork(OE+) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx4_ib(OE) ib_core(OE) mlx4_en(OE) mlx4_core(OE) binfmt_misc nls_iso8859_1 kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd input_leds joydev video acpi_pad knem(OE) parport_pc ppdev lp parport autofs4 hid_generic usbhid hid mlx5_core(OE) r8101(OE) ahci mlx_compat(OE) libahci mlxfw(OE) devlink ptp pps_core
[ 7368.488997] CPU: 1 PID: 4844 Comm: insmod Tainted: G OE 4.15.0-46-generic #49-Ubuntu
[ 7368.488999] Hardware name: Colorful Technology And Development Co.,LTD H310M-T PRO/H310M-T PRO, BIOS 5.12 05/16/2019
[ 7368.489264] RIP: 0010:bug_helper+0x9/0x20 [fork]
[ 7368.489267] RSP: 0018:ffffab8002d56720 EFLAGS: 00010282
[ 7368.489271] RAX: 0000000000000071 RBX: ffffab8002d56738 RCX: 0000000000000000
[ 7368.489274] RDX: 0000000000000000 RSI: ffff9a036ed16498 RDI: ffff9a036ed16498
[ 7368.489276] RBP: ffffab8002d56720 R08: 0000000000000001 R09: 0000000000000344
[ 7368.489278] R10: ffffab8002d56580 R11: 0000000000000000 R12: 0000000000000000
[ 7368.489281] R13: ffff9a02ae9d1720 R14: ffffab8002d56c80 R15: 0000000000000001
[ 7368.489284] FS: 00007fe225653540(0000) GS:ffff9a036ed00000(0000) knlGS:0000000000000000
[ 7368.489287] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7368.489290] CR2: 0000557a053811c0 CR3: 00000001814ca002 CR4: 00000000003606e0
[ 7368.489292] Call Trace:
[ 7368.489570] rust_begin_unwind+0x91/0xa0 [fork]
[ 7368.489578] ? __switch_to_asm+0x34/0x70
[ 7368.489604] ? rdma_port_get_link_layer+0x1e/0x50 [ib_core]
[ 7368.489879] ? _ZN4core5slice29$LT$impl$u20$$u5b$T$u5d$$GT$15copy_from_slice17hf82914350afa714dE+0x80/0x80 [fork]
[ 7368.490107] _ZN4core9panicking9panic_fmt17he24d6cc5a36dd1dbE+0x2d/0x30 [fork]
[ 7368.490365] ? _ZN8KRdmaKit11queue_pairs27dynamic_connected_transport22DynamicConnectedTarget17get_datagram_meta17ha37b4d2717f1645dE+0x31/0xf0 [fork]
[ 7368.490586] _ZN4core6result13unwrap_failed17h93cdea133055b12cE+0x6c/0x70 [fork]
[ 7368.490803] ? _ZN44$LT$$RF$T$u20$as$u20$core..fmt..Display$GT$3fmt17h0feed2dd3958df8bE+0x20/0x20 [fork]
[ 7368.491022] ? _ZN42$LT$$RF$T$u20$as$u20$core..fmt..Debug$GT$3fmt17h884188d77116d6bbE+0x100/0x100 [fork]
[ 7368.491233] _ZN7mitosis12rdma_context10start_rdma17h698be4bafd69e08cE+0xa63/0xb10 [fork]
[ 7368.491458] ? _ZN42$LT$$RF$T$u20$as$u20$core..fmt..Debug$GT$3fmt17hf9cb0ff4e10367dcE+0x10/0x10 [fork]
[ 7368.491675] ? _ZN4core3fmt3num3imp52$LT$impl$u20$core..fmt..Display$u20$for$u20$i32$GT$3fmt17h840ec10b2de74afeE+0x20/0x20 [fork]
[ 7368.491895] ? _ZN42$LT$$RF$T$u20$as$u20$core..fmt..Debug$GT$3fmt17hf9cb0ff4e10367dcE+0x10/0x10 [fork]
[ 7368.492122] ? _ZN4core3fmt3num3imp52$LT$impl$u20$core..fmt..Display$u20$for$u20$i32$GT$3fmt17h840ec10b2de74afeE+0x20/0x20 [fork]
[ 7368.492350] _ZN7mitosis7startup12init_mitosis17hcead06ac6449721eE+0x167/0xdf0 [fork]
[ 7368.492591] ? _ZN42$LT$$RF$T$u20$as$u20$core..fmt..Debug$GT$3fmt17hf9cb0ff4e10367dcE+0x10/0x10 [fork]
[ 7368.492797] ? _ZN4core3fmt3num3imp52$LT$impl$u20$core..fmt..Display$u20$for$u20$i32$GT$3fmt17h840ec10b2de74afeE+0x20/0x20 [fork]
[ 7368.493074] ? _ZN83$LT$rust_kernel_linux_util..level..Level$u20$as$u20$core..str..traits..FromStr$GT$8from_str17h30ed5e388135041cE+0x270/0x270 [fork]
[ 7368.493286] ? _ZN57$LT$core..fmt..Arguments$u20$as$u20$core..fmt..Debug$GT$3fmt17hcc1dbed6d991ec2cE+0x60/0x60 [fork]
[ 7368.493530] _ZN7mitosis7startup14start_instance17h29fb7b63e3619507E+0x31/0x250 [fork]
[ 7368.493541] ? kmem_cache_alloc+0xa2/0x1b0
[ 7368.493546] ? mempool_alloc_slab+0x15/0x20
[ 7368.493551] ? wait_woken+0x80/0x80
[ 7368.493556] ? mempool_alloc_slab+0x15/0x20
[ 7368.493560] ? mempool_alloc+0x71/0x190
[ 7368.493564] ? mempool_alloc_slab+0x15/0x20
[ 7368.493569] ? mempool_alloc+0x71/0x190
[ 7368.493574] ? blk_rq_map_sg+0x13e/0x540
[ 7368.493785] ? _ZN4core3fmt9Formatter12pad_integral17hf8e301a155813e6cE+0x106/0x450 [fork]
[ 7368.494037] ? _ZN79$LT$linux_kernel_module..printk..LogLineWriter$u20$as$u20$core..fmt..Write$GT$9write_str17h0f01d8afb6abda8bE+0x96/0x150 [fork]
[ 7368.494044] ? sched_clock+0x9/0x10
[ 7368.494048] ? sched_clock+0x9/0x10
[ 7368.494052] ? up+0x32/0x50
[ 7368.494058] ? irq_work_queue+0x99/0xa0
[ 7368.494062] ? console_unlock+0x2e5/0x4e0
[ 7368.494066] ? vprintk_emit+0x333/0x3a0
[ 7368.494255] ? _ZN4core3ptr61drop_in_place$LT$core..option..Option$LT$fork..Module$GT$$GT$17h8a82d34b02fe66e4E+0x170/0x170 [fork]
[ 7368.494269] ? vprintk_default+0x29/0x50
[ 7368.494273] ? vprintk_func+0x27/0x60
[ 7368.494277] ? printk+0x52/0x6e
[ 7368.494472] init_module+0x203/0x410 [fork]
[ 7368.494684] ? _ZN4core3fmt3num3imp52$LT$impl$u20$core..fmt..Display$u20$for$u20$i64$GT$3fmt17h9e3ac72c4fc3d8eaE+0x30/0x30 [fork]
[ 7368.494691] ? __vunmap+0x71/0xb0
[ 7368.494891] ? _ZN4core3ptr61drop_in_place$LT$core..option..Option$LT$fork..Module$GT$$GT$17h8a82d34b02fe66e4E+0x170/0x170 [fork]
[ 7368.494898] do_one_initcall+0x52/0x19f
[ 7368.494903] ? vunmap+0x81/0xb0
[ 7368.494908] ? _cond_resched+0x19/0x40
[ 7368.494913] ? kmem_cache_alloc_trace+0xa6/0x1b0
[ 7368.494918] ? do_init_module+0x27/0x209
[ 7368.494922] do_init_module+0x5f/0x209
[ 7368.494927] load_module+0x191e/0x1f10
[ 7368.494932] ? ima_post_read_file+0x96/0xa0
[ 7368.494938] SYSC_finit_module+0xfc/0x120
[ 7368.494942] ? SYSC_finit_module+0xfc/0x120
[ 7368.494948] SyS_finit_module+0xe/0x10
[ 7368.494952] do_syscall_64+0x73/0x130
[ 7368.494957] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 7368.494961] RIP: 0033:0x7fe22516d539
[ 7368.494964] RSP: 002b:00007ffe5d031a18 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[ 7368.494968] RAX: ffffffffffffffda RBX: 0000556703c5d7a0 RCX: 00007fe22516d539
[ 7368.494970] RDX: 0000000000000000 RSI: 0000556703c5d260 RDI: 0000000000000003
[ 7368.494973] RBP: 0000556703c5d260 R08: 0000000000000000 R09: 0000000000000000
[ 7368.494975] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000000000
[ 7368.494978] R13: 0000556703c5fe50 R14: 0000000000000000 R15: 0000556703c5d260
[ 7368.494981] Code: 01 00 e8 7b a1 0f d0 55 48 89 e5 48 c7 c6 1f 68 95 c0 48 c7 c2 58 c7 99 c0 e8 24 5e 7d cf 5d c3 00 00 e8 5b a1 0f d0 55 48 89 e5 <0f> 0b 0f 1f 44 00 00 eb fe 00 00 00 00 00 00 00 00 00 00 00 00
[ 7368.495274] RIP: bug_helper+0x9/0x20 [fork] RSP: ffffab8002d56720
[ 7368.495312] ---[ end trace e2c3c428894af0f9 ]---
[ 7571.210530] mlx5_core 0000:01:00.0 enp1s0f0: Link down
[ 7577.752104] mlx5_core 0000:01:00.0 enp1s0f0: Link up


Below is the information related to switching to an IB network. I used `mstconfig`, but found that there is no LINK_PORT option.

(base) ll@ll-System-Product-Name:~/mitosis-core/exp$ lspci -v | grep Mellanox
01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    Subsystem: Mellanox Technologies Device 0008
01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    Subsystem: Mellanox Technologies Device 0008
(base) ll@ll-System-Product-Name:~/mitosis-core/exp$ sudo mstconfig -d 01:00.0 q

Device #1:

Device type: ConnectX5
Name: MCX516A-CDA_Ax_Bx
Description: ConnectX-5 Ex EN network interface card; 100GbE dual-port QSFP28; PCIe4.0 x16; tall bracket; ROHS R6
Device: 01:00.0

Configurations: Next Boot
MEMIC_BAR_SIZE 0
MEMIC_SIZE_LIMIT _256KB(1)
HOST_CHAINING_MODE DISABLED(0)
HOST_CHAINING_DESCRIPTORS Array[0..7]
HOST_CHAINING_TOTAL_BUFFER_SIZE Array[0..7]
FLEX_PARSER_PROFILE_ENABLE 0
FLEX_IPV4_OVER_VXLAN_PORT 0
ROCE_NEXT_PROTOCOL 254
ESWITCH_HAIRPIN_DESCRIPTORS Array[0..7]
ESWITCH_HAIRPIN_TOT_BUFFER_SIZE Array[0..7]
PF_BAR2_SIZE 0
NON_PREFETCHABLE_PF_BAR False(0)
VF_VPD_ENABLE False(0)
STRICT_VF_MSIX_NUM False(0)
VF_NODNIC_ENABLE False(0)
NUM_OF_VFS 0
PF_BAR2_ENABLE False(0)
SRIOV_EN False(0)
PF_LOG_BAR_SIZE 5
VF_LOG_BAR_SIZE 1
NUM_PF_MSIX 63
NUM_VF_MSIX 11
INT_LOG_MAX_PAYLOAD_SIZE AUTOMATIC(0)
PARTIAL_RESET_EN False(0)
SW_RECOVERY_ON_ERRORS False(0)
RESET_WITH_HOST_ON_ERRORS False(0)
ADVANCED_POWER_SETTINGS False(0)
CQE_COMPRESSION BALANCED(0)
IP_OVER_VXLAN_EN False(0)
MKEY_BY_NAME False(0)
ESWITCH_IPV4_TTL_MODIFY_ENABLE False(0)
PRIO_TAG_REQUIRED_EN False(0)
UCTX_EN True(1)
PCI_ATOMIC_MODE PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0)
TUNNEL_ECN_COPY_DISABLE False(0)
LRO_LOG_TIMEOUT0 6
LRO_LOG_TIMEOUT1 7
LRO_LOG_TIMEOUT2 8
LRO_LOG_TIMEOUT3 13
LOG_DCR_HASH_TABLE_SIZE 11
DCR_LIFO_SIZE 16384
ROCE_CC_PRIO_MASK_P1 255
ROCE_CC_ALGORITHM_P1 ECN(0)
ROCE_CC_PRIO_MASK_P2 255
ROCE_CC_ALGORITHM_P2 ECN(0)
CLAMP_TGT_RATE_AFTER_TIME_INC_P1 True(1)
CLAMP_TGT_RATE_P1 False(0)
RPG_TIME_RESET_P1 300
RPG_BYTE_RESET_P1 32767
RPG_THRESHOLD_P1 1
RPG_MAX_RATE_P1 0
RPG_AI_RATE_P1 5
RPG_HAI_RATE_P1 50
RPG_GD_P1 11
RPG_MIN_DEC_FAC_P1 50
RPG_MIN_RATE_P1 1
RATE_TO_SET_ON_FIRST_CNP_P1 0
DCE_TCP_G_P1 1019
DCE_TCP_RTT_P1 1
RATE_REDUCE_MONITOR_PERIOD_P1 4
INITIAL_ALPHA_VALUE_P1 1023
MIN_TIME_BETWEEN_CNPS_P1 4
CNP_802P_PRIO_P1 6
CNP_DSCP_P1 48
CLAMP_TGT_RATE_AFTER_TIME_INC_P2 True(1)
CLAMP_TGT_RATE_P2 False(0)
RPG_TIME_RESET_P2 300
RPG_BYTE_RESET_P2 32767
RPG_THRESHOLD_P2 1
RPG_MAX_RATE_P2 0
RPG_AI_RATE_P2 5
RPG_HAI_RATE_P2 50
RPG_GD_P2 11
RPG_MIN_DEC_FAC_P2 50
RPG_MIN_RATE_P2 1
RATE_TO_SET_ON_FIRST_CNP_P2 0
DCE_TCP_G_P2 1019
DCE_TCP_RTT_P2 1
RATE_REDUCE_MONITOR_PERIOD_P2 4
INITIAL_ALPHA_VALUE_P2 1023
MIN_TIME_BETWEEN_CNPS_P2 4
CNP_802P_PRIO_P2 6
CNP_DSCP_P2 48
LLDP_NB_DCBX_P1 False(0)
LLDP_NB_RX_MODE_P1 OFF(0)
LLDP_NB_TX_MODE_P1 OFF(0)
LLDP_NB_DCBX_P2 False(0)
LLDP_NB_RX_MODE_P2 OFF(0)
LLDP_NB_TX_MODE_P2 OFF(0)
DCBX_IEEE_P1 True(1)
DCBX_CEE_P1 True(1)
DCBX_WILLING_P1 True(1)
DCBX_IEEE_P2 True(1)
DCBX_CEE_P2 True(1)
DCBX_WILLING_P2 True(1)
KEEP_ETH_LINK_UP_P1 True(1)
KEEP_IB_LINK_UP_P1 True(1)
KEEP_LINK_UP_ON_BOOT_P1 False(0)
KEEP_LINK_UP_ON_STANDBY_P1 False(0)
DO_NOT_CLEAR_PORT_STATS_P1 False(0)
KEEP_ETH_LINK_UP_P2 True(1)
KEEP_IB_LINK_UP_P2 False(0)
KEEP_LINK_UP_ON_BOOT_P2 False(0)
KEEP_LINK_UP_ON_STANDBY_P2 False(0)
DO_NOT_CLEAR_PORT_STATS_P2 False(0)
NUM_OF_VL_P1 _4_VLs(3)
NUM_OF_TC_P1 _8_TCs(0)
NUM_OF_PFC_P1 8
NUM_OF_VL_P2 _4_VLs(3)
NUM_OF_TC_P2 _8_TCs(0)
NUM_OF_PFC_P2 8
DUP_MAC_ACTION_P1 LAST_CFG(0)
SRIOV_IB_ROUTING_MODE_P1 LID(1)
IB_ROUTING_MODE_P1 LID(1)
DUP_MAC_ACTION_P2 LAST_CFG(0)
SRIOV_IB_ROUTING_MODE_P2 LID(1)
IB_ROUTING_MODE_P2 LID(1)
PCI_WR_ORDERING per_mkey(0)
MULTI_PORT_VHCA_EN False(0)
PORT_OWNER True(1)
ALLOW_RD_COUNTERS True(1)
RENEG_ON_CHANGE True(1)
TRACER_ENABLE True(1)
IP_VER IPv4(0)
BOOT_UNDI_NETWORK_WAIT 0
UEFI_HII_EN True(1)
BOOT_DBG_LOG False(0)
UEFI_LOGS DISABLED(0)
BOOT_VLAN 1
LEGACY_BOOT_PROTOCOL PXE(1)
BOOT_RETRY_CNT NONE(0)
BOOT_INTERRUPT_DIS False(0)
BOOT_LACP_DIS True(1)
BOOT_VLAN_EN False(0)
BOOT_PKEY 0
ATS_ENABLED False(0)
DYNAMIC_VF_MSIX_TABLE False(0)
EXP_ROM_UEFI_ARM_ENABLE False(0)
EXP_ROM_UEFI_x86_ENABLE False(0)
EXP_ROM_PXE_ENABLE True(1)
ADVANCED_PCI_SETTINGS False(0)
SAFE_MODE_THRESHOLD 10
SAFE_MODE_ENABLE True(1)
(base) ll@ll-System-Product-Name:~/mitosis-core/exp$ sudo mstconfig -d 01:00.0 q | grep LINK_PORT
(no output was printed)


Next, I tested transfers over Ethernet. After manually assigning an IP address to the NIC, the two machines can reach each other.

(base) ll@ll-System-Product-Name:~/mitosis-core/exp$ ibstatus
Infiniband device 'mlx5_0' port 1 status:
    default gid:    fe80:0000:0000:0000:1270:fdff:fe39:0e7a
    base lid:       0x0
    sm lid:         0x0
    state:          4: ACTIVE
    phys state:     5: LinkUp
    rate:           100 Gb/sec (4X EDR)
    link_layer:     Ethernet

Infiniband device 'mlx5_1' port 1 status:
    default gid:    fe80:0000:0000:0000:1270:fdff:fe39:0e7b
    base lid:       0x0
    sm lid:         0x0
    state:          1: DOWN
    phys state:     3: Disabled
    rate:           40 Gb/sec (4X QDR)
    link_layer:     Ethernet

(base) ll@ll-System-Product-Name:~/mitosis-core/exp$ show_gids
DEV     PORT  INDEX  GID                                      IPv4          VER  DEV
---     ----  -----  ---                                      ------------  ---  ---
mlx5_0  1     0      fe80:0000:0000:0000:1270:fdff:fe39:0e7a                v1   enp1s0f0
mlx5_0  1     1      fe80:0000:0000:0000:1270:fdff:fe39:0e7a                v2   enp1s0f0
mlx5_0  1     2      0000:0000:0000:0000:0000:ffff:c0a8:0101  192.168.1.1   v1   enp1s0f0
mlx5_0  1     3      0000:0000:0000:0000:0000:ffff:c0a8:0101  192.168.1.1   v2   enp1s0f0
mlx5_1  1     0      fe80:0000:0000:0000:1270:fdff:fe39:0e7b                v1   enp1s0f1
mlx5_1  1     1      fe80:0000:0000:0000:1270:fdff:fe39:0e7b                v2   enp1s0f1
n_gids_found=6

(base) ll@ll-System-Product-Name:~/mitosis-core/exp$ ib_send_bw -d mlx5_0


(base) crow@crow-H310M-T-PRO:~/mitosis-core/exp$ ib_send_bw -d mlx5_0 192.168.1.1

                Send BW Test

Dual-port       : OFF          Device         : mlx5_0
Number of qps   : 1            Transport type : IB
Connection type : RC           Using SRQ      : OFF
TX depth        : 128
CQ Moderation   : 1
Mtu             : 1024[B]
Link type       : Ethernet
GID index       : 3
Max inline data : 0[B]
rdma_cm QPs     : OFF
Data ex. method : Ethernet

local address: LID 0000 QPN 0x0051 PSN 0x9b7ca3
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:02
remote address: LID 0000 QPN 0x0047 PSN 0xd80157
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:01

#bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
65536   1000         10751.53         6175.40             0.098806
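The GID `0000:0000:0000:0000:0000:ffff:c0a8:0101` listed by `show_gids` is an IPv4-mapped IPv6 address, which is why it is paired with 192.168.1.1. A small sketch using Python's standard `ipaddress` module makes the mapping explicit; `gid_to_ipv4` is a hypothetical helper name:

```python
import ipaddress


def gid_to_ipv4(gid):
    """Return the IPv4 address embedded in an IPv4-mapped GID, or None."""
    # ipv4_mapped is None unless the address has the ::ffff:a.b.c.d form.
    return ipaddress.IPv6Address(gid).ipv4_mapped


if __name__ == "__main__":
    for gid in (
        "0000:0000:0000:0000:0000:ffff:c0a8:0101",  # RoCE GID derived from an IPv4 address
        "fe80:0000:0000:0000:1270:fdff:fe39:0e7a",  # link-local GID, no IPv4 embedded
    ):
        print(gid, "->", gid_to_ipv4(gid))
```

The link-local `fe80:` GIDs carry no IPv4 address, so the helper returns `None` for them.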

Finally, I found that even though `make insmod` clearly failed, the fork module is still listed. I hope this information helps your analysis. The file /dev/mitosis-syscalls was not created, and I keep feeling that this is a fairly critical error.

(base) crow@crow-H310M-T-PRO:~/mitosis-core$ make insmod
sudo rmmod fork ; sudo insmod mitosis-kms/fork.ko mac_id=0
[sudo] password for crow:
rmmod: ERROR: Module fork is not currently loaded
Segmentation fault (core dumped)
makefile:15: recipe for target 'insmod' failed
make: *** [insmod] Error 139

(base) crow@crow-H310M-T-PRO:~/mitosis-core$ lsmod
Module Size Used by
fork 3235840 1
rdma_ucm 28672 0
ib_ucm 20480 0
rdma_cm 57344 1 rdma_ucm
iw_cm 45056 1 rdma_cm
ib_ipoib 176128 0
ib_cm 53248 4 rdma_cm,ib_ipoib,fork,ib_ucm
ib_umad 24576 0
mlx5_ib 393216 0
ib_uverbs 131072 3 rdma_ucm,mlx5_ib,ib_ucm
mlx4_ib 221184 0
ib_core 323584 11 rdma_cm,ib_ipoib,mlx4_ib,fork,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,ib_ucm
mlx4_en 139264 0
mlx4_core 335872 2 mlx4_ib,mlx4_en
binfmt_misc 20480 1
nls_iso8859_1 16384 1
kvm_intel 212992 0
kvm 598016 1 kvm_intel
irqbypass 16384 1 kvm
crct10dif_pclmul 16384 0
crc32_pclmul 16384 0
ghash_clmulni_intel 16384 0
pcbc 16384 0
aesni_intel 188416 0
aes_x86_64 20480 1 aesni_intel
crypto_simd 16384 1 aesni_intel
glue_helper 16384 1 aesni_intel
cryptd 24576 3 crypto_simd,ghash_clmulni_intel,aesni_intel
input_leds 16384 0
joydev 24576 0
video 45056 0
acpi_pad 180224 0
knem 36864 0
parport_pc 36864 0
ppdev 20480 0
lp 20480 0
parport 49152 3 parport_pc,lp,ppdev
autofs4 40960 2
hid_generic 16384 0
usbhid 49152 0
hid 118784 2 usbhid,hid_generic
mlx5_core 1040384 1 mlx5_ib
r8101 196608 0
ahci 40960 3
mlx_compat 40960 14 rdma_cm,ib_ipoib,mlx4_core,mlx4_ib,iw_cm,ib_umad,mlx4_en,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core,ib_ucm
libahci 32768 1 ahci
mlxfw 20480 1 mlx5_core
devlink 45056 4 mlx4_core,mlx4_ib,mlx4_en,mlx5_core
ptp 20480 2 mlx4_en,mlx5_core
pps_core 20480 1 ptp

(base) crow@crow-H310M-T-PRO:~/mitosis-core$ file /dev/mitosis-syscalls
/dev/mitosis-syscalls: cannot open `/dev/mitosis-syscalls' (No such file or directory)

wxdwfc commented 3 months ago

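Hello, I looked at the `dmesg` output. The error is reported at `[ 589.589922] buf info: panicked at 'should not fail: Creation(-22)', /home/ll/mitosis-core/mitosis/src/rdma_context.rs:50:66`; the cause is that DCT creation failed. I am not sure whether your NIC supports DCT, so you may want to check.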
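The `-22` in `Creation(-22)` looks like a negated Linux errno. Assuming that is how the value is produced (an assumption, not something confirmed in this thread), it can be decoded with Python's standard `errno` module; `describe_kernel_error` is a hypothetical helper name:

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel-style return value such as -22 to (name, message)."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names like "EINVAL".
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    name, message = describe_kernel_error(-22)
    print(f"-22 -> {name}: {message}")
```

Under that assumption, `-22` decodes to `EINVAL` (invalid argument), which would be consistent with the device rejecting the creation request.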

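If you do not need the DCT feature, you can use the `use_rc` option in the Kbuild. For example, try this Kbuild file:

https://github.com/ProjectMitosisOS/mitosis-core/blob/main/mitosis-kms/Kbuild-mitosis-use-rc

(See the README for how to use it.)

If that still does not work, the only remaining option is to switch to an IB card (from what I can see, your card does not support IB); that is probably the most convenient route.

P.S. If a kernel panic occurs, I recommend rebooting the machine; otherwise you may run into undefined behavior.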

ShuguiW commented 3 months ago

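OK, thank you very much for your help. It does indeed seem to be a problem with my hardware. Sorry for all the trouble over the past while. Thanks again!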