Terabit-Ethernet / dcPIM

BSD 3-Clause "New" or "Revised" License

dcpim_module.ko loading error. #27

Closed bgautham4 closed 2 months ago

bgautham4 commented 2 months ago

Hello @qizhe, I followed the instructions in the README.md for kernel_impl, using Ubuntu 20.04 with kernel version 6.0.3 on a virtual machine. The kernel and the module build with no errors. However, when I load the module with sudo insmod dcpim_module.ko, the only output I get is Killed.

Here is the output from dmesg:

[   76.504752] dcpim_module: loading out-of-tree module taints kernel.
[   76.504820] dcpim_module: module verification failed: signature and/or required key missing - tainting kernel
[   76.508584] DCPIM module loading
[   76.509609] BUG: unable to handle page fault for address: 00003e60c3d30500
[   76.509635] #PF: supervisor read access in kernel mode
[   76.509647] #PF: error_code(0x0000) - not-present page
[   76.509659] PGD 0 P4D 0 
[   76.509668] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   76.509681] CPU: 10 PID: 1853 Comm: insmod Tainted: G           OE      6.0.3 #1
[   76.509696] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 3.20230228-4 06/06/2023
[   76.509713] RIP: 0010:__queue_work+0x9e/0x430
[   76.509727] Code: c5 e0 8a 6d a3 4a 63 04 30 83 f8 ff 0f 84 c8 02 00 00 48 83 c0 22 4c 89 ef 49 8b 1c c4 e8 ca f7 ff ff 48 89 c7 48 85 c0 75 38 <48> 8b 3b eb 3c 81 7d d4 00 20 00 00 0f 84 e3 01 00 00 49 63 c7 4d
[   76.509755] RSP: 0018:ffffaaa503df7bf0 EFLAGS: 00010046
[   76.509767] RAX: 0000000000000000 RBX: 00003e60c3d30500 RCX: 0000000000000000
[   76.509781] RDX: ffff8c3cdb9be500 RSI: 000000007fffffff RDI: 0000000000000000
[   76.509795] RBP: ffffaaa503df7c30 R08: 00003e611fc07500 R09: 0000000000000000
[   76.509809] R10: ffffffffa4101960 R11: 0000000000000010 R12: ffff8c3ccc054400
[   76.509822] R13: ffffffffc0882628 R14: 000000000002db2c R15: 000000000000003c
[   76.509837] FS:  00007f4d30696740(0000) GS:ffff8c43dfc80000(0000) knlGS:0000000000000000
[   76.509851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   76.509863] CR2: 00003e60c3d30500 CR3: 000000011cd46000 CR4: 0000000000350ee0
[   76.509880] Call Trace:
[   76.509888]  <TASK>
[   76.509897]  queue_work_on+0x3e/0x50
[   76.509909]  dcpim_epoch_init+0x350/0x360 [dcpim_module]
[   76.509927]  ? dcpimv4_offload_init+0x20/0x20 [dcpim_module]
[   76.509944]  dcpim_load+0x144/0xf85 [dcpim_module]
[   76.509958]  ? dcpimv4_offload_init+0x20/0x20 [dcpim_module]
[   76.509972]  do_one_initcall+0x48/0x210
[   76.509984]  ? kmem_cache_alloc_trace+0x3d9/0x460
[   76.509998]  do_init_module+0x52/0x210
[   76.510008]  load_module+0x1ee8/0x20a0
[   76.510018]  ? ima_post_read_file+0xd2/0xe0
[   76.510034]  __do_sys_finit_module+0xc8/0x140
[   76.510044]  ? __do_sys_finit_module+0xc8/0x140
[   76.510057]  __x64_sys_finit_module+0x1a/0x20
[   76.510067]  do_syscall_64+0x3b/0x90
[   76.510078]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   76.510090] RIP: 0033:0x7f4d3012695d
[   76.510100] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 03 35 0d 00 f7 d8 64 89 01 48
[   76.510129] RSP: 002b:00007ffec0650828 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   76.510144] RAX: ffffffffffffffda RBX: 0000558b93924790 RCX: 00007f4d3012695d
[   76.510158] RDX: 0000000000000000 RSI: 0000558b9256e358 RDI: 0000000000000003
[   76.510171] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f4d301fd580
[   76.510184] R10: 0000000000000003 R11: 0000000000000246 R12: 0000558b9256e358
[   76.510197] R13: 0000000000000000 R14: 0000558b93924750 R15: 0000000000000000
[   76.510211]  </TASK>
[   76.510219] Modules linked in: dcpim_module(OE+) nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua binfmt_misc intel_rapl_msr intel_rapl_common sch_fq_codel kvm_amd ccp joydev input_leds kvm serio_raw mac_hid qemu_fw_cfg msr ramoops reed_solomon efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid bochs drm_vram_helper drm_ttm_helper ttm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops crypto_simd cryptd virtio_net drm psmouse net_failover failover virtio_scsi i2c_piix4 pata_acpi floppy
[   76.511970] CR2: 00003e60c3d30500
[   76.512318] ---[ end trace 0000000000000000 ]---
[   76.906098] RIP: 0010:__queue_work+0x9e/0x430
[   76.906656] Code: c5 e0 8a 6d a3 4a 63 04 30 83 f8 ff 0f 84 c8 02 00 00 48 83 c0 22 4c 89 ef 49 8b 1c c4 e8 ca f7 ff ff 48 89 c7 48 85 c0 75 38 <48> 8b 3b eb 3c 81 7d d4 00 20 00 00 0f 84 e3 01 00 00 49 63 c7 4d
[   76.907412] RSP: 0018:ffffaaa503df7bf0 EFLAGS: 00010046
[   76.907778] RAX: 0000000000000000 RBX: 00003e60c3d30500 RCX: 0000000000000000
[   76.908136] RDX: ffff8c3cdb9be500 RSI: 000000007fffffff RDI: 0000000000000000
[   76.908493] RBP: ffffaaa503df7c30 R08: 00003e611fc07500 R09: 0000000000000000
[   76.908872] R10: ffffffffa4101960 R11: 0000000000000010 R12: ffff8c3ccc054400
[   76.909214] R13: ffffffffc0882628 R14: 000000000002db2c R15: 000000000000003c
[   76.909557] FS:  00007f4d30696740(0000) GS:ffff8c43dfc80000(0000) knlGS:0000000000000000
[   76.909902] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   76.910246] CR2: 00003e60c3d30500 CR3: 000000011cd46000 CR4: 0000000000350ee0

qizhe commented 2 months ago

Hi,

The current implementation of dcPIM has been developed and tested on bare-metal machines. We haven't started evaluating it in a VM setup.

bgautham4 commented 2 months ago

Thanks for the quick reply! I also tried building this module on a bare-metal machine running Ubuntu 20.04 with kernel version 6.0.3 on the following hardware:

qizhe commented 2 months ago

I see. I think the issue relates to line 470 in dcpim_matching.c: epoch->cpu = 60;. I assume you don't have a core 60 in your server? You can change this number to the CPU core you want to run the matching logic on. Ideally, the matching CPU core should be in the same NUMA node (or CPU socket) as the application core. In a later version, I will make this a tunable parameter so you won't need to change the code, and I will include this information in the README.
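
To illustrate the suggestion above, here is a minimal user-space sketch of the kind of bounds check that would avoid handing a nonexistent core to the work queue. It is not taken from the dcPIM source: the helper names pick_matching_cpu and online_cpus are hypothetical, and falling back to CPU 0 is an assumption made only for illustration. In kernel code the analogous check would more likely rely on helpers such as cpu_online(). As a side note, R15 in the register dump above holds 0x3c, which is 60 in decimal, so it may well be the same hard-coded value.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper (not part of dcPIM): return `requested` when it names
 * a CPU in the range [0, nr_cpus); otherwise fall back to CPU 0.
 * The fallback choice is an assumption made for this sketch. */
int pick_matching_cpu(int requested, int nr_cpus)
{
    assert(nr_cpus > 0);  /* caller must report at least one CPU */
    if (requested < 0 || requested >= nr_cpus)
        return 0;
    return requested;
}

/* Hypothetical helper: number of CPUs currently online, as seen from user
 * space. Returns 1 if the query fails. */
int online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}
```

For example, pick_matching_cpu(60, online_cpus()) returns 60 only on a machine with more than 60 online CPUs and returns 0 otherwise. A smaller machine or VM would therefore never be asked to queue work on core 60.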

bgautham4 commented 2 months ago

Thanks! I am now able to load the kernel module on both the VM and bare metal. Will close the issue now.