AIFM-sys / AIFM

AIFM: High-Performance, Application-Integrated Far Memory
MIT License

Met with issues when starting the tcp_device_server. #1

Closed albertghtoun closed 3 years ago

albertghtoun commented 3 years ago

Hello,

I am trying to reproduce the results for Figure 6a using the code under /exp/fig6a/linux_mem/. However, I cannot successfully run the run_mem_server bash function, and it turned out that this is because tcp_device_server did not start successfully. I tried running tcp_device_server directly on the remote memory node, without going through ssh, using the following command, and got a "failed to map ingress region" error. Can anyone help me resolve this problem? Thanks!

albertgh@node-1:~/AIFM/aifm/exp$ ulimit -s 65536; /users/albertgh/AIFM/aifm/bin/tcp_device_server /users/albertgh/AIFM/aifm/configs/server.config 8000
CPU 09| <5> cpu: detected 20 cores, 1 nodes
CPU 09| <5> time: detected 2394 ticks / us
[  0.000763] CPU 09| <5> loading configuration from '/users/albertgh/AIFM/aifm/configs/server.config'
[  0.000808] CPU 09| <3> < 1 guaranteed kthreads is not recommended for networked apps
[  0.000849] CPU 09| <2> control_setup: failed to map ingress region

zainryan commented 3 years ago

Hi! I just retried the script on my servers and it works fine. I guess you have another tcp_device_server process running on your memory node, which prevents the new one from mapping the reserved hugepages. Please kill all existing tcp_device_server processes (or simply reboot your server). In addition, you have to run tcp_device_server with root privileges (via sudo or as root).

Please let me know whether it works.
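
One way to check the hugepage hypothesis above is to read the kernel's hugepage counters. The following is a minimal sketch, assuming a Linux host where /proc/meminfo exposes HugePages_Total and HugePages_Free; the helper name meminfo_field is hypothetical and is not part of AIFM or Shenango.

```shell
#!/bin/sh
# Print the value of one field (e.g. HugePages_Free) from a meminfo-style file.
# The second argument is optional and defaults to /proc/meminfo.
meminfo_field() {
    local field
    local file
    field="$1"
    file="${2:-/proc/meminfo}"
    # Lines in /proc/meminfo look like "HugePages_Free:     8192".
    awk -v key="$field:" '$1 == key { print $2 }' "$file"
}

# Example usage on the memory node (uncomment to run):
# echo "total: $(meminfo_field HugePages_Total)"
# echo "free:  $(meminfo_field HugePages_Free)"
```

If HugePages_Free is well below HugePages_Total before tcp_device_server is launched, another process may still be holding the reserved pages.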

albertghtoun commented 3 years ago

Hi!

Thanks for the response! I tried your suggestion: I rebooted the memory machine and reconfigured Shenango by running

sudo ./scripts/setup_machine.sh

after the reboot. I also changed my command to use sudo, like this:

ulimit -s 65536; sudo /users/albertgh/AIFM/aifm/bin/tcp_device_server /users/albertgh/AIFM/aifm/configs/server.config 8000

However, the problem is the same as before and the runtime fails to start. What other reasons could there be for this problem? If you need more information from me to help reproduce it, please let me know. Thanks!

Best

albertghtoun commented 3 years ago

Hi,

I think I just solved the previous problem. It was because I also needed to restart iokerneld on the memory node after the reboot. Now a new problem arises:

albertgh@node-1:~/AIFM/aifm/exp/fig6a/linux_mem$ ulimit -s 65536; /users/albertgh/AIFM/aifm/bin/tcp_device_server /users/albertgh/AIFM/aifm/configs/server.config 8000
CPU 05| <5> cpu: detected 20 cores, 1 nodes
CPU 05| <5> time: detected 2394 ticks / us
[  0.000678] CPU 05| <5> loading configuration from '/users/albertgh/AIFM/aifm/configs/server.config'
[  0.000722] CPU 05| <3> < 1 guaranteed kthreads is not recommended for networked apps
[  0.019179] CPU 05| <5> net: started network stack
[  0.019194] CPU 05| <5> net: using the following configuration:
[  0.019196] CPU 05| <5>   addr:        128.110.154.114
[  0.019202] CPU 05| <5>   netmask:     255.255.252.0
[  0.019206] CPU 05| <5>   gateway:     128.110.152.1
[  0.019210] CPU 05| <5>   mac: 72:36:D9:57:5E:8C
[  0.252468] CPU 05| <2> mlx5_init: Couldn't get context for mlx5_3 (errno 9)
failed to start runtime

I checked the status of my RDMA device with the following command:

ibv_devinfo
hca_id: mlx5_3
        transport:                      InfiniBand (0)
        fw_ver:                         14.18.2030
        node_guid:                      9cdc:71ff:ff56:8f45
        sys_image_guid:                 9cdc:71ff:ff56:8f44
        vendor_id:                      0x02c9
        vendor_part_id:                 4117
        hw_ver:                         0x0
        board_id:                       HP_2420110034
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

It looks good to me. Before this, I installed the Mellanox OFED dependency as described in README.md. Do you have any suggestions? Thanks!

zainryan commented 3 years ago

Hi, as I mentioned in my previous post, you have to launch the program with root privileges, e.g., using sudo. Please let me know if you still encounter this error.

BTW, you can run the test script mentioned in the README to quickly check whether your environment is set up correctly. If everything passes, the error is very likely caused by incorrect commands. You can check the test script for details on how to launch things correctly.

albertghtoun commented 3 years ago

OK, thanks! The problem is solved, just as you said.
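
The three conditions that came up in this thread (no leftover tcp_device_server, iokerneld running after a reboot, and root privileges) can be checked before launching. The following is a minimal sketch, not part of AIFM or Shenango; the function names proc_running and preflight are hypothetical, and the process names are taken from the commands and comments above.

```shell
#!/bin/sh
# Return 0 if any running process was started from an executable whose
# basename equals $1. The full command line is inspected because the
# kernel truncates the short process name to 15 characters, which is
# shorter than "tcp_device_server".
proc_running() {
    ps -eo args= | awk -v name="$1" '
        { n = split($1, parts, "/"); if (parts[n] == name) found = 1 }
        END { exit !found }'
}

# Report every failed precondition; return 0 only if all of them hold.
preflight() {
    local status
    status=0
    if [ "$(id -u)" -ne 0 ]; then
        echo "not running as root: launch with sudo" >&2
        status=1
    fi
    if ! proc_running iokerneld; then
        echo "iokerneld is not running: start it again after a reboot" >&2
        status=1
    fi
    if proc_running tcp_device_server; then
        echo "a tcp_device_server process is already running: kill it first" >&2
        status=1
    fi
    return "$status"
}
```

Calling preflight from a root shell on the memory node right before starting tcp_device_server prints one line per unmet condition and returns a non-zero status if any check fails.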