albertghtoun closed this 3 years ago
Hi! I just retried the script on my servers and it works fine. My guess is that another tcp_device_server process is still running on your memory node, which prevents the new one from mapping the reserved hugepages. Please kill all existing tcp_device_server processes (or simply reboot your server). In addition, you have to run tcp_device_server with root privileges (via sudo or as root).
Please let me know whether it works.
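For anyone who wants to check both conditions before launching, here is a minimal pre-flight sketch. It is not part of AIFM: the function names `stale_pids` and `is_root` are made up for illustration, and the process lookup assumes a Linux `/proc` filesystem. It matches on the basename of each process's argv[0] because tools that compare against the kernel's short process name can miss names as long as tcp_device_server.

```shell
# Hypothetical helpers (not part of AIFM); assumes Linux /proc.

# Print the PID of every process whose argv[0] basename equals $1.
stale_pids() {
    name="$1"
    for d in /proc/[0-9]*; do
        # Skip entries we cannot read (process exited, restricted /proc).
        [ -r "$d/cmdline" ] || continue
        # cmdline is NUL-separated; the first field is argv[0].
        argv0="$(tr '\0' '\n' 2>/dev/null < "$d/cmdline" | head -n 1)"
        if [ "${argv0##*/}" = "$name" ]; then
            echo "${d#/proc/}"
        fi
    done
}

# Succeed only when the effective user is root.
is_root() {
    [ "$(id -u)" -eq 0 ]
}
```

If `stale_pids tcp_device_server` prints anything, kill those PIDs (with sudo) before starting a new server, and if `is_root` fails, rerun the launch command under sudo.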
Hi!
Thanks for the response! I tried your suggestion: I rebooted the memory machine and reconfigured Shenango by running
sudo ./scripts/setup_machine.sh
after the reboot. I also changed my command to use sudo, like this:
ulimit -s 65536; sudo /users/albertgh/AIFM/aifm/bin/tcp_device_server /users/albertgh/AIFM/aifm/configs/server.config 8000
However, the problem is the same as before and the runtime still fails to start. What else could be causing this? If you need more information from me to reproduce the problem, please let me know. Thanks!
Best
Hi,
I think I just solved the previous problem. It happened because I also needed to restart iokerneld on the memory node after the reboot. Now a new problem arises:
albertgh@node-1:~/AIFM/aifm/exp/fig6a/linux_mem$ ulimit -s 65536; /users/albertgh/AIFM/aifm/bin/tcp_device_server /users/albertgh/AIFM/aifm/configs/server.config 8000
CPU 05| <5> cpu: detected 20 cores, 1 nodes
CPU 05| <5> time: detected 2394 ticks / us
[ 0.000678] CPU 05| <5> loading configuration from '/users/albertgh/AIFM/aifm/configs/server.config'
[ 0.000722] CPU 05| <3> < 1 guaranteed kthreads is not recommended for networked apps
[ 0.019179] CPU 05| <5> net: started network stack
[ 0.019194] CPU 05| <5> net: using the following configuration:
[ 0.019196] CPU 05| <5> addr: 128.110.154.114
[ 0.019202] CPU 05| <5> netmask: 255.255.252.0
[ 0.019206] CPU 05| <5> gateway: 128.110.152.1
[ 0.019210] CPU 05| <5> mac: 72:36:D9:57:5E:8C
[ 0.252468] CPU 05| <2> mlx5_init: Couldn't get context for mlx5_3 (errno 9)
failed to start runtime
I checked the status of my RDMA device with the following command:
ibv_devinfo
hca_id: mlx5_3
transport: InfiniBand (0)
fw_ver: 14.18.2030
node_guid: 9cdc:71ff:ff56:8f45
sys_image_guid: 9cdc:71ff:ff56:8f44
vendor_id: 0x02c9
vendor_part_id: 4117
hw_ver: 0x0
board_id: HP_2420110034
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
It looks good to me. Before this, I installed the Mellanox OFED dependency as described in the README.md. Do you have any suggestions? Thanks!
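As a side note, the fields worth checking in that output can be pulled out with a small filter. This is only a sketch: `port_field` is a made-up helper name, and it assumes the `key: value` layout shown in the ibv_devinfo output above.

```shell
# Hypothetical helper: read ibv_devinfo-style text on stdin and print the
# first word of the value for the field named in $1 (one line per match).
port_field() {
    awk -v key="$1:" '$1 == key { print $2 }'
}
```

For example, `ibv_devinfo | port_field state` should print the port state for each port, and `ibv_devinfo | port_field link_layer` the link layer. A healthy-looking port here does not rule out a permissions problem when the runtime opens the device.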
Hi, as I mentioned in the previous post, you have to launch the program with root privileges, e.g., using sudo. Please let me know if you still encounter this error.
BTW, you can run the test script mentioned in the README to quickly check whether your environment is set up correctly. If everything passes, the error is very likely caused by the wrong commands. You can check the test script for details on how to launch things correctly.
OK. Thanks! The problem is solved, just as you mentioned.
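To summarize the thread for future readers, the two failures and the fixes that worked here can be written down as a small lookup. This is a sketch only: `diagnose` is a made-up name, the matched strings are the log lines quoted in this thread, and other setups may fail for different reasons.

```shell
# Hypothetical helper: map a runtime log line to the fix reported in this thread.
diagnose() {
    case "$1" in
        *"failed to map ingress region"*)
            echo "restart iokerneld on the memory node, kill stale tcp_device_server processes, run with sudo" ;;
        *"mlx5_init: Couldn't get context"*)
            echo "run tcp_device_server with sudo" ;;
        *)
            echo "unknown: run the README test script to check the environment" ;;
    esac
}
```

For example, passing the last log line before "failed to start runtime" to `diagnose` prints the corresponding suggestion.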
Hello,
I am trying to reproduce the results for Figure 6a using the code under /exp/fig6a/linux_mem/. However, I cannot run the run_mem_server bash function successfully, and it turned out that this is because tcp_device_server fails to start. I then ran tcp_device_server directly on the remote memory node, without going through ssh, using the command below, and got the "failed to map ingress region" error. Can anyone help me resolve this problem? Thanks!
albertgh@node-1:~/AIFM/aifm/exp$ ulimit -s 65536; /users/albertgh/AIFM/aifm/bin/tcp_device_server /users/albertgh/AIFM/aifm/configs/server.config 8000
CPU 09| <5> cpu: detected 20 cores, 1 nodes
CPU 09| <5> time: detected 2394 ticks / us
[ 0.000763] CPU 09| <5> loading configuration from '/users/albertgh/AIFM/aifm/configs/server.config'
[ 0.000808] CPU 09| <3> < 1 guaranteed kthreads is not recommended for networked apps
[ 0.000849] CPU 09| <2> control_setup: failed to map ingress region