Closed MiraHyc closed 5 months ago
Hi MiraHyc, thanks for your attention!
The master node can be any node in the r650 cluster you build. You can change the master_ip parameter to the IP address of any node you like in the cluster.
In CloudLab, some r650 nodes can connect to other nodes without any settings. For nodes without this capability, you can generate an SSH key pair on the node using ssh-keygen and copy the public key to all the other nodes. Then, the node can directly establish SSH connections to the other nodes without requiring a password.
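As a rough illustration of the key setup described above, here is a small Python sketch that only builds the shell commands and does not run them. The user name, host addresses, and key path are hypothetical placeholders, and ssh-copy-id is just one possible way to copy the public key (it needs some existing way to log in to the target node).

```python
def ssh_setup_commands(hosts, user, key_path="~/.ssh/id_rsa"):
    """Return shell commands for password-less SSH from this node to `hosts`."""
    # Generate a key pair with an empty passphrase (-N '') at key_path (-f).
    commands = [f"ssh-keygen -t rsa -N '' -f {key_path}"]
    for host in hosts:
        # Append the public key to ~/.ssh/authorized_keys on each remote node.
        commands.append(f"ssh-copy-id -i {key_path}.pub {user}@{host}")
    return commands


# Hypothetical cluster addresses, used only for illustration.
for command in ssh_setup_commands(["10.10.1.2", "10.10.1.3"], user="alice"):
    print(command)
```

After running the printed commands on the chosen master node, `ssh alice@10.10.1.2` should log in without asking for a password.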
Hopefully this information will help you! @MiraHyc
Thank you very much. I am currently trying to reproduce the results on my own machine, but a "mmap failed" error occurs when I run a YCSB test. How can this problem be solved? Thanks again for your time and patience!
Hi MiraHyc,
The "mmap failed" error may occur when there are not enough huge pages. To solve this, you can try the following command (run as root):
echo 36864 > /proc/sys/vm/nr_hugepages
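To check whether the setting took effect, the huge page counters in /proc/meminfo can be inspected. The following is a minimal Python sketch, assuming the usual HugePages_Total, HugePages_Free and Hugepagesize fields. The helper names and the sample text are made up for illustration; on a real machine, pass the contents of /proc/meminfo instead.

```python
def parse_hugepages(meminfo_text):
    """Extract the huge page counters from /proc/meminfo-style text."""
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            # Values look like "36864" or "2048 kB"; keep only the number.
            fields[key] = int(value.split()[0])
    return fields


def hugepage_pool_gib(fields):
    """Size of the reserved huge page pool in GiB (Hugepagesize is in kB)."""
    return fields["HugePages_Total"] * fields["Hugepagesize"] / (1024 * 1024)


# Illustrative sample, not the output of a real machine.
SAMPLE = """MemTotal:       131072000 kB
HugePages_Total:   36864
HugePages_Free:    36864
Hugepagesize:       2048 kB
"""

info = parse_hugepages(SAMPLE)
print(info, hugepage_pool_gib(info), "GiB")
```

With 2 MiB huge pages, 36864 pages correspond to 72 GiB. If HugePages_Total ends up lower than the requested value, the kernel probably could not reserve that much memory.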
Hopefully this information will help you! @MiraHyc
Yes. Thank you very much.
Sometimes when I run the test, some threads cannot bind to cores, and "can't bind core" appears in the terminal together with the other output. Why does this happen?
Reopened #12.
This may happen when binding to a core with an ID that exceeds the number of physical cores.
In our code, we calculate the ID of the last core using the CPU_PHYSICAL_CORE_NUM parameter:
https://github.com/dmemsys/SMART/blob/1274148c849af987eb7e9815da007af8403d4155/src/Directory.cpp#L32
Thus, you should update the value of CPU_PHYSICAL_CORE_NUM to the actual number of physical CPU cores of your own machine:
https://github.com/dmemsys/SMART/blob/1274148c849af987eb7e9815da007af8403d4155/include/Common.h#L23
@MiraHyc
I updated the value of CPU_PHYSICAL_CORE_NUM, but the situation doesn't change.
Could you please tell me how many threads you launch, and the number of physical cores of your machine (using a command like grep "physical id" /proc/cpuinfo | sort | uniq -c | awk '{print $1}')?
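Note that in /proc/cpuinfo the "physical id" field identifies the socket (CPU package), while "core id" identifies a core within that socket. As a hedged sketch, the Python snippet below counts sockets, physical cores (unique pairs of physical id and core id), and logical CPUs. The function name and the synthetic sample (2 sockets, 2 cores each, 2 hardware threads per core) are made up for illustration; on a real machine, pass the contents of /proc/cpuinfo.

```python
def cpu_topology(cpuinfo_text):
    """Count sockets, physical cores and logical CPUs in /proc/cpuinfo text."""
    sockets, cores = set(), set()
    logical = 0
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1  # one entry per logical CPU
        elif key == "physical id":
            physical_id = value  # socket of the current entry
            sockets.add(value)
        elif key == "core id":
            # A physical core is identified by (socket, core id).
            cores.add((physical_id, value))
    return {
        "sockets": len(sockets),
        "physical_cores": len(cores),
        "logical_cpus": logical,
    }


# Synthetic sample: 2 sockets x 2 cores x 2 hardware threads.
layout = [(s, c) for s in range(2) for c in range(2) for _ in range(2)]
SAMPLE = "\n".join(
    f"processor\t: {cpu}\nphysical id\t: {sock}\ncore id\t\t: {core}\n"
    for cpu, (sock, core) in enumerate(layout)
)

print(cpu_topology(SAMPLE))
```

On a machine with hyper-threading, the number of logical CPUs is larger than the number of physical cores, so the three counts should not be mixed up when setting CPU_PHYSICAL_CORE_NUM.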
@MiraHyc
Oh, I only have 2 physical CPUs, but I have 32 CPU cores. Why should we use the number of physical CPUs instead of logical CPUs?
I changed the number of physical CPUs to 2 and launched 24 threads, and some threads still cannot bind to cores.
We bind each thread to one physical core so that its performance is not affected by context switching, and thus each thread can serve as an individual client. Besides, we only bind to CPUs in the NUMA node close to the RNIC for better performance: https://github.com/dmemsys/SMART/blob/1274148c849af987eb7e9815da007af8403d4155/test/ycsb_test.cpp#L207
Therefore, if the machine has 32 physical cores and 2 NUMA nodes, you should change the value of CPU_PHYSICAL_CORE_NUM to 32, and can only launch 16 client threads at most.
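The arithmetic behind this limit can be sketched as follows. This is only an illustration, assuming the physical cores are split evenly across the NUMA nodes and that threads are bound only to the node close to the RNIC; the function names are hypothetical and not part of the SMART code.

```python
def max_client_threads(physical_cores, numa_nodes):
    """Upper bound on client threads when binding to a single NUMA node."""
    if physical_cores <= 0 or numa_nodes <= 0:
        raise ValueError("core and NUMA node counts must be positive")
    # Assumption: cores are evenly distributed over the NUMA nodes.
    return physical_cores // numa_nodes


def threads_fit(threads, physical_cores, numa_nodes):
    """True if every thread can get its own physical core on one NUMA node."""
    return threads <= max_client_threads(physical_cores, numa_nodes)


print(max_client_threads(32, 2))  # 32 cores over 2 NUMA nodes
print(threads_fit(24, 32, 2))     # 24 threads on such a machine
```

For the example above, 32 physical cores over 2 NUMA nodes leave 16 cores per node, so launching 24 threads would exceed the limit.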
Since I still cannot figure out the CPU information of your machine, could you please provide the entire output of the commands cat /proc/cpuinfo and numactl --hardware, respectively?
When I try to debug the code, I encounter a "can not support device memory" problem. Could you please tell me how I can fix it? Thank you very much.
Did you turn off the write combining technique (i.e., the WRITE_COMBINING compile option)? When it is turned off, we adopt the HOCL design of Sherman, which leverages the on-chip memory of RNICs. Since your RNICs do not support on-chip memory, it raises the "can not support device memory" error.
To solve this problem, just turn on the write combining technique by compiling the code with cmake -DWRITE_COMBINING=on .. or use RNICs that have on-chip memory.
@MiraHyc
Yes, but now I encounter a timeout problem.
You should TURN ON the WRITE_COMBINING option with cmake -DWRITE_COMBINING=on .. if your RNIC cannot support device memory.
As for the time-out problem, could you please provide more details? Which workload did you run? What is the entire command you executed? What is the entire error message?
I'd appreciate it if you could provide the details so that I can help you locate the bug. @MiraHyc
I just ran fig.3a in the exp folder with the default settings and only changed the write combining option. YCSB C was run. The whole error message is as below.
I still cannot see the whole message. It seems that the picture you uploaded fails to show on GitHub. @MiraHyc
Yes, I see, but I don't know why, so I am copying the error below:
Error! Retry... Function all_long_execute (args=(<utils.cmd_manager.CMDManager object at 0x7feb470c0710>, 'cd /home/zjh/SMART/build && ./ycsb_test 2 4 2 email c', 2)) (kwargs={}) timed out after 600.000000 seconds.
It seems that you have changed the parameters in exp/params/fig_3a.json. Please show me the changes.
If you have also modified the script exp/fig_3a.py (to change the write combining option), please show me the changes too.
By the way, have you downloaded the email workloads?
@MiraHyc
I just changed the CN number in the JSON file as below:
"client_num": [[2, 4], [2, 8], [2, 16], [2, 24], [2, 32]],
And I added one line in fig_3a.py as below:
cmake_option = cmake_options[method].replace('-DWRITE_COMBINING=off', '-DWRITE_COMBINING=on')
Yes, I downloaded the email workload and put it in the ycsb directory.
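One thing worth checking with this kind of change is that str.replace silently returns the string unchanged when the searched substring is absent. A small Python sketch of a stricter variant follows; the helper name and the example option strings are hypothetical and are not taken from exp/fig_3a.py.

```python
def enable_write_combining(cmake_option):
    """Switch -DWRITE_COMBINING=off to =on, failing loudly if it is missing."""
    old, new = "-DWRITE_COMBINING=off", "-DWRITE_COMBINING=on"
    if old not in cmake_option and new not in cmake_option:
        # Plain str.replace would return the input unchanged here.
        raise ValueError(f"no WRITE_COMBINING flag in: {cmake_option!r}")
    return cmake_option.replace(old, new)


# Hypothetical option string, used only for illustration.
print(enable_write_combining("-DSOME_OTHER_OPTION=on -DWRITE_COMBINING=off"))
```

Printing the final cmake options before the build starts is another simple way to confirm that the flag was really switched.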
I find that it is caused by the implementation of ROWEX in naive ART, where we adopt the HOCL of Sherman for better performance: https://github.com/dmemsys/SMART/blob/1274148c849af987eb7e9815da007af8403d4155/src/Tree.cpp#L590C1-L596C5
Since the HOCL requires acquiring locks inside the on-chip memory, which is unavailable in your RNICs, the lock operation fails and the program is blocked. So the naive ART (fig.3a) code cannot run on your testbed. Therefore, I suggest you just run the SMART code with all options enabled or change your testbed (e.g., use CloudLab).
@MiraHyc
I use a Mellanox ConnectX-4 NIC for the experiment. I couldn't find any information about on-chip memory in the data sheet, but I think these NICs are usually equipped with it. What else could contribute to this error?
Mellanox ConnectX-5 NICs or above are needed to access the on-chip memory, as mentioned in the repo of Sherman. If you don't want to use CloudLab (which I really recommend), you should modify the HOCL code to resolve the interface incompatibility.
@MiraHyc
Thank you so much. I will try to use CloudLab. I gave up on this platform earlier because the connection is not very stable and nodes have to be reserved in advance, which is not so convenient for me.
You're welcome.
Could you please provide the exact type of RNIC you use, such as the specific model listed on the official website? We really want to reproduce your results in our own experiment environment.
I wonder how the master node is set up in the experiment. As you say, the master node of the r650 cluster is a node that can directly establish SSH connections to the other nodes. So how can I make a node able to connect to the other nodes in CloudLab? Is it OK if I just change the master_ip parameter in the code?
Thank you!