dmemsys / SMART

This is the implementation repository of our OSDI'23 paper: SMART: A High-Performance Adaptive Radix Tree for Disaggregated Memory.
MIT License

How to set up the master node #12

Closed MiraHyc closed 5 months ago

MiraHyc commented 8 months ago

I wonder how the master node is set up in the experiment. As you say, the master node of the r650 cluster is the node that can directly establish SSH connections to the other nodes. So how can I set up a node that can connect to the other nodes in CloudLab? Is it OK if I just change the master_ip parameter in the code?

Thank you!

River861 commented 7 months ago

Hi MiraHyc, thanks for your attention!

The master node can be any node in the r650 cluster you build. You can change the master_ip parameter to the IP address of any node you like in the cluster.

In CloudLab, some r650 nodes can connect to other nodes without any settings. For nodes without this capability, you can generate an SSH key pair on the node using ssh-keygen and copy the public key to all the other nodes. Then, the node can directly establish SSH connections to the other nodes without requiring a password.
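The key setup described above can be sketched as a small helper that only builds the commands to run on the master node; it does not execute anything. The host addresses, the user name, and the key path are placeholders, and the sketch assumes that ssh-copy-id is installed and that the master node already has some way to log in to the other nodes once (for example, with a password).

```python
def passwordless_ssh_commands(hosts, user, key_path="~/.ssh/id_ed25519"):
    """Return the shell commands to run on the master node (nothing is executed)."""
    # Generate a key pair with an empty passphrase.
    # Skip this step if a key already exists at key_path.
    commands = [f"ssh-keygen -t ed25519 -N '' -f {key_path}"]
    # Append the public key to authorized_keys on every other node.
    for host in hosts:
        commands.append(f"ssh-copy-id -i {key_path}.pub {user}@{host}")
    return commands


if __name__ == "__main__":
    # Placeholder addresses and user name, for illustration only.
    for cmd in passwordless_ssh_commands(["10.10.1.2", "10.10.1.3"], user="alice"):
        print(cmd)
```

After running the printed commands, `ssh alice@10.10.1.2` from the master node should no longer ask for a password.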

Hopefully this information will help you! @MiraHyc

MiraHyc commented 6 months ago

Thank you very much. I am currently trying to reproduce the results on my own machine, but an "mmap failed" error occurred when I tried to run a YCSB test. I wonder how this problem can be solved. Thanks again for your time and patience!

River861 commented 6 months ago

Hi MiraHyc,

The "mmap failed" error may occur when there are not enough huge pages. To solve this, you can try the following command (it writes to /proc, so it needs root privileges):

echo 36864 > /proc/sys/vm/nr_hugepages
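To see what the command above reserved, the huge page counters in /proc/meminfo can be inspected. The following is a minimal sketch, not part of SMART: it parses the text of /proc/meminfo and converts a page count into GiB. The sample text is made up for illustration, and a 2048 kB huge page size (the usual x86-64 default) is an assumption; with that size, 36864 pages correspond to 72 GiB.

```python
def parse_hugepages(meminfo_text):
    """Extract the huge page counters from the text of /proc/meminfo."""
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize"}
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # The first token is the number; Hugepagesize is followed by "kB".
            info[key] = int(rest.split()[0])
    return info


def reserved_gib(pages, page_kib):
    """Memory reserved by `pages` huge pages of `page_kib` kB each, in GiB."""
    return pages * page_kib / (1024 * 1024)


# Made-up sample; on a real machine, read the text from /proc/meminfo instead.
SAMPLE_MEMINFO = """\
MemTotal:       131072000 kB
HugePages_Total:   36864
HugePages_Free:    36864
Hugepagesize:       2048 kB
"""

if __name__ == "__main__":
    info = parse_hugepages(SAMPLE_MEMINFO)
    print(info)
    print(reserved_gib(info["HugePages_Total"], info["Hugepagesize"]), "GiB")
```

If HugePages_Total stays below the requested value after the echo, the kernel could not find enough free contiguous memory for all of the pages.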

Hopefully this information will help you! @MiraHyc

MiraHyc commented 5 months ago

Yes. Thank you very much.

MiraHyc commented 5 months ago

Sometimes when I run the test, some threads can't bind to cores, and "can't bind core" appears in the terminal together with the other output. I wonder why this happens.

River861 commented 5 months ago

This may happen when binding to a core with an ID that exceeds the number of physical cores. In our code, we calculate the ID of the last core using the CPU_PHYSICAL_CORE_NUM parameter: https://github.com/dmemsys/SMART/blob/1274148c849af987eb7e9815da007af8403d4155/src/Directory.cpp#L32

Thus, you should update the value of CPU_PHYSICAL_CORE_NUM to the actual number of physical CPU cores of your own machine: https://github.com/dmemsys/SMART/blob/1274148c849af987eb7e9815da007af8403d4155/include/Common.h#L23

@MiraHyc

MiraHyc commented 5 months ago

I updated the value of CPU_PHYSICAL_CORE_NUM, but the situation doesn't change.

River861 commented 5 months ago

Could you please tell me how many threads you launch? And how many physical cores does your machine have (using a command like grep "physical id" /proc/cpuinfo | sort | uniq -c | awk '{print $1}')?
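Note that the "physical id" field in /proc/cpuinfo identifies a socket (CPU package), so filtering on it alone yields socket-level numbers rather than a count of physical cores. The sketch below is an illustration rather than SMART code: it separates sockets, physical cores, and logical CPUs by parsing the text of /proc/cpuinfo. It assumes the x86 layout, where each processor entry lists "physical id" before "core id"; other architectures may omit these fields.

```python
def cpu_topology(cpuinfo_text):
    """Count sockets, physical cores, and logical CPUs from /proc/cpuinfo text."""
    sockets, cores, logical = set(), set(), 0
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1                      # one entry per logical CPU
        elif key == "physical id":
            physical_id = value               # socket of the current entry
            sockets.add(value)
        elif key == "core id":
            cores.add((physical_id, value))   # a core is unique per socket
    return {
        "sockets": len(sockets),
        "physical_cores": len(cores),
        "logical_cpus": logical,
    }


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:          # Linux only
        print(cpu_topology(f.read()))
```

On a machine with hyper-threading, "logical_cpus" is typically twice "physical_cores"; "physical_cores" is the number that matches the description of CPU_PHYSICAL_CORE_NUM in this thread.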

@MiraHyc

MiraHyc commented 5 months ago

Oh, I only have 2 physical CPUs, but I have 32 CPU cores. Why should we use the number of physical CPUs instead of logical CPUs?

MiraHyc commented 5 months ago

I changed the number of physical CPUs to 2 and launched 24 threads, but some threads still cannot bind to cores.

River861 commented 5 months ago

We bind each thread to one physical core so that its performance is not affected by context switching, and thus each thread can serve as an individual client. Besides, we only bind to CPUs in the NUMA node close to the RNIC for better performance. https://github.com/dmemsys/SMART/blob/1274148c849af987eb7e9815da007af8403d4155/test/ycsb_test.cpp#L207

Therefore, if the machine has 32 physical cores and 2 NUMA nodes, you should change the value of CPU_PHYSICAL_CORE_NUM to 32, and you can launch at most 16 client threads (the cores of one NUMA node).

Since I still cannot figure out the CPU information of your machine, could you please provide the full output of the commands cat /proc/cpuinfo and numactl --hardware?
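The arithmetic behind the 16-thread limit can be written down as a short check. This is a sketch under the assumption that the physical cores are split evenly across the NUMA nodes and that only one node's cores are used for client threads, as described above; the function names are hypothetical and do not appear in SMART.

```python
def max_client_threads(cpu_physical_core_num, numa_node_num):
    """Cores on one NUMA node, assuming an even split across nodes."""
    if numa_node_num <= 0:
        raise ValueError("numa_node_num must be positive")
    return cpu_physical_core_num // numa_node_num


def can_bind_all(thread_num, cpu_physical_core_num, numa_node_num):
    """True if every client thread can get its own core on one NUMA node."""
    return thread_num <= max_client_threads(cpu_physical_core_num, numa_node_num)


if __name__ == "__main__":
    print(max_client_threads(32, 2))   # 32 cores over 2 NUMA nodes
    print(can_bind_all(24, 32, 2))     # 24 threads, as tried earlier in this thread
```

Under these assumptions, launching 24 threads on a machine with 32 physical cores and 2 NUMA nodes exceeds the limit, which would be consistent with some threads failing to bind.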

MiraHyc commented 5 months ago

When I tried to debug the code, I encountered a "can not support device memory" problem. Could you please tell me how I can fix it? Thank you very much.

River861 commented 5 months ago

Did you turn off the write combining technique (i.e., the WRITE_COMBINING compile option)? When it is turned off, we adopt the HOCL design of Sherman, which leverages the on-chip memory of RNICs. Since your RNICs do not support on-chip memory, the "can not support device memory" error is raised.

To solve this problem, either turn on the write combining technique by compiling the code with cmake -DWRITE_COMBINING=on .. or use RNICs that have on-chip memory.

@MiraHyc

MiraHyc commented 5 months ago

Yes, but now I encounter a timeout problem.

River861 commented 5 months ago

You should TURN ON the WRITE_COMBINING option with cmake -DWRITE_COMBINING=on .. if your RNIC cannot support device memory.

As for the time-out problem, could you please provide more details? Which workload did you run? What is the entire command you executed? What is the entire error message?

I'd appreciate it if you could provide the details so that I can help you locate the bug. @MiraHyc

MiraHyc commented 5 months ago

I just ran fig.3a in the exp folder with the default settings and only changed the write combining option. The workload is YCSB C. The whole error message is below.

River861 commented 5 months ago

I still cannot see the whole message. It seems that the picture you uploaded fails to show on GitHub. @MiraHyc

MiraHyc commented 5 months ago

Yes, I see, but I don't know why. So I have copied the error below:

Error! Retry... Function all_long_execute (args=(<utils.cmd_manager.CMDManager object at 0x7feb470c0710>, 'cd /home/zjh/SMART/build && ./ycsb_test 2 4 2 email c', 2)) (kwargs={}) timed out after 600.000000 seconds.

River861 commented 5 months ago

It seems that you have changed the parameters in exp/params/fig_3a.json. Please show me the changes. If you have also modified the script exp/fig_3a.py (to change the write combining option), please show me the changes too. By the way, have you downloaded the email workloads?

@MiraHyc

MiraHyc commented 5 months ago

I only changed the CN number in the JSON file, as below:

"client_num": [[2, 4], [2, 8], [2, 16], [2, 24], [2, 32]],

And I added one line in fig_3a.py, as below:

cmake_option = cmake_options[method].replace('-DWRITE_COMBINING=off', '-DWRITE_COMBINING=on')

Yes, I downloaded the email workload and put it in the ycsb directory.
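One thing to keep in mind with the replace call quoted above is that str.replace silently does nothing when the searched text is absent. The sketch below shows a more defensive variant; the helper name and the sample option strings are hypothetical, since the actual contents of cmake_options in fig_3a.py are not shown in this thread.

```python
def force_write_combining(cmake_option):
    """Return cmake_option with -DWRITE_COMBINING=on guaranteed to be present."""
    off, on = "-DWRITE_COMBINING=off", "-DWRITE_COMBINING=on"
    if off in cmake_option:
        # Same effect as the replace call quoted in the thread.
        return cmake_option.replace(off, on)
    if on in cmake_option:
        return cmake_option
    # The option was not set at all, so append it explicitly.
    return f"{cmake_option} {on}".strip()


if __name__ == "__main__":
    # Hypothetical option strings, for illustration only.
    print(force_write_combining("-DEXAMPLE_FLAG=on -DWRITE_COMBINING=off"))
    print(force_write_combining("-DEXAMPLE_FLAG=on"))
```

Printing the resulting option string before invoking cmake makes it easy to confirm which value of WRITE_COMBINING the build actually receives.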

River861 commented 5 months ago

I find that it is caused by the implementation of ROWEX in naive ART, where we adopt the HOCL of Sherman for better performance: https://github.com/dmemsys/SMART/blob/1274148c849af987eb7e9815da007af8403d4155/src/Tree.cpp#L590C1-L596C5

Since the HOCL requires acquiring locks inside the on-chip memory, which is unavailable in your RNICs, the lock operation fails and the program is blocked. So the naive ART (fig.3a) code cannot run on your testbed. Therefore, I suggest you just run the SMART code with all options enabled or change your testbed (e.g., use CloudLab).

@MiraHyc

MiraHyc commented 5 months ago

I use a Mellanox ConnectX-4 NIC for the experiment. I didn't find any information about on-chip memory in the data sheet, but I think these NICs are usually equipped with on-chip memory. I wonder what else could contribute to this error?

River861 commented 5 months ago

Mellanox ConnectX-5 NICs or above are needed to access the on-chip memory, as mentioned in the repo of Sherman. If you don't want to use CloudLab (which I really recommend), you should modify the HOCL code to resolve the interface incompatibility.

@MiraHyc

MiraHyc commented 5 months ago

Thank you so much. I will try to use CloudLab. I gave up on this platform earlier because the connection is not very stable and nodes have to be reserved in advance, which is not so convenient for me.

River861 commented 5 months ago

You're welcome.

MiraHyc commented 5 months ago

Could you please provide the exact type of RNIC you use, such as the specific model on the official website? We really want to reproduce your results in our own experiment environment.