SJTU-IPADS / drtmh

Fast In-memory Transaction Processing using Hybrid RDMA Primitives

link_connect_qps() retries to link all qps forever #3

Open Alchem-Lab opened 5 years ago

Alchem-Lab commented 5 years ago

Hi rocc developers,

I am trying to use the rocc framework in my own code to support RDMA-based communication. Basically, I use the RWorker class as the base class in my own code to model thread creation and routine scheduling. My thread class inherits from RWorker, just as bench_workers does in rocc's code. However, I ran into some difficulties with the rocc framework and the librdma library.

As a simple demo, I started one server node and one client node. The server node spawns 4 RWorkers and the client node spawns 3 RWorkers. By the time the client has finished initializing all of its RWorkers, some of the server workers are stuck in rdmaio::RdmaCtrl::link_connect_qps(): they cannot connect their QPs to the other node, so they retry forever (see the while(1) loop in link_connect_qps()). Essentially, PreConnector::get_send_socket(), which is called by Qp::connect_rc(), always returns a negative socket value and thus causes the next retry. Even though the recv_thread spawned by the librdma library keeps accepting new TCP connection requests, get_send_socket() consistently fails.

I noticed that link_connect_qps() retries every 200ms until all QPs in the cluster are linked. Is this guaranteed to work correctly? In my case, it retries forever. I am wondering if you have any idea how to solve this issue. Thank you!

wxdwfc commented 5 years ago

Hi,

Thanks for trying ROCC!

Basically, ROCC is guaranteed to work correctly if its assumption is satisfied. I think the main problem is that RWorker assumes each machine has the same number of worker threads. I see that you create 3 workers on the client, while the server has 4. So the 4th thread on the server will try to connect to the QP created by the 4th thread on the client (since each QP can only be connected once). That thread does not exist on the client, so the server thread gets stuck.

The simple solution is to remove this code and manually connect QPs using the connect() method in RLib. You can refer to LibRDMA's readme to manually connect QPs (https://github.com/wxdwfc/rlib/).
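
To make the failure mode above concrete, here is a minimal sketch of the same-id pairing assumption. It is a standalone model, not rocc or librdma code: the function names `workers_without_peer` and `worker_counts_match` are hypothetical, and worker ids are assumed to start at 0.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model, not rocc code: worker i on the local node links only to
// worker i on the remote node, as described in the reply above.
// Returns the local worker ids that have no same-id peer and would therefore
// keep retrying in a link loop that has no retry limit.
std::vector<int> workers_without_peer(int local_workers, int remote_workers) {
    assert(local_workers >= 0 && remote_workers >= 0);
    std::vector<int> stuck;
    for (int id = 0; id < local_workers; ++id) {
        if (id >= remote_workers) {
            stuck.push_back(id);  // no worker with this id on the remote node
        }
    }
    return stuck;
}

// Startup guard: true only when every node runs the same number of workers.
bool worker_counts_match(const std::vector<int>& workers_per_node) {
    for (int count : workers_per_node) {
        if (count != workers_per_node.front()) {
            return false;
        }
    }
    return true;
}
```

With 4 workers on the server and 3 on the client, `workers_without_peer(4, 3)` reports worker id 3 (the 4th thread), which matches the thread that never links. A check like `worker_counts_match` before linking would turn the endless retry into an early, visible error.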


albertghtoun commented 5 years ago

Thanks for clarifying the problem. I have solved it by using the same number of RWorker threads on all machines in the cluster. With the same number of RWorkers, does rocc assume that one RWorker's qp only connects to the corresponding qp of the same RWorker (i.e., the same worker id) on all machines in the cluster? And thus only threads with the same id can communicate through rdma, am I correct?

I am now using the old librdma library and haven't tried the new rlib library yet. But I want to have one sender thread on one machine communicate with a receiver thread on another machine using rdma. These two threads may have different ids. Is it possible to use the rocc framework in such a use case?

Thanks!

wxdwfc commented 5 years ago

> one RWorker's qp only connects to the corresponding qp of the same RWorker (i.e., the same worker id) in the cluster for all machines?

Yes, in the default RWorker we assume this for ease of use.

> And thus only threads with the same id can communicate through rdma, am I correct?

No. This is because RWorker contains 2 parts:

> is it possible to use rocc framework in such a use case?

Yes, it's possible, and it is not restricted by RWorker. You only need to change the way QPs are created and connected to adapt RWorker to your case (the first part). For such a use case, you should use the connect method, which makes it possible to connect to an arbitrary QP. This is provided in our new lib.

Best,
XingDa Wei
The Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
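
As an illustration of connecting threads with different ids, the sketch below plans explicit links between (node, worker) pairs before any QP is connected. It relies on the point made earlier in the thread that each QP can only be connected once, so every link gets its own QP index on both ends. All names (`Endpoint`, `Link`, `PlannedLink`, `plan_links`) are hypothetical; this is not the rocc or RLib API, and the actual connect call is left out.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// All names here are hypothetical; this is not the rocc or RLib API.
struct Endpoint {
    int node;    // machine id
    int worker;  // worker (thread) id on that machine
};

struct Link {
    Endpoint from;
    Endpoint to;
};

// One QP per link end; each qp index is local to the (node, worker) owning it.
struct PlannedLink {
    Link link;
    int from_qp;
    int to_qp;
};

// workers_per_node[n] is the number of workers running on node n.
bool endpoint_exists(const Endpoint& e, const std::vector<int>& workers_per_node) {
    return e.node >= 0 && e.node < static_cast<int>(workers_per_node.size()) &&
           e.worker >= 0 && e.worker < workers_per_node[e.node];
}

// Assigns a fresh QP index on both ends of every requested link.
// Returns false and leaves the plan empty if any endpoint does not exist,
// so a bad configuration fails up front instead of retrying forever.
bool plan_links(const std::vector<Link>& links,
                const std::vector<int>& workers_per_node,
                std::vector<PlannedLink>& plan) {
    plan.clear();
    std::map<std::pair<int, int>, int> next_qp;  // (node, worker) -> next free qp
    for (const Link& l : links) {
        if (!endpoint_exists(l.from, workers_per_node) ||
            !endpoint_exists(l.to, workers_per_node)) {
            plan.clear();
            return false;
        }
        int from_qp = next_qp[{l.from.node, l.from.worker}]++;
        int to_qp = next_qp[{l.to.node, l.to.worker}]++;
        plan.push_back({l, from_qp, to_qp});
    }
    assert(plan.size() == links.size());
    return true;
}
```

For example, a sender at node 0, worker 0 can be planned to reach workers 2 and 1 on node 1; it then uses QP 0 for the first link and QP 1 for the second, while each receiver uses its own QP 0.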
