Open wangshuaizs opened 6 years ago
I never tried that. Sorry.
OK, thank you anyway!
I have run into another problem. I ran a simulation in which server nodes 0-126 connect to one Broadcom switch, and server node 0 sends 1 packet (payload size = 1000) to each of the other server nodes. The output prints some warnings: "WARNING: Drop because egress Port buffer full, WARNING: Drop because egress Q buffer full, WARNING: Drop because egress SP buffer full". I expected to see retransmissions, but I cannot find any retransmission in mix.tr.
Also, when I increase the number of server nodes to 129, so that server node 0 sends 1 packet to each of server nodes 1-128, main.exe crashes with an error message like “0x0000010000001000 access violation occurs when the reading position.”
Does that mean I cannot simulate more than 127 flows from one server simultaneously? I have tried digging into your source code, but I found nothing to support this assumption. Could you please give me some suggestions? Thank you!
The main issue is on the switch node, not on the servers/flows.
I hard-coded a max port number of 64 per switch because this is what we had in practice (64-port switches). You may try to raise this. https://github.com/bobzhuyb/ns3-rdma/blob/master/src/network/model/broadcom-node.h#L59
Once you raise this, the switch buffer may run out easily -- remember PFC requires certain buffer headroom per port to operate, otherwise PFC cannot prevent packet losses. You may need to reconfigure buffer thresholds/capacity in https://github.com/bobzhuyb/ns3-rdma/blob/master/src/network/model/broadcom-node.cc
If you want to test 128->1 or even more intensive incast, I recommend sticking with 64-port switches and using a multi-hop topology. The congestion point will be at the last hop anyway, so you don't need to worry about the above issues on the switch.
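To make the headroom point above concrete, here is a rough back-of-the-envelope sketch. The function name, the parameters, and all numbers are hypothetical and are not taken from broadcom-node.cc. It simply assumes that headroom is reserved once per port and lossless priority, and shows how much buffer is left after that reservation.

```cpp
#include <cstdint>

// Hypothetical helper, not part of ns3-rdma.
// Returns the bytes left for ordinary buffering after reserving PFC headroom
// for every (port, lossless priority) pair. A negative result means the
// headroom alone no longer fits into the switch buffer.
int64_t SharedBytesAfterHeadroom(int64_t totalBufferBytes,
                                 int64_t ports,
                                 int64_t losslessPriorities,
                                 int64_t headroomBytesPerQueue)
{
    const int64_t reserved = ports * losslessPriorities * headroomBytesPerQueue;
    return totalBufferBytes - reserved;
}
```

With made-up values of a 9,000,000-byte buffer, 40,000 bytes of headroom, and one lossless priority, 64 ports leave 6,440,000 bytes. Doubling both the port count and the number of lossless priorities makes the result negative, which is the situation where PFC can no longer prevent drops.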
@bobzhuyb
I tried to create a topology with 2 servers, server 0 and server 1, connected to each other directly. Server 1 established 200 RDMA flows to server 0 at the same time, but Visual Studio reported a memory access violation. Is it a bug?
Thank you!
I don't remember any hard-coded limitation for the number of flows per server... but I may be wrong. What is the maximum number of flows that does not have this problem? 128? 64?
@bobzhuyb
In my test, 127 flows are ok, but 128 flows aren't.
The problem is caused by a parameter in point-to-point/model/qbb-net-device.h, where you will find `static const uint32_t fCnt = 128; // Max number of flows on a NIC, for TX and RX respectively. TX+RX=fCnt*2.` You can increase this, but there is another problem: when a flow finishes and you start a new flow, the same problem appears again, because there is no queue recovery mechanism.
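One possible shape for such a queue recovery mechanism is a small free list of queue indices. This is only a sketch: `QueueIndexPool` and its methods are hypothetical names, not existing ns3-rdma code, and wiring it into the NIC would still require calling `Release` at the point where a flow completes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of ns3-rdma: hands out per-flow queue
// indices from a fixed-size pool and takes them back when a flow finishes,
// so a long simulation does not run out of indices.
class QueueIndexPool {
public:
    explicit QueueIndexPool(uint32_t capacity) : m_inUse(capacity, false) {
        // Push indices in reverse order so that index 0 is handed out first.
        m_free.reserve(capacity);
        for (uint32_t i = capacity; i > 0; --i) {
            m_free.push_back(i - 1);
        }
    }

    // Stores a free index in idx and returns true, or returns false
    // (leaving idx untouched) when every index is currently in use.
    bool Acquire(uint32_t& idx) {
        if (m_free.empty()) {
            return false;
        }
        idx = m_free.back();
        m_free.pop_back();
        m_inUse[idx] = true;
        return true;
    }

    // Call when a flow completes so that its index can be reused.
    void Release(uint32_t idx) {
        assert(idx < m_inUse.size() && m_inUse[idx]);
        m_inUse[idx] = false;
        m_free.push_back(idx);
    }

    // Number of indices that can still be acquired.
    uint32_t Available() const {
        return static_cast<uint32_t>(m_free.size());
    }

private:
    std::vector<uint32_t> m_free;  // indices not assigned to any flow
    std::vector<bool> m_inUse;     // guards against double release
};
```

With a pool like this, the limit would apply to concurrently active flows rather than to the total number of flows started during a run.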
Thanks @hdtjiang for the explanation. This is indeed something that needs to be improved.
Thanks @hdtjiang for your reply. I think the parameter in network/utils/broadcom-egress-queue.h should also be increased accordingly:
static const unsigned fCnt = 128; //max number of queues, 128 for NICs
Hi, on Ubuntu, PyViz and NetAnim can be used to visualize simulations. Is there any tool supported by the ns3-rdma project for visualizing a simulation? I tried generating an .xml file in the simulation and then opening it on Ubuntu, but NetAnim told me "This XML format is not supported. Minimum Version:3.106" (the version of NetAnim I used is 3.107, and the version of ns-3 is 3.26). Do you have any suggestions about visualization? Thank you in advance!