bosilca opened this issue 11 years ago
George, this problem should be fixed in my branch (gvallee/sock). There is also a test (src/tests/connect_reject.c) that exercises it. Let me know if you run into any other problems. Thanks,
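For readers following along, here is a minimal sketch of what the server side of such a connect/reject test might look like against the public CCI API. This is not the actual src/tests/connect_reject.c; endpoint setup (cci_init/cci_create_endpoint) is assumed done and error handling is abbreviated.

```c
#include <stdio.h>
#include <cci.h>

/* Reject every incoming connection request on the given endpoint. */
static void reject_all(cci_endpoint_t *endpoint)
{
    while (1) {
        cci_event_t *event = NULL;
        int ret = cci_get_event(endpoint, &event);

        if (ret == CCI_EAGAIN)
            continue;                     /* nothing pending yet */
        if (ret != CCI_SUCCESS) {
            fprintf(stderr, "cci_get_event failed: %d\n", ret);
            break;
        }
        if (event->type == CCI_EVENT_CONNECT_REQUEST) {
            /* The problematic path: the tx generated here has no
             * connection attached yet, which is what the analysis
             * later in this thread points at. */
            ret = cci_reject(event);
            if (ret != CCI_SUCCESS)
                fprintf(stderr, "cci_reject failed: %d\n", ret);
        }
        cci_return_event(event);
    }
}
```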
Guys, is this fixed?
Scott
This segfault is fixed and pushed to master. I have another pending improvement related to reject.
On the trunk I now have another issue which might be related to the way connections are released when using the sock device. I use mpi_ping.c but changed the startup size to 16k. At this message size the first message uses the rendezvous protocol and thus forces an RDMA transfer from the beginning. I consistently get a segfault deep inside the CCI device, always triggered by the CCI internal thread. The problem seems to be an incorrect connection pointer in what otherwise looks like a legit sconn structure.
Here is a backtrace:

```
#0 0x0000000104aac01b in cci_conn_is_reliable (conn=0x3000000000000000) at cci_lib_types.h:145
#1 0x0000000104ab21d6 in pack_piggyback_ack (ep=0x7f8c0ac636f0, sconn=0x7f8c0c003690, tx=0x104af9000) at ctp_sock_api.c:2169
#2 0x0000000104ab1d42 in sock_progress_pending (ep=0x7f8c0ac636f0) at ctp_sock_api.c:2106
#3 0x0000000104ab2eaa in sock_progress_sends (ep=0x7f8c0ac636f0) at ctp_sock_api.c:2467
#4 0x0000000104abaed0 in sock_progress_thread (arg=0x7f8c0ac636f0) at ctp_sock_api.c:5070
#5 0x00007fff8c239742 in _pthread_start ()
#6 0x00007fff8c226181 in thread_start ()
```
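For context: the size-based switch George describes means the very first 16k ping already takes the rendezvous path, so its tx sits on the transport's pending queue while the RMA completes. A generic sketch of that decision follows; the 16k threshold and names here are illustrative, not the sock transport's actual constants (the real limit is the connection's negotiated max_send_size).

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed eager/rendezvous threshold for illustration only. */
#define EAGER_LIMIT (16 * 1024)

/* Below the limit, the payload is copied into a tx buffer and sent
 * immediately (eager). At or above it, only a request goes out and
 * the payload follows via RMA (rendezvous), leaving the tx on the
 * pending queue -- the queue sock_progress_pending() was walking in
 * frame 2 when it hit the bad connection pointer. */
static bool needs_rendezvous(size_t msg_len)
{
    return msg_len >= EAGER_LIMIT;
}
```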
George,
Can you try with tcp as well?
Geoffroy, can you take a look at this?
Scott
Works fine with TCP. There seems to be some data scrambling on the wire, though; I'll look into it using TCP until sock is fixed.
I just tried with my branch and did not get any segfault. I will try with master in a moment. BTW, the BTL code with the sock transport sometimes tries to send more data than it is allowed to.
The behavior I described in the ticket was with the master.
I would definitely be interested in hearing about the case where it sends more data. Can you please elaborate a little on the circumstances under which this happens?
Thanks, George.
I cannot elaborate much more yet since the overall behavior of the test is not consistent from one run to another, which makes debugging more difficult. As soon as I have more details, I will put them here. Also, I will very soon push to my branch a few modifications to the sock transport that correctly handle the situation: it returns an error if the payload size is too big.
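A sketch of the kind of check Geoffroy describes, assuming it sits at the top of the transport's send path and compares against the standard cci_connection_t max_send_size field; CCI's errno-style CCI_EMSGSIZE seems the natural status to return.

```c
#include <cci.h>

/* Hypothetical guard for an oversized payload: refuse up front
 * rather than letting the payload overrun a fixed-size tx buffer. */
static int check_send_size(cci_connection_t *connection, uint32_t len)
{
    if (len > connection->max_send_size)
        return CCI_EMSGSIZE;
    return CCI_SUCCESS;
}
```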
The use of cci_reject with sock leads to either a deadlock or a segfault. The same code works fine with verbs.
Here is a stack trace from when it segfaults (it doesn't always segfault; sometimes it just deadlocks).
```
#0 0x00007f82f945da26 in sock_progress_queued (ep=0xa636a0)
#1 0x00007f82f945dd55 in sock_progress_sends (ep=0xa636a0)
#2 0x00007f82f946571f in sock_progress_thread (arg=0xa636a0)
#3 0x00007f82ffa91b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
```
Digging deeper into the core, I think I identified the issue. When the rx event (type cci_event_connect_request_t) is created, it is correctly initialized. On the cci_reject call, since no connection exists yet, the newly created tx event is tagged with a connection set to NULL. If this event ends up in the queues and gets processed later by sock_progress_queued, the NULL connection is just asking for trouble.
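If that analysis is right, one minimal fix is to make sock_progress_queued tolerate a tx whose connection is legitimately NULL. A rough sketch under that assumption, with hypothetical stand-ins for the transport's internal types (the real code base uses the BSD TAILQ macros, which the sketch reuses):

```c
#include <sys/queue.h>

/* Hypothetical stand-ins for the sock transport's internal types. */
typedef struct conn { int pending_acks; } conn_t;

typedef struct tx {
    conn_t *conn;            /* NULL when created by cci_reject() */
    TAILQ_ENTRY(tx) entry;
} tx_t;

TAILQ_HEAD(tx_queue, tx);

/* Guarded walk over the queued sends: skip per-connection
 * bookkeeping when no connection exists, instead of dereferencing
 * the NULL pointer. */
static void progress_queued(struct tx_queue *queued)
{
    tx_t *tx;

    while ((tx = TAILQ_FIRST(queued)) != NULL) {
        TAILQ_REMOVE(queued, tx, entry);
        if (tx->conn != NULL)
            tx->conn->pending_acks++;   /* stand-in for ack/seq tracking */
        /* ... hand the packet to the wire regardless ... */
    }
}
```

The alternative would be to stamp reject txs with a distinct message type so the progress path never treats them as connection traffic in the first place.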