bosilca opened this issue 11 years ago
George, this problem should be fixed in my branch (gvallee/sock). There is also a test (src/tests/connect_reject.c) that exercises it. Let me know if you run into any other problems. Thanks,
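For readers following along, here is a minimal sketch of what the server side of such a connect/reject test might look like against the public CCI API. This is not the actual src/tests/connect_reject.c; endpoint setup (cci_init/cci_create_endpoint) is assumed done and error handling is abbreviated.

```c
#include <stdio.h>
#include <cci.h>

/* Reject every incoming connection request on the given endpoint. */
static void reject_all(cci_endpoint_t *endpoint)
{
    while (1) {
        cci_event_t *event = NULL;
        int ret = cci_get_event(endpoint, &event);

        if (ret == CCI_EAGAIN)
            continue;                     /* nothing pending yet */
        if (ret != CCI_SUCCESS) {
            fprintf(stderr, "cci_get_event failed: %d\n", ret);
            break;
        }
        if (event->type == CCI_EVENT_CONNECT_REQUEST) {
            /* The problematic path: the tx generated here has no
             * connection attached yet, which is what the analysis
             * later in this thread points at. */
            ret = cci_reject(event);
            if (ret != CCI_SUCCESS)
                fprintf(stderr, "cci_reject failed: %d\n", ret);
        }
        cci_return_event(event);
    }
}
```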
Guys, is this fixed?
Scott
This segfault is fixed and pushed to master. I have another pending improvement related to reject.
On the trunk I now have another issue which might be related to the way connections are released when using the sock device. I use mpi_ping.c but changed the startup size to 16k. At this message size the first message uses the rendezvous protocol and thus forces an RDMA transfer from the beginning. I consistently get a segfault deep inside the CCI device, always triggered by the CCI internal thread. The problem seems to be an incorrect connection pointer in what otherwise looks like a legit sconn structure.
Here is a backtrace:

```
#0 0x0000000104aac01b in cci_conn_is_reliable (conn=0x3000000000000000) at cci_lib_types.h:145
#1 0x0000000104ab21d6 in pack_piggyback_ack (ep=0x7f8c0ac636f0, sconn=0x7f8c0c003690, tx=0x104af9000) at ctp_sock_api.c:2169
#2 0x0000000104ab1d42 in sock_progress_pending (ep=0x7f8c0ac636f0) at ctp_sock_api.c:2106
#3 0x0000000104ab2eaa in sock_progress_sends (ep=0x7f8c0ac636f0) at ctp_sock_api.c:2467
#4 0x0000000104abaed0 in sock_progress_thread (arg=0x7f8c0ac636f0) at ctp_sock_api.c:5070
#5 0x00007fff8c239742 in _pthread_start ()
#6 0x00007fff8c226181 in thread_start ()
```
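For context: the size-based switch George describes means the very first 16k ping already takes the rendezvous path, so its tx sits on the transport's pending queue while the RMA completes. A generic sketch of that decision follows; the 16k threshold and names here are illustrative, not the sock transport's actual constants (the real limit is the connection's negotiated max_send_size).

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed eager/rendezvous threshold for illustration only. */
#define EAGER_LIMIT (16 * 1024)

/* Below the limit, the payload is copied into a tx buffer and sent
 * immediately (eager). At or above it, only a request goes out and
 * the payload follows via RMA (rendezvous), leaving the tx on the
 * pending queue -- the queue sock_progress_pending() was walking in
 * frame 2 when it hit the bad connection pointer. */
static bool needs_rendezvous(size_t msg_len)
{
    return msg_len >= EAGER_LIMIT;
}
```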
George,
Can you try with tcp as well?
Geoffroy, can you take a look at this?
Scott
Works fine with TCP. There seems to be some data scrambling on the wire, though; I'll look into it using TCP until sock is fixed.
I just tried with my branch and did not get any segfault. I will try with master in a moment. BTW, the BTL code with the sock transport sometimes tries to send more data than it is allowed to.
The behavior I described in the ticket was with the master.
I would definitely be interested in hearing about the case where it sends more data. Can you please elaborate a little on the circumstances under which this happens?
Thanks, George.
I cannot elaborate much more yet since the overall behavior of the test is not consistent from one run to another, which makes debugging more difficult. As soon as I have more details, I will put them here. Also, I will very soon push to my branch a few modifications to the sock transport that correctly handle the situation: it returns an error if the payload size is too big.
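A sketch of the kind of check Geoffroy describes, assuming it sits at the top of the transport's send path and compares against the standard cci_connection_t max_send_size field; CCI's errno-style CCI_EMSGSIZE seems the natural status to return.

```c
#include <cci.h>

/* Hypothetical guard for an oversized payload: refuse up front
 * rather than letting the payload overrun a fixed-size tx buffer. */
static int check_send_size(cci_connection_t *connection, uint32_t len)
{
    if (len > connection->max_send_size)
        return CCI_EMSGSIZE;
    return CCI_SUCCESS;
}
```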
The use of cci_reject with sock leads to either a deadlock or a segfault. The same code works fine with verbs.
Here is a stack trace from when it segfaults (it doesn't always segfault; sometimes it just deadlocks).
```
#0 0x00007f82f945da26 in sock_progress_queued (ep=0xa636a0)
#1 0x00007f82f945dd55 in sock_progress_sends (ep=0xa636a0)
#2 0x00007f82f946571f in sock_progress_thread (arg=0xa636a0)
#3 0x00007f82ffa91b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
```
Digging deeper into the core, I think I identified the issue. When the rx event (type cci_event_connect_request_t) is created, it is correctly initialized. On the cci_reject call, since no connection exists yet, the newly created tx event is tagged with a connection set to NULL. If this event ends up in the queues and gets processed later by sock_progress_queued, the NULL connection is just asking for trouble.
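If that analysis is right, one minimal fix is to make sock_progress_queued tolerate a tx whose connection is legitimately NULL. A rough sketch under that assumption, with hypothetical stand-ins for the transport's internal types (the real code base uses the BSD TAILQ macros, which the sketch reuses):

```c
#include <sys/queue.h>

/* Hypothetical stand-ins for the sock transport's internal types. */
typedef struct conn { int pending_acks; } conn_t;

typedef struct tx {
    conn_t *conn;            /* NULL when created by cci_reject() */
    TAILQ_ENTRY(tx) entry;
} tx_t;

TAILQ_HEAD(tx_queue, tx);

/* Guarded walk over the queued sends: skip per-connection
 * bookkeeping when no connection exists, instead of dereferencing
 * the NULL pointer. */
static void progress_queued(struct tx_queue *queued)
{
    tx_t *tx;

    while ((tx = TAILQ_FIRST(queued)) != NULL) {
        TAILQ_REMOVE(queued, tx, entry);
        if (tx->conn != NULL)
            tx->conn->pending_acks++;   /* stand-in for ack/seq tracking */
        /* ... hand the packet to the wire regardless ... */
    }
}
```

The alternative would be to stamp reject txs with a distinct message type so the progress path never treats them as connection traffic in the first place.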