CESNET / libnetconf

C NETCONF library
Other
113 stars 83 forks source link

crash when deleting session #205

Closed ntadas closed 8 years ago

ntadas commented 8 years ago

Hi,

I have sometimes a crash, when some network problems occur and I have several sessions connected and monitored at the same time. this is not so easy to reproduce, so also not easy to analyse. Here is what I already found until now and my suspicions:

with this evidences it seems that the lib detects the failure before my server and deletes the session. then I try to delete the same session and I get this crash (at least this is my theory for now).

In the beginning of nc_session_free we can see the check if (session == NULL) but session is never set to NULL inside the lib (or at least I didn't find where).

The struct nc_session* that I give to the nc_session_free is obtained when the session is establish with the method nc_session_accept_inout

Since the session is never set to NULL inside the lib, should I check some flag inside nc_session struct before I call the nc_session_free?

thanks Best Regards

rkrejci commented 8 years ago

Are you on current master (59f5ba5)? The mentioned lines do not match what you describe (e.g. 1418 is comment). Ideally, try to test it with deadlockfix branch which changes nc_session_free() function.

The session == NULL check is just a parameter check to avoid referencing NULL pointer. The session is freed at the end of the function, but since the app holds the variable, it is up to it to set NULL to the variable (or do not use the variable after calling nc_session_free()). Internal libnetconf functions actually never call nc_session_free() on the sessions avavilable to the application. In case of error, the session is invalidated by nc_session_close(), but the structure is still available to the application which is supposed to call nc_session_free() to finnish the cleanup. Valgrind's stack at the moment of crash would be very helpful, but I understand that it is challenging to reproduce the issue.

ntadas commented 8 years ago

yes I'm in the master branch, sorry I've added some debug logs and forgot about them when writting the issue. session.c line 1418 is in fact 1423 : free(session->stats);

valgrind is not possible right now because I'm only able to reproduce this in target (ppc) until now I was not able to reproduce it in x86.

regarding setting the variable to NULL sure, I'm doing it. but how can I check if I can call the nc_session_free method? is it possible that the lib is closing/deleting the session and I'm trying to delete a session that doesn't exist? from the debug I've done, when this happens this if: if (strcmp(litem->session_id, session->session_id) == 0) is never true, so when this one is: if (litem->offset_next == 0) we get the crash here: free(session->stats);

rkrejci commented 8 years ago

Actually you can call nc_session_free() anytime you want. When an error is detected by some internal function, the session's transport session and some other stuff is destroyed and session's status is set to NC_SESSION_STATUS_CLOSED (you can check this via nc_session_get_status()). From this time, libnetconf will not able to use the session for anything meaningfull (getter functions should work, but you won't be able to send anything via the session, it is closed).

I have one idea about the issue, just please answer me the following questions:

  1. how do you call nc_init()? I mean, what flags do you pass to the function
  2. that PPC target, what OS is there? Is there /proc/ filesystem?
ntadas commented 8 years ago

1- I call nc_init with (NC_INIT_ALL & ~NC_INIT_NACM) | NC_INIT_MULTILAYER 2- Windriver Linux 32 bits, yes we have a /proc/ filesystem

rkrejci commented 8 years ago

hmm, then it is weird that the session is not found in the session_list (session.c:1399-1402). There are only 2 situations when the session is removed from the list:

  1. nc_session_free(), but that can be called just once on the session
  2. nc_session_alive_monitor(), but it should happen only if the process accepted the session is not alive (checked by PID and records in /proc/ - session.c:497)

Calling free(session->stats) when all items in session_list were checked and the session wasn't found there should be correct. Without session monitoring, the session structure has some allocated memory in stats, so it is supposed to be free. But in your case you shouldn't get there. According to nc_init(), the session monitoring is enabled, so the stats points to the shared memory block and the session should be always in the session_list - as I mentioned, the only other way to remove the session from the list is that the process that accepted the session is no more running and it was detected by nc_session_alive_monitor() (which runs whenever you get status data by <get> operation)..

ntadas commented 8 years ago

or could it be that during the delete/create of other sessions the prev and next offset are not correctly linked so the session that I'm trying to delete is not found while iterating in the list but in fact its there (just not linked to the others)?

rkrejci commented 8 years ago

yes, bug in session_list manipulation would explain that

ntadas commented 8 years ago

after adding some logs I found something strange. it seems that this some session are being closed by nc_session_monitor_alive_check but I really don't understand why, since the process that launches all the sessions is the same.

rkrejci commented 8 years ago

Do you know which part of the nc_session_monitor_alive_check() causes it? I mean, which of those 3 nc_session_monitor_remove() calls is used.

ntadas commented 8 years ago

unfortunately not, I add prints in nc_session_monitor_remove and in nc_session_free only.

I notice that nc_session_monitor_remove is called not from the nc_session_free and when session free is being called (afterwards) I get negative session_list->count

also this is only happening in ppc.

ntadas commented 8 years ago

its the last one (pfd == NULL)

this happens everytime I do a get all on target (ppc) and the number of sessions is decreased.

rkrejci commented 8 years ago

can you provide the output of the following command on the target system? The PID is PID number of your server process accepting the NETCONF sessions.

sudo ls -l /proc/PID/fd

There should be a symlink to libnetconf_sessions.bin file which should correspond to the aux string in the nc_session_monitor_alive_check() function.

rkrejci commented 8 years ago

If not, then, during the libnetconf init, there should be some error message from libnetconf about "sessions monitoring file" or the "session list".

ntadas commented 8 years ago

I think I got the problem:

lrwx------ 1 root root 64 May 3 00:41 20 -> /var/volatile/tmp/netconf/libnetconf_sessions.bin lrwx------ 1 root root 64 May 3 00:41 21 -> /var/volatile/tmp/netconf/libnetconf_sessions.bin

but the files aux is /tmp/netconf/libnetconf_sessions.bin and the file exists in /tmp/netconf/libnetconf_sessions.bin (no error in init)

if I do a "readlink -f /tmp/mf_communication_log.log" returns: /var/volatile/tmp/mf_communication_log.log

/tmp in my system is a link to volatile/tmp

I think in the lib code you should have a readlink of the aux to get the canonical file name

ntadas commented 8 years ago

or I can change it to /var/volatile/tmp/netconf/libnetconf_sessions.bin

rkrejci commented 8 years ago

I'm going to fix it

ntadas commented 8 years ago

thanks a lot :)