Closed ntadas closed 8 years ago
Are you on current master (59f5ba5)? The mentioned lines do not match what you describe (e.g. 1418 is comment). Ideally, try to test it with deadlockfix
branch which changes nc_session_free()
function.
The session == NULL
check is just a parameter check to avoid referencing NULL pointer. The session is freed at the end of the function, but since the app holds the variable, it is up to it to set NULL to the variable (or do not use the variable after calling nc_session_free()). Internal libnetconf functions actually never call nc_session_free()
on the sessions avavilable to the application. In case of error, the session is invalidated by nc_session_close()
, but the structure is still available to the application which is supposed to call nc_session_free()
to finnish the cleanup. Valgrind's stack at the moment of crash would be very helpful, but I understand that it is challenging to reproduce the issue.
yes I'm in the master branch, sorry I've added some debug logs and forgot about them when writting the issue. session.c line 1418 is in fact 1423 : free(session->stats);
valgrind is not possible right now because I'm only able to reproduce this in target (ppc) until now I was not able to reproduce it in x86.
regarding setting the variable to NULL sure, I'm doing it. but how can I check if I can call the nc_session_free method? is it possible that the lib is closing/deleting the session and I'm trying to delete a session that doesn't exist? from the debug I've done, when this happens this if: if (strcmp(litem->session_id, session->session_id) == 0) is never true, so when this one is: if (litem->offset_next == 0) we get the crash here: free(session->stats);
Actually you can call nc_session_free()
anytime you want. When an error is detected by some internal function, the session's transport session and some other stuff is destroyed and session's status is set to NC_SESSION_STATUS_CLOSED (you can check this via nc_session_get_status()
). From this time, libnetconf will not able to use the session for anything meaningfull (getter functions should work, but you won't be able to send anything via the session, it is closed).
I have one idea about the issue, just please answer me the following questions:
nc_init()
? I mean, what flags
do you pass to the function/proc/
filesystem?1- I call nc_init with (NC_INIT_ALL & ~NC_INIT_NACM) | NC_INIT_MULTILAYER 2- Windriver Linux 32 bits, yes we have a /proc/ filesystem
hmm, then it is weird that the session is not found in the session_list
(session.c:1399-1402). There are only 2 situations when the session is removed from the list:
nc_session_free()
, but that can be called just once on the sessionnc_session_alive_monitor()
, but it should happen only if the process accepted the session is not alive (checked by PID and records in /proc/
- session.c:497)Calling free(session->stats)
when all items in session_list
were checked and the session wasn't found there should be correct. Without session monitoring, the session structure has some allocated memory in stats, so it is supposed to be free. But in your case you shouldn't get there. According to nc_init()
, the session monitoring is enabled, so the stats points to the shared memory block and the session should be always in the session_list
- as I mentioned, the only other way to remove the session from the list is that the process that accepted the session is no more running and it was detected by nc_session_alive_monitor()
(which runs whenever you get status data by <get>
operation)..
or could it be that during the delete/create of other sessions the prev and next offset are not correctly linked so the session that I'm trying to delete is not found while iterating in the list but in fact its there (just not linked to the others)?
yes, bug in session_list
manipulation would explain that
after adding some logs I found something strange. it seems that this some session are being closed by nc_session_monitor_alive_check but I really don't understand why, since the process that launches all the sessions is the same.
Do you know which part of the nc_session_monitor_alive_check()
causes it? I mean, which of those 3 nc_session_monitor_remove()
calls is used.
unfortunately not, I add prints in nc_session_monitor_remove and in nc_session_free only.
I notice that nc_session_monitor_remove is called not from the nc_session_free and when session free is being called (afterwards) I get negative session_list->count
also this is only happening in ppc.
its the last one (pfd == NULL)
this happens everytime I do a get all on target (ppc) and the number of sessions is decreased.
can you provide the output of the following command on the target system? The PID
is PID number of your server process accepting the NETCONF sessions.
sudo ls -l /proc/PID/fd
There should be a symlink to libnetconf_sessions.bin
file which should correspond to the aux
string in the nc_session_monitor_alive_check()
function.
If not, then, during the libnetconf init, there should be some error message from libnetconf about "sessions monitoring file" or the "session list".
I think I got the problem:
lrwx------ 1 root root 64 May 3 00:41 20 -> /var/volatile/tmp/netconf/libnetconf_sessions.bin lrwx------ 1 root root 64 May 3 00:41 21 -> /var/volatile/tmp/netconf/libnetconf_sessions.bin
but the files aux is /tmp/netconf/libnetconf_sessions.bin and the file exists in /tmp/netconf/libnetconf_sessions.bin (no error in init)
if I do a "readlink -f /tmp/mf_communication_log.log" returns: /var/volatile/tmp/mf_communication_log.log
/tmp in my system is a link to volatile/tmp
I think in the lib code you should have a readlink of the aux to get the canonical file name
or I can change it to /var/volatile/tmp/netconf/libnetconf_sessions.bin
I'm going to fix it
thanks a lot :)
Hi,
I have sometimes a crash, when some network problems occur and I have several sessions connected and monitored at the same time. this is not so easy to reproduce, so also not easy to analyse. Here is what I already found until now and my suspicions:
with this evidences it seems that the lib detects the failure before my server and deletes the session. then I try to delete the same session and I get this crash (at least this is my theory for now).
In the beginning of nc_session_free we can see the check if (session == NULL) but session is never set to NULL inside the lib (or at least I didn't find where).
The struct nc_session* that I give to the nc_session_free is obtained when the session is establish with the method nc_session_accept_inout
Since the session is never set to NULL inside the lib, should I check some flag inside nc_session struct before I call the nc_session_free?
thanks Best Regards