CESNET / netopeer2

NETCONF toolset
BSD 3-Clause "New" or "Revised" License
299 stars 188 forks source link

SIGSEGV on cancel commit #1372

Closed pfeige closed 1 year ago

pfeige commented 1 year ago

Hi Michal,

when the following command sequence is executed:

connect --login user
edit-config --target candidate --config
commit --confirmed --persist 42
disconnect
connect --login user
cancel-commit --persist 42

the netopeer2-server crashes with SIGSEGV:

Core was generated by `/usr/sbin/netopeer2-server -d -v 2 -t 180'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055ba81a0e82d in np_acquire_user_sess (ncs=0x0, user_sess=0x7fffecf41410) at /tmp/repo/Netopeer2/src/common.c:197
197     ATOMIC_INC_RELAXED(us->ref_count);
(gdb) bt
#0  0x000055ba81a0e82d in np_acquire_user_sess (ncs=0x0, user_sess=0x7fffecf41410) at /tmp/repo/Netopeer2/src/common.c:197
#1  0x000055ba81a19d68 in ncc_changes_rollback_cb (sev=...) at /tmp/repo/Netopeer2/src/netconf_confirmed_commit.c:418
#2  0x000055ba81a1a901 in ncc_try_restore () at /tmp/repo/Netopeer2/src/netconf_confirmed_commit.c:692
#3  0x000055ba81a0aff7 in server_init () at /tmp/repo/Netopeer2/src/main.c:631

The problem seems to be the use of the old nc_session stored in commit_ctx which is not valid any more after disconnect. Can you have a look at this?

BR, Peter

michalvasko commented 1 year ago

Yep, seems I have not accounted for this situation at all. Fixing it should not be difficult but I am not sure what should happen in the following scenario:

connect --login user
lock --target running
edit-config --target candidate --config
commit --confirmed --persist 42

and then another client connects:

connect --login user
cancel-commit --persist 42

But I think the only correct solution would be to use the second NC session for the rollback meaning it is going to fail because running is locked by the previous session. Do you agree?

pfeige commented 1 year ago

Yes, I agree. The lock is held by a different session and so the cancel-commit should be rejected.

michalvasko commented 1 year ago

So your use-case should now work, test included. Also, mine should as well but just one specific situation is handled (NC sess 1 locks running and NC sess 2 issues cancel-commit). There are lots of other similar situations when the rollback is not possible to perform, the worst is probably when some sysrepo application locks running. Then, it has nothing to do with NETCONF but the datastore is not accessible and hence the rollback (issued for any reason) impossible. I do not really see how to improve this except for, maybe, waiting for some timeout in hope that the lock is released. Currently, no timeout is used, I believe.

pfeige commented 1 year ago

Yes it works. Thanks! You are right, there are a lot of difficult situations where a rollback might not be possible. From my point of view it is important that the rollback on timeout works when only one session is involved regardless whether the datastore is locked by that session or not.