Open AlliBalliBaba opened 3 days ago
74 bool wtr_watcher_close(void* watcher)
75 {
76 if (! watcher) return false;
77 if (! ((wtr::watcher::watch*)watcher)->close()) return false;
78 delete (wtr::watcher::watch*)watcher;
79 return true;
80 }
Either the watcher is null or the underlying implementation couldn't close the OS's resources, which is problematic.
Do we have a minimally reproducible example?
I think https://github.com/e-dant/watcher/commit/343e6b5975036d43a3cadc2eaf813584c2b58f3c might fix this, but I'm not sure yet.
There might be some double-close logic (for any reason at all -- implicit destructors if the api is used with "smart pointers" and the user also calls close(), unintentional API use, anything at all -- which that pending check was meant to guard against), but there might have also been a case where the semaphore skips right through the pending state into the released state, which that commit should also address.
LMK if this fixes the issue for you.
This didn't really seem to fix the issue. I'll try to create a minimal reproducer when I have time. It's not that critical for us since we usually stop the whole process after stopping the watcher.
I think this is worth investigation.
It's a problem if we're failing to close resources.
I think a minimal reproducer would be enlightening -- Particularly, I'm curious if there's a bad interaction in the Go<->cgo<->C language chain, or if this is a library bug.
I created a minimal reproducer here: https://github.com/AlliBalliBaba/watcher-debug
With docker you can just clone the repo and do something like:
docker build -t debug-watcher .
docker run debug-watcher
The code that opens and closes the watcher is in watcher-test.c
I created a minimal reproducer here: https://github.com/AlliBalliBaba/watcher-debug
The watcher here is being closed before it has a chance to finish opening all the resources it needs.
The thread running the watcher doesn't usually start in this example until control has reached the close() call.
We could probably handle this better, but we're not leaking anything here.
With some extra logging, we can trace what's happening:
close/1/1727620570393 closing watcher
close/1/1727620570393 releasing semaphore
close/1/1727620570393 waiting for watcher
open/1/1727620570393 watching /go/src/app
open/1/1727620570393 e/self/live@/go/src/app
event:
path name: e/self/live@/go/src/app
path type: 4
open/1/1727620570393 done watching /go/src/app
open/1/1727620570393 e/self/die@/go/src/app
event:
path name: e/self/die@/go/src/app
path type: 4
open/1/1727620570393 watcher closed, ok? no
failure
Format is routine/nth call to it/time of message (etc)
Note that the event handler here receives a "special" path and path type:
event:
path name: e/self/live@/go/src/app
path type: 4
The e/
prefix here means "error". (s/
would mean "success".)
This communication mechanism is detailed here in the The Library/readme
The watcher type is special. Events with this type will include messages from the watcher. You may recieve error messages or important status updates, such as when it first becomes alive and when it dies. The last event will always be a destroy event from the watcher.
However, I'm not sure this fully addresses the issue.
While, in this example, the issue could be solved by waiting for that self/live message (with the special "watcher" path type), I believe you were also seeing errors on close in a more long-running context.
I think we should explore that more.
Thoughts?
Ah yeah that makes sense. I was seeing the error sometimes in tests, probably because the tests would finish before the watcher could start. I tested it again and with a bigger grace period before shutdown and it works properly 👍.
I guess I'm fine with the current behavior then. It could maybe be documented somewhere that wtr_watcher_close
will return false (but still stop the watcher) if the watcher hasn't sent the e/self/live@
event yet
When I'm running the following code with
watcher-c
;success
always isfalse
when closing the watcher, even if the watcher started successfully. From the the testfile I was under the impression that it should returntrue
if stopping is successful.Is thig a bug or is the behaviour expected?