Open n-eiling opened 6 months ago
I am not fully grasping it :/ I think we already have a `STOPPING` state to signal an orderly shutdown.
It's true that the state member is currently not really thread-safe. That is why we get the warnings from Valgrind. We should probably replace it with a `std::atomic`?
But apart from that, I can't yet see where there is a race.
I don't think `std::atomic` fixes it, because the state is checked only at the beginning of `read()` and `write()`. I think some kind of handshake between the threads would be better, i.e., `SuperNode::stop` signals all threads to stop and blocks until they are joined.
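A minimal sketch of the "signal, then block until joined" handshake proposed above. The `Workers` class and its members are hypothetical stand-ins for the Path threads, not the VILLASnode API, and the loop body only counts iterations instead of calling `read()`/`write()`.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for a group of Path threads (not VILLASnode code).
class Workers {
public:
  explicit Workers(int n) {
    for (int i = 0; i < n; i++)
      threads_.emplace_back([this] { run(); });
  }

  ~Workers() { stop(); }

  // Signal every thread to stop, then block until all of them are joined.
  // After this returns, no worker can touch shared objects any more.
  void stop() {
    stop_requested_.store(true, std::memory_order_release);
    for (auto &t : threads_)
      if (t.joinable())
        t.join();
  }

  int iterations() const { return iterations_.load(); }

private:
  void run() {
    // The flag is checked once per iteration, so a stop request takes
    // effect only after the current iteration has finished.
    while (!stop_requested_.load(std::memory_order_acquire)) {
      iterations_.fetch_add(1, std::memory_order_relaxed);
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  std::atomic<bool> stop_requested_{false};
  std::atomic<int> iterations_{0};
  std::vector<std::thread> threads_;
};
```

In this sketch the atomic flag alone is not what makes teardown safe. The `join()` is what guarantees that no thread is still inside its loop body when `stop()` returns.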
Ctrl+C sometimes leads to segfaults or other issues (vfio containers being used after they have been destroyed). I think the error here is only part of the problem, because some objects also seem to get destroyed before the Path threads finish.
This is not Valgrind, btw. It's libtsan.
Maybe we had better discuss this over a coffee in person?
My rationale when writing this code was:
`state = STOPPING`
`state = STOPPED`
`state == STOPPED`
I am not sure if it's properly implemented. But that was at least my initial idea :D
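The three state steps above can be read as a request/acknowledge handshake. The sketch below is one possible interpretation, not the actual implementation: which thread performs which step is an assumption, and `Worker`, `State::STARTED`, `start()` and `stop()` are hypothetical names.

```cpp
#include <atomic>
#include <thread>

enum class State { STARTED, STOPPING, STOPPED };

// Hypothetical worker illustrating the three steps (not VILLASnode code).
struct Worker {
  std::atomic<State> state{State::STARTED};
  std::thread thread;

  void start() {
    thread = std::thread([this] {
      // Assumed step 2: the worker notices STOPPING and acknowledges it
      // by writing STOPPED.
      while (state.load(std::memory_order_acquire) != State::STOPPING)
        std::this_thread::yield();
      state.store(State::STOPPED, std::memory_order_release);
    });
  }

  void stop() {
    // Assumed step 1: the controlling thread requests the shutdown.
    state.store(State::STOPPING, std::memory_order_release);
    // Assumed step 3: wait until the worker reports state == STOPPED.
    while (state.load(std::memory_order_acquire) != State::STOPPED)
      std::this_thread::yield();
    thread.join();
  }
};
```

A single state variable can record only one acknowledgement. If several threads shared it, the first one to write `STOPPED` would satisfy the wait while the others might still be running.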
OK, this makes sense. But what if there are multiple paths using a single node object? That is allowed, isn't it?
I am not entirely sure if it is correctly implemented. Something leads to these segmentation faults or use-after-frees.
> But what if there are multiple paths using a single node object? That is allowed, isn't it?
It can happen. E.g. one path is reading from a node object and another one is writing.
> I am not entirely sure if it is correctly implemented. Something leads to these segmentation faults or use-after-frees.
I agree this needs a more detailed look. Maybe @PJungkamp wants to chime in :D
We should also check the sequencing of terminating paths before nodes. I think currently we are terminating paths before nodes?
The idea would be that we first try to terminate all running threads, but leave the nodes still active. Once we are back to the main-thread only, we can terminate the nodes sequentially without any risk of races.
I haven't tried Node types other than fpga. In fpga, `read()` is blocking. I think this is not how most node types are implemented, but it is the lowest-latency way. Maybe this is part of the issue.
Can you guarantee that `read()` is periodically unblocked, so the path can check its state regularly? If not, the thread can become stuck :/
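One common way to make a blocking read wake up periodically is to wait with a timeout before reading. The sketch below uses POSIX `poll()` on a plain file descriptor. The function name `read_until_stopped` and the `stop` flag are hypothetical, and whether a timeout is acceptable for a low-latency node such as fpga is an open question in this thread.

```cpp
#include <atomic>
#include <poll.h>
#include <unistd.h>

// Reads from fd until `stop` is set, end-of-file is reached, or an error
// occurs. The loop wakes up at least every timeout_ms milliseconds to
// re-check the stop flag. Returns the total number of bytes read.
long read_until_stopped(int fd, const std::atomic<bool> &stop, int timeout_ms) {
  long total = 0;
  char buf[256];

  while (!stop.load(std::memory_order_acquire)) {
    pollfd pfd{fd, POLLIN, 0};
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
      break; // poll() failed; a real loop might retry on EINTR
    if (ready == 0)
      continue; // timeout: nothing to read, go back and re-check `stop`

    ssize_t n = read(fd, buf, sizeof(buf));
    if (n <= 0)
      break; // end-of-file or read error
    total += n;
  }
  return total;
}
```

The timeout bounds how long a stop request can go unnoticed. The price is an extra syscall per read.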
For other node types which are blocked by a syscall (e.g. a Socket `read()`), we send a cancellation signal to the thread via `pthread_cancel()` to unblock them.
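A minimal sketch of unblocking a thread with `pthread_cancel()`, assuming default (deferred) cancellation and a thread blocked in `read()` on a pipe that never receives data. The helper names `blocking_reader` and `cancel_blocked_reader` are hypothetical, and the pipe stands in for a socket.

```cpp
#include <pthread.h>
#include <unistd.h>

// Thread body: blocks in read() on a descriptor that never becomes readable.
static void *blocking_reader(void *arg) {
  int fd = *static_cast<int *>(arg);
  char buf[16];
  // read() is a POSIX cancellation point, so a pending or incoming
  // cancellation request terminates the thread here.
  ssize_t n = read(fd, buf, sizeof(buf));
  (void)n;
  return nullptr;
}

// Starts a blocked reader, cancels it, and joins it.
// Returns true if the thread terminated because of the cancellation.
bool cancel_blocked_reader() {
  int fds[2];
  if (pipe(fds) != 0)
    return false;

  pthread_t tid;
  if (pthread_create(&tid, nullptr, blocking_reader, &fds[0]) != 0) {
    close(fds[0]);
    close(fds[1]);
    return false;
  }

  pthread_cancel(tid);        // request cancellation of the blocked thread
  void *result = nullptr;
  pthread_join(tid, &result); // block until the thread has terminated

  close(fds[0]);
  close(fds[1]);
  return result == PTHREAD_CANCELED;
}
```

Cancellation only takes effect at cancellation points, so a thread spinning in user space without making such a call would not be stopped this way.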
When a node is stopped (e.g., on Ctrl+C), `SuperNode::stop` writes to `Node::state`, which is not synchronized. The `Path` threads concurrently access `Node::state` and assume the Node remains valid for the duration of a `write()` or `read()` call. We shouldn't use a mutex in `read()` and `write()` though, because this will cause a lot of latency in the performance-critical path.
One solution could be to implement a signal handler for the Path threads that stops them before `SuperNode::stop()` accesses the Node.
ThreadSanitizer output: