Data Race when accessing Node::state during stopping of nodes

n-eiling commented 6 months ago

When a node is stopped (e.g., on ctrl+C), SuperNode::stop writes to Node::state, which is not synchronized. The Path threads concurrently access Node::state and assume the Node remains valid for the duration of a write() or read() call. We shouldn't use a mutex in read() and write() though, because this will cause a lot of latency in the performance critical path.

One solution could be to implement a signal handler for the Path threads that stops them before SuperNode::stop() accesses the Node.

ThreadSanitizer output:

==================
WARNING: ThreadSanitizer: data race (pid=4962)
  Write of size 4 at 0x7b5800001f40 by main thread:
    #0 villas::node::Node::setState(State) ../include/villas/node.hpp:253 (libvillas.so.1+0x480018)
    #1 villas::node::Node::stop() ../lib/node.cpp:225 (libvillas.so.1+0x47dd13)
    #2 villas::node::SuperNode::stopNodes() ../lib/super_node.cpp:361 (libvillas.so.1+0x4c48bd)
    #3 villas::node::SuperNode::stop() ../lib/super_node.cpp:397 (libvillas.so.1+0x4c4dd0)
    #4 villas::node::tools::Node::daemon() ../src/villas-node.cpp:206 (villas-node+0x434e0c)
    #5 villas::node::tools::Node::main() ../src/villas-node.cpp:185 (villas-node+0x434cbb)
    #6 villas::Tool::run() ../common/lib/tool.cpp:55 (libvillas-common.so.1+0x183058)
    #7 main ../src/villas-node.cpp:219 (villas-node+0x432857)

  Previous read of size 4 at 0x7b5800001f40 by thread T3:
    #0 villas::node::Node::read(villas::node::Sample**, unsigned int) ../lib/node.cpp:260 (libvillas.so.1+0x47dede)
    #1 villas::node::PathSource::read(int) ../lib/path_source.cpp:58 (libvillas.so.1+0x490721)
    #2 villas::node::Path::runSingle() ../lib/path.cpp:62 (libvillas.so.1+0x495fcf)
    #3 villas::node::Path::runWrapper(void*) ../lib/path.cpp:44 (libvillas.so.1+0x495f49)

  Location is heap block of size 648 at 0x7b5800001e00 allocated by main thread:
    #0 operator new(unsigned long) <null> (libtsan.so.0+0x900b2)
    #1 villas::node::FpgaNodeFactory::make(unsigned char const (&) [16], std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ../include/villas/nodes/fpga.hpp:70 (libvillas.so.1+0x57ebde)
    #2 villas::node::NodeFactory::make(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned char const (&) [16], std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ../lib/node.cpp:465 (libvillas.so.1+0x47f6d0)
    #3 villas::node::SuperNode::parse(json_t*) ../lib/super_node.cpp:146 (libvillas.so.1+0x4c2cbf)
    #4 villas::node::SuperNode::parse(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ../lib/super_node.cpp:66 (libvillas.so.1+0x4c20f7)
    #5 villas::node::tools::Node::daemon() ../src/villas-node.cpp:197 (villas-node+0x434d9d)
    #6 villas::node::tools::Node::main() ../src/villas-node.cpp:185 (villas-node+0x434cbb)
    #7 villas::Tool::run() ../common/lib/tool.cpp:55 (libvillas-common.so.1+0x183058)
    #8 main ../src/villas-node.cpp:219 (villas-node+0x432857)

  Thread T3 (tid=4982, running) created by main thread at:
    #0 pthread_create <null> (libtsan.so.0+0x61748)
    #1 villas::node::Path::start() ../lib/path.cpp:554 (libvillas.so.1+0x49a863)
    #2 villas::node::SuperNode::startPaths() ../lib/super_node.cpp:273 (libvillas.so.1+0x4c4029)
    #3 villas::node::SuperNode::start() ../lib/super_node.cpp:338 (libvillas.so.1+0x4c465d)
    #4 villas::node::tools::Node::daemon() ../src/villas-node.cpp:204 (villas-node+0x434dec)
    #5 villas::node::tools::Node::main() ../src/villas-node.cpp:185 (villas-node+0x434cbb)
    #6 villas::Tool::run() ../common/lib/tool.cpp:55 (libvillas-common.so.1+0x183058)
    #7 main ../src/villas-node.cpp:219 (villas-node+0x432857)

SUMMARY: ThreadSanitizer: data race ../include/villas/node.hpp:253 in villas::node::Node::setState(State)
==================
11:22:02 info path:0           Stopping path: fpga_0 => fpga_0
==================
WARNING: ThreadSanitizer: data race (pid=4962)
  Write of size 4 at 0x7b5c00000000 by main thread:
    #0 villas::node::Path::stop() ../lib/path.cpp:571 (libvillas.so.1+0x49aac8)
    #1 villas::node::SuperNode::stopPaths() ../lib/super_node.cpp:351 (libvillas.so.1+0x4c479e)
    #2 villas::node::SuperNode::stop() ../lib/super_node.cpp:398 (libvillas.so.1+0x4c4ddc)
    #3 villas::node::tools::Node::daemon() ../src/villas-node.cpp:206 (villas-node+0x434e0c)
    #4 villas::node::tools::Node::main() ../src/villas-node.cpp:185 (villas-node+0x434cbb)
    #5 villas::Tool::run() ../common/lib/tool.cpp:55 (libvillas-common.so.1+0x183058)
    #6 main ../src/villas-node.cpp:219 (villas-node+0x432857)

  Previous read of size 4 at 0x7b5c00000000 by thread T3:
    #0 villas::node::Path::runSingle() ../lib/path.cpp:59 (libvillas.so.1+0x49607a)
    #1 villas::node::Path::runWrapper(void*) ../lib/path.cpp:44 (libvillas.so.1+0x495f49)

  Location is heap block of size 776 at 0x7b5c00000000 allocated by main thread:
    #0 operator new(unsigned long) <null> (libtsan.so.0+0x900b2)
    #1 villas::node::SuperNode::parse(json_t*) ../lib/super_node.cpp:173 (libvillas.so.1+0x4c3049)
    #2 villas::node::SuperNode::parse(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ../lib/super_node.cpp:66 (libvillas.so.1+0x4c20f7)
    #3 villas::node::tools::Node::daemon() ../src/villas-node.cpp:197 (villas-node+0x434d9d)
    #4 villas::node::tools::Node::main() ../src/villas-node.cpp:185 (villas-node+0x434cbb)
    #5 villas::Tool::run() ../common/lib/tool.cpp:55 (libvillas-common.so.1+0x183058)
    #6 main ../src/villas-node.cpp:219 (villas-node+0x432857)

  Thread T3 (tid=4982, running) created by main thread at:
    #0 pthread_create <null> (libtsan.so.0+0x61748)
    #1 villas::node::Path::start() ../lib/path.cpp:554 (libvillas.so.1+0x49a863)
    #2 villas::node::SuperNode::startPaths() ../lib/super_node.cpp:273 (libvillas.so.1+0x4c4029)
    #3 villas::node::SuperNode::start() ../lib/super_node.cpp:338 (libvillas.so.1+0x4c465d)
    #4 villas::node::tools::Node::daemon() ../src/villas-node.cpp:204 (villas-node+0x434dec)
    #5 villas::node::tools::Node::main() ../src/villas-node.cpp:185 (villas-node+0x434cbb)
    #6 villas::Tool::run() ../common/lib/tool.cpp:55 (libvillas-common.so.1+0x183058)
    #7 main ../src/villas-node.cpp:219 (villas-node+0x432857)

SUMMARY: ThreadSanitizer: data race ../lib/path.cpp:571 in villas::node::Path::stop()
==================

stv0g commented 6 months ago

I am not fully grasping it :/ I think we already have a STOPPING state to signal an orderly shutdown.

Its true that the state member is currently not really thread-safe. That is why we get the warnings from Valgrind. We should probably replace it with an std::atomic?

But apart from that, I can yet see where there is a race.

n-eiling commented 6 months ago

I don't think std::atomic fixes it, because the state is checked only in the beginning of the read() and write(). I think some kind of handshake between the threads would be better, i.e., SuperNode::stop signals all threads to stop, and blocks until they are joined.

Ctrl+C sometime leads to segfaults or other issues (vfio container being used after they have been destroyed). I think the error here is only part of the problem, because some objects seem to also get destroyed before the Path threads finish.

n-eiling commented 6 months ago

This is not valgrid btw. Its libtsan.

stv0g commented 6 months ago

Maybe we better discuss with with a coffee in person?

My rationale when writing this code was:

Main thread signals all nodes/paths to shut down by setting state = STOPPING.
All nodes and path threads detect this during their next read/write calls and exit their main loop.
The threads should also set state = STOPPED.
The main threads waits until all threads have been stopped by waiting until state == STOPPED.

I am not sure if its properly implemented. But that was at least my initial idea :D

n-eiling commented 6 months ago

ok this makes sense. But what if there are multiple paths using a single node object? That is allowed, isn't it?

I am not entirely sure if it is correctly implemented. Something leads to these segmentation faults or use after frees.

stv0g commented 6 months ago

But what if there are multiple paths using a single node object? That is allowed, isn't it?

It can happen. E.g. one path is reading from a node object and another one is writing.

I am not entirely sure if it is correctly implemented. Something leads to these segmentation faults or use after frees.

I agree this needs some more detailed look.. May @PJungkamp wants to chime in :D?

We should also check the sequencing of terminating paths before nodes. I think currently we are terminating paths before nodes?

stv0g commented 6 months ago

The idea would be that we first try to terminate all running threads, but leave the nodes still active. Once we are back to the main-thread only, we can terminate the nodes sequentially without any risk of races.

n-eiling commented 6 months ago

I haven't tried other Node types than fpga. In fpga read() is blocking. I think this is not how most node types are implemented, but this is the lowest latency way. Maybe this is (part) of the issue.

stv0g commented 6 months ago

Can you guarantee that read() is periodically unblocked? So the path can check its state regularly? If not, the thread can become stuck :/

For other node-types which are blocked by a syscall (e.g. a Socket read()) we send a cancelation signal to the thread via pthread_cancel() to unblock them.

VILLASframework / node

Data Race when accessing Node::state during stopping of nodes #749