RobotWebTools / ros2-web-bridge

Bridging your browser to the ROS 2.0
Apache License 2.0
205 stars 68 forks source link

free(): invalid next size (fast) #142

Open cr01927 opened 4 years ago

cr01927 commented 4 years ago

I'm getting a free(): invalid next size (fast) error in my implementation. I'm using ros2-web-bridge to interface a web page with my ROS2 backend, where the web page subscribes to about 10 topics. The fastest topics are published at 10Hz. None of the msgs are particularly large (no images or point clouds or anything), the biggest being about 6 topics published at 10Hz with 60-70 geometry_msgs/Points.

The full error msg isn't very insightful - if you have recommendations on more verbose debugging, I'll give it a shot.

...
2020-04-14T13:35:27.626Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.628Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.709Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.719Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.719Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.747Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.766Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.781Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.781Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.806Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.919Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.920Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.988Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:27.996Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:28.019Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:28.070Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:28.111Z ros2-web-bridge:Bridge Send message to subscription.
2020-04-14T13:35:28.115Z ros2-web-bridge:Bridge Send message to subscription.
free(): invalid next size (fast)
Aborted
minggangw commented 4 years ago

Is there a core dump file generated? would you please paste the call stack here, so we can identify the root cause of it, thanks!

cr01927 commented 4 years ago

How's this?

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fdc77d5e801 in __GI_abort () at abort.c:79
#2  0x00007fdc77da7897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fdc77ed4b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007fdc77dae90a in malloc_printerr (str=str@entry=0x7fdc77ed6800 "free(): invalid next size (fast)") at malloc.c:5350
#4  0x00007fdc77db5f60 in _int_free (have_lock=0, p=0x3da6480, av=0x7fdc78109c40 <main_arena>) at malloc.c:4213
#5  __GI___libc_free (mem=0x3da6490) at malloc.c:3124
#6  0x00000000009057b6 in (anonymous namespace)::v8impl::Finalizer::FinalizeBufferCallback(char*, void*) ()
#7  0x00000000009133dd in node::Buffer::(anonymous namespace)::CallbackInfo::WeakCallback(v8::WeakCallbackInfo<node::Buffer::(anonymous namespace)::CallbackInfo> const&) ()
#8  0x0000000000ec39a3 in v8::internal::GlobalHandles::DispatchPendingPhantomCallbacks(bool) ()
#9  0x0000000000ec3bca in v8::internal::GlobalHandles::PostGarbageCollectionProcessing(v8::internal::GarbageCollector, v8::GCCallbackFlags) ()
#10 0x0000000000f007ca in v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) ()
#11 0x0000000000f015f4 in v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) ()
#12 0x0000000000f01dca in v8::internal::Heap::FinalizeIncrementalMarkingIfComplete(v8::internal::GarbageCollectionReason) ()
#13 0x0000000000f054a7 in v8::internal::IncrementalMarkingJob::Task::RunInternal() ()
#14 0x0000000000bf6526 in v8::internal::CancelableTask::Run() ()
#15 0x0000000000981ed4 in node::PerIsolatePlatformData::RunForegroundTask(std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) ()
#16 0x0000000000983602 in node::PerIsolatePlatformData::FlushForegroundTasksInternal() ()
#17 0x0000000000a6f65f in uv__async_io (loop=0x264e6e0 <default_loop_struct>, w=<optimized out>, events=<optimized out>) at ../deps/uv/src/unix/async.c:118
#18 0x0000000000a80f08 in uv__io_poll (loop=loop@entry=0x264e6e0 <default_loop_struct>, timeout=106283) at ../deps/uv/src/unix/linux-core.c:379
#19 0x0000000000a6ff9b in uv_run (loop=0x264e6e0 <default_loop_struct>, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:364
#20 0x00000000009045d5 in node::Start(v8::Isolate*, node::IsolateData*, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&) ()
#21 0x000000000090285f in node::Start(int, char**) ()
#22 0x00007fdc77d3fb97 in __libc_start_main (main=0x8bbd60 <main>, argc=4, argv=0x7ffc2e0516f8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc2e0516e8)
    at ../csu/libc-start.c:310
#23 0x00000000008bbe95 in _start ()
minggangw commented 4 years ago

Thanks for the detailed information and the crash happened when GC executes, which I think the reason is complicated. Is it reproducible stably (as there are many dependencies, we cannot find the direct connection from the backtrace)?

cr01927 commented 4 years ago

It seems to be fairly random. I could only reproduce it once yesterday in about 6 hours of testing, while this morning it happened on the first run of the software.

cr01927 commented 4 years ago

Happened again, with a different backtrace this time...

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `node /src/ros2-web-bridge/bin/rosbridge.js -p 9090'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fee9629c740 (LWP 320))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fee94d67801 in __GI_abort () at abort.c:79
#2  0x00007fee94db0897 in __libc_message (action=action@entry=do_abort, 
    fmt=fmt@entry=0x7fee94eddb9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007fee94db790a in malloc_printerr (
    str=str@entry=0x7fee94edf800 "free(): invalid next size (fast)") at malloc.c:5350
#4  0x00007fee94dbef60 in _int_free (have_lock=0, p=0x39bc8d0, av=0x7fee95112c40 <main_arena>)
    at malloc.c:4213
#5  __GI___libc_free (mem=0x39bc8e0) at malloc.c:3124
#6  0x00007fee942fe292 in rclnodejs::Executor::ExecuteReadyHandles() ()
   from /src/ros2-web-bridge/node_modules/rclnodejs/build/Release/rclnodejs.node
#7  0x0000000000a6f65f in uv__async_io (loop=0x264e6e0 <default_loop_struct>, w=<optimized out>, 
    events=<optimized out>) at ../deps/uv/src/unix/async.c:118
#8  0x0000000000a80f08 in uv__io_poll (loop=loop@entry=0x264e6e0 <default_loop_struct>, timeout=7128)
    at ../deps/uv/src/unix/linux-core.c:379
#9  0x0000000000a6ff9b in uv_run (loop=0x264e6e0 <default_loop_struct>, mode=UV_RUN_DEFAULT)
    at ../deps/uv/src/unix/core.c:364
#10 0x00000000009045d5 in node::Start(v8::Isolate*, node::IsolateData*, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&) ()
#11 0x000000000090285f in node::Start(int, char**) ()
#12 0x00007fee94d48b97 in __libc_start_main (main=0x8bbd60 <main>, argc=4, argv=0x7ffd427a57e8, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd427a57d8)
    at ../csu/libc-start.c:310
#13 0x00000000008bbe95 in _start ()
cr01927 commented 4 years ago

And another. Not sure why I'm breaking this in multiple places, there must be something else going on

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `node /src/ros2-web-bridge/bin/rosbridge.js -p 9090'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f7fc798a740 (LWP 320))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f7fc6455801 in __GI_abort () at abort.c:79
#2  0x00007f7fc649e897 in __libc_message (action=action@entry=do_abort, 
    fmt=fmt@entry=0x7f7fc65cbb9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f7fc64a590a in malloc_printerr (str=str@entry=0x7f7fc65c9c9d "corrupted size vs. prev_size")
    at malloc.c:5350
#4  0x00007f7fc64a5b0c in malloc_consolidate (av=av@entry=0x7f7fc6800c40 <main_arena>) at malloc.c:4456
#5  0x00007f7fc64a97d8 in _int_malloc (av=av@entry=0x7f7fc6800c40 <main_arena>, bytes=bytes@entry=1584)
    at malloc.c:3703
#6  0x00007f7fc64af0b1 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3436
#7  0x00007f7fa13be6c5 in geometry_msgs__msg__Point__Sequence__init ()
   from /opt/ros/eloquent/lib/libgeometry_msgs__rosidl_generator_c.so
#8  0x00007f7fa097e7c2 in _<Proprietary msg name>__cdr_deserialize ()
   from /build/install/<proprietary pkg>/lib/<proprietary pkg>__rosidl_typesupport_fastrtps_c.so
#9  0x00007f7fbc0d411c in ?? () from /opt/ros/eloquent/lib/librmw_fastrtps_cpp.so
#10 0x00007f7fae9e5882 in rmw_fastrtps_shared_cpp::TypeSupport::deserialize(eprosima::fastrtps::rtps::SerializedPayload_t*, void*) () from /opt/ros/eloquent/lib/librmw_fastrtps_shared_cpp.so
#11 0x00007f7fae3fa369 in eprosima::fastrtps::SubscriberHistory::deserialize_change(eprosima::fastrtps::rtps::CacheChange_t*, unsigned int, void*, eprosima::fastrtps::SampleInfo_t*) ()
   from /opt/ros/eloquent/lib/libfastrtps.so.1
#12 0x00007f7fae400199 in eprosima::fastrtps::SubscriberHistory::takeNextData(void*, eprosima::fastrtps::SampleInfo_t*, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >&) () from /opt/ros/eloquent/lib/libfastrtps.so.1
#13 0x00007f7fae3f305c in eprosima::fastrtps::SubscriberImpl::takeNextData(void*, eprosima::fastrtps::SampleInfo_t*) () from /opt/ros/eloquent/lib/libfastrtps.so.1
#14 0x00007f7fae9e2b63 in rmw_fastrtps_shared_cpp::_take(char const*, rmw_subscription_t const*, void*, bool*, rmw_message_info_t*, rmw_subscription_allocation_t*) ()
   from /opt/ros/eloquent/lib/librmw_fastrtps_shared_cpp.so
minggangw commented 4 years ago

This looks weird that it doesn't happen when the GC executes only, but also some other cases (within libraries of rclnodejs or librmw_fastrtps_shared_cpp.so).

Which release of rclnodejs and ROS2 do you use? (latest ones are ROS 2 Eloquent Elusor - Patch Release 1 and 0.13.0), and the node.js version?

If the scale of your app is small, you could convert it into JS and run it by rclnodejs directly to observe the result. (As the rclnodejs is the backend of ros2-web-bridge, I think it is highly possible that the C code of rclnodejs causes this issue).

cr01927 commented 4 years ago

Versions: ROS2: Eloquent, in the official ros:eloquent-ros-core image on DockerHub. I was on the February build, just re-pulled and updated to the March build and will test again today. Node: v10.18.1 rclnodejs: 0.13.0 (pulled in when I npm install the master branch of this package)

The app's pretty large, but I could try a minimal rclnodejs example that consumes the higher-throughput topics that seem to have aggravated the issue.

minggangw commented 4 years ago

I see, so I think you are running ROS2 with the docker image. We ever tested the stability of rclnodejs e.g. pub/sub some topics for several days, but not using the docker image. I agree that you could make a simple case to locate the potential causes, thanks for your support!

minggangw commented 4 years ago

An update: to mimic your case, I run the an example of rclnodejs which publishes a topic of type sensor_msgs/msg/JointState

Publish 161396 messages.
Publish 161397 messages.
Publish 161398 messages.
Publish 161399 messages.

The memory error doesn't happen after two-days running. ENV: Node.js: v12.16.1 rclnodejs: 0.13.0 ROS2: build from scratch with master branch

cr01927 commented 4 years ago

Still trying to investigate this. I'm still getting periodic crashes from the application, but I haven't gotten any crashes in my attempt to reproduce the problem. I'm testing rclnodejs directly with my test program - I might try to write another to test ros2_web_bridge

cr01927 commented 4 years ago

Any advice on how to install rclnodejs with debug symbols? I'm not super familiar with nodejs, but node-gyp keeps building it in release, even when I try to node-gyp debug configure rebuild

minggangw commented 4 years ago

node-gyp rebuild --debug will generate the symbols, also enables some debug info.

cr01927 commented 4 years ago

Still working on debugging this. I set the flag MALLOC_CHECK_=1 and now I consistently get a backtrace of

free(): invalid pointer

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
        set = {__val = {4096, 20204750, 0, 28802919, 140737488331120, 40254761, 0, 6, 140737488331168, 18904274, 0, 1, 140737488331104, 140737488331087, 140737488331112, 140737488331120}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007ffff6aba801 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x268ee80, sa_sigaction = 0x268ee80}, sa_mask = {__val = {0, 140737471619584, 28806080, 40453920, 2, 140733193388034, 11539372, 1, 140737488331600, 40882688, 
              140737488346976, 40851920, 4661483999560990720, 140737488331536, 140737488331520, 140737488331808}}, sa_flags = -23808, sa_restorer = 0x1000}
        sigs = {__val = {32, 0 <repeats 15 times>}}
        __cnt = <optimized out>
        __set = <optimized out>
        __cnt = <optimized out>
        __set = <optimized out>
#2  0x00007ffff6b03897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff6c30b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
        ap = {{gp_offset = 24, fp_offset = 0, overflow_arg_area = 0x7fffffffa430, reg_save_area = 0x7fffffffa3c0}}
        fd = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
        on_2 = <optimized out>
        next = <optimized out>
        str = <optimized out>
        len = <optimized out>
        newp = <optimized out>
        iov = <optimized out>
        total = <optimized out>
        cnt = <optimized out>
        buf = <optimized out>
        wp = <optimized out>
        old = <optimized out>
        cnt = <optimized out>
        result = <optimized out>
#3  0x00007ffff6b0a90a in malloc_printerr (str=str@entry=0x7ffff6c2ed88 "free(): invalid pointer") at malloc.c:5350
No locals.
#4  0x00007ffff6b0c84c in free_check (mem=<optimized out>, caller=<optimized out>) at hooks.c:274
        p = <optimized out>
        mem = <optimized out>
        ignore1 = <optimized out>
        ignore3 = <optimized out>
        ignore = <optimized out>
        ignore = <optimized out>
        caller = <optimized out>
        p = <optimized out>
        ignore2 = <optimized out>
        p = <optimized out>
        ignore1 = <optimized out>
        ignore2 = <optimized out>
        ignore3 = <optimized out>
        ignore = <optimized out>
        ignore = <optimized out>
#5  0x00007ffff6b11c27 in __GI___libc_free (mem=0x38338b0) at malloc.c:3094
        ar_ptr = <optimized out>
        p = <optimized out>
        hook = <optimized out>
        __x = <optimized out>
#6  0x00007ffff384c2a2 in rclnodejs::Executor::ExecuteReadyHandles() () from /src/ros2-web-bridge/node_modules/rclnodejs/build/Release/rclnodejs.node
---Type <return> to continue, or q <return> to quit---
No symbol table info available.
#7  0x0000000000a6f65f in uv__async_io (loop=0x264e6e0 <default_loop_struct>, w=<optimized out>, events=<optimized out>) at ../deps/uv/src/unix/async.c:118
        buf = "\001\000\000\000\000\000\000\000\220)j\003\000\000\000\000\034\000\000\000\000\000\000\000\340Yo\002\000\000\000\000\002\000\000\000\000\000\000\000\350Yo\002\000\000\000\000\340\245\377\377\377\$
77\000\000\035\066\214\000\000\000\000\000 Gi\002\000\000\000\000\000\000\000\000\000\360~@\002\000\000\000\000\000\000\000\220)j\003\000\000\000\000\340Yo\002\000\000\000\000\370}o\002\000\000\000\000\060\344\$
35\003\000\000\000\000`\321o\002\000\000\000\000 \246\377\377\377\177\000\000\224\230\234\000\000\000\000\000\330Yo\002\000\000\000\000\250Yo\002\000\000\000\000 \246\377\377\377\177\000\000V\000\000\000\000\00$
\000\000\340)j\003\000\000\000\000\020\334\377\377\377\177\000\000\220\246\377\377\377\177\000\000"...
        r = <optimized out>
        queue = {0x26f28d8, 0x264d888 <node::inspector::(anonymous namespace)::start_io_thread_async+104>}
        q = <optimized out>
        h = <optimized out>
        __PRETTY_FUNCTION__ = "uv__async_io"
#8  0x0000000000a80f08 in uv__io_poll (loop=loop@entry=0x264e6e0 <default_loop_struct>, timeout=2055) at ../deps/uv/src/unix/linux-core.c:379
#9  0x0000000000a6ff9b in uv_run (loop=0x264e6e0 <default_loop_struct>, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:364
#10 0x00000000009045d5 in node::Start(v8::Isolate*, node::IsolateData*, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&) ()
#11 0x000000000090285f in node::Start(int, char**) ()
#12 0x00007ffff6a9bb97 in __libc_start_main (main=0x8bbd60 <main>, argc=8, argv=0x7fffffffe838, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe828)
    at ../csu/libc-start.c:310
#13 0x00000000008bbe95 in _start ()

Still working on getting rclnodejs built with debug symbols so I can get more info about frame #6

minggangw commented 4 years ago

Thanks for keeping investigating this issue and it seems that the crash happens when executing Executor::ExecuteReadyHandles method. I did a quick check and found that we don't call malloc/free directly when calling ExecuteReadyHandles, I think we free a pointer somewhere which was not managed by us. Would you please locate the line number where the crash occurs, that would be helpful, thanks!

cr01927 commented 4 years ago

Finally got rclnodejs building in debug. Generated this backtrace:

Thread 1 "node" received signal SIGSEGV, Segmentation fault.
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:301
301     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.

#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:301
No locals.
#1  0x00007ffff4240e06 in std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<rclnodejs::RclHandle*> (__first=0x7fffbc000ef0, __last=0x7fffbc000ef8, __result=0x0)
    at /usr/include/c++/7/bits/stl_algobase.h:368
        _Num = 1
#2  0x00007ffff4240da6 in std::__copy_move_a<false, rclnodejs::RclHandle* const*, rclnodejs::RclHandle**> (__first=0x7fffbc000ef0, __last=0x7fffbc000ef8, __result=0x0)
    at /usr/include/c++/7/bits/stl_algobase.h:386
        __simple = true
#3  0x00007ffff4240d2f in std::__copy_move_a2<false, __gnu_cxx::__normal_iterator<rclnodejs::RclHandle* const*, std::vector<rclnodejs::RclHandle*, std::allocator<rclnodejs::RclHandle*> > >, rclnodejs::RclHandle**> (__first=0x2ebf7c0, __last=0x2f70ac0, __result=0x0) at /usr/include/c++/7/bits/stl_algobase.h:424
No locals.
#4  0x00007ffff4240c88 in std::copy<__gnu_cxx::__normal_iterator<rclnodejs::RclHandle* const*, std::vector<rclnodejs::RclHandle*, std::allocator<rclnodejs::RclHandle*> > >, rclnodejs::RclHandle**> (
    __first=0x2ebf7c0, __last=0x2f70ac0, __result=0x0) at /usr/include/c++/7/bits/stl_algobase.h:456
No locals.
#5  0x00007ffff4240be4 in std::__uninitialized_copy<true>::__uninit_copy<__gnu_cxx::__normal_iterator<rclnodejs::RclHandle* const*, std::vector<rclnodejs::RclHandle*, std::allocator<rclnodejs::RclHandle*> > >, rclnodejs::RclHandle**> (__first=0x2ebf7c0, __last=0x2f70ac0, __result=0x0) at /usr/include/c++/7/bits/stl_uninitialized.h:101
No locals.
#6  0x00007ffff4240b20 in std::uninitialized_copy<__gnu_cxx::__normal_iterator<rclnodejs::RclHandle* const*, std::vector<rclnodejs::RclHandle*, std::allocator<rclnodejs::RclHandle*> > >, rclnodejs::RclHandle**>
    (__first=0x2ebf7c0, __last=0x2f70ac0, __result=0x0) at /usr/include/c++/7/bits/stl_uninitialized.h:134
        __assignable = true
#7  0x00007ffff4240965 in std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<rclnodejs::RclHandle* const*, std::vector<rclnodejs::RclHandle*, std::allocator<rclnodejs::RclHandle*> > >, rclnodejs::RclHandle**, rclnodejs::RclHandle*> (__first=0x2ebf7c0, __last=0x2f70ac0, __result=0x0) at /usr/include/c++/7/bits/stl_uninitialized.h:289
No locals.
#8  0x00007ffff4240589 in std::vector<rclnodejs::RclHandle*, std::allocator<rclnodejs::RclHandle*> >::vector (this=0x7fffffffa890, __x=std::vector of length 2, capacity 8 = {...})
    at /usr/include/c++/7/bits/stl_vector.h:331
No locals.
#9  0x00007ffff42403be in rclnodejs::HandleManager::get_ready_handles (this=0x2ba6160) at ../src/handle_manager.hpp:57
No locals.
#10 0x00007ffff42400dc in rclnodejs::Executor::ExecuteReadyHandles (this=0x2e20b40) at ../src/executor.cpp:231
No locals.
#11 0x00007ffff423f047 in rclnodejs::Executor::DoWork (handle=0x2eb87c0) at ../src/executor.cpp:116
        executor = 0x2e20b40
#12 0x000000000130d8e1 in uv__async_io (loop=0x2a73820 <default_loop_struct>, w=<optimized out>, events=<optimized out>) at ../deps/uv/src/unix/async.c:147
        buf = "\001\000\000\000\000\000\000\000(\026\266\002\000\000\000\000p\251\377\377\377\177\000\000\375ٖ\000\000\000\000\000\020", '\000' <repeats 12 times>, "<\222@\000\000\000\000\000\000\000\000\200\330\345\002\000\000\000\000(\026\266\002\000\000\000\000\320\330\345\002\000\000\000\000\060\345\266\002\000\000\000\000\032\f\000\000\000\000\000\000\240\251\377\377\377\177\000\000T{\252\000\000\000\000\000 \026\266\002\000\000\000\000\260ֶ\002\000\000\000\000\032\f\000\000\000\000\000\000\320\330\345\002\000\000\000\000@\252\377\377\377\177\000\000\325|\252\000\000\000\000\000\060\000\001", '\000' <repeats 21 times>, "pT\360\002\000\000\000\000p\224\254\002\000\000\000\000\b\026"...
        r = <optimized out>
        queue = {0x2ae2548, 0x2a6c908 <node::inspector::(anonymous namespace)::start_io_thread_async+104>}
        q = <optimized out>
        h = <optimized out>
        __PRETTY_FUNCTION__ = "uv__async_io"

Looks like it was raised from the copy constructor called by https://github.com/RobotWebTools/rclnodejs/blob/1fc4e0f0ba05745e1dbe112d858d96b54d83d6b8/src/handle_manager.hpp#L54

ETA: Not sure if this provides any additional information, but I did some digging into the readyhandles member:

(gdb) p *(rclnodejs::RclHandle*)(this->handle_manager_->ready_handles_._M_impl._M_start)@1
$12 = {{
    <Nan::ObjectWrap> = {
      _vptr.ObjectWrap = 0x3346bf0, 
      refs_ = 46492112, 
      handle_ = {
        <v8::Persistent<v8::Object, v8::NonCopyablePersistentTraits<v8::Object> >> = {
          <v8::PersistentBase<v8::Object>> = {
            val_ = 0x3ac2a30
          }, <No data fields>}, <No data fields>}
    }, 
    members of rclnodejs::RclHandle: 
    static constructor = {
      <v8::Persistent<v8::Function, v8::NonCopyablePersistentTraits<v8::Function> >> = {
        <v8::PersistentBase<v8::Function>> = {
          val_ = 0x2b766c0
        }, <No data fields>}, <No data fields>}, 
    pointer_ = 0x30d8e40, 
    parent_ = 0x30d8e40, 
    properties_ = std::map with 50163232 elements<error reading variable: Cannot access memory at address 0x7730064006c00fb>, 
    properties_obj_ = {
      val_ = 0x7fffeaed0110 <Napi::details::FinalizeData<char, (anonymous namespace)::WrapPointer(Napi::Env, char*, unsigned long)::{lambda(Napi::Env, char*)#1}, void>::Wrapper(napi_env__*, void*, napi_env__*)>
    }, 
    deleter_ = {
      <std::_Maybe_unary_or_binary_function<int>> = {<No data fields>}, 
      <std::_Function_base> = {
        static _M_max_size = 16, 
        static _M_max_align = 8, 
        _M_functor = {
          _M_unused = {
            _M_object = 0x0, 
            _M_const_object = 0x0, 
            _M_function_pointer = 0x0, 
            _M_member_pointer = NULL
          }, 
          _M_pod_data = "\000\000\000\000\000\000\000\000\300\211\r\003\000\000\000"
        }, 
        _M_manager = 0x100000100
      }, 
      members of std::function<int()>: 
      _M_invoker = 0xed
    }, 
    children_ = std::set with 51029744 elements<error reading variable: Cannot access memory at address 0xffb81274e5894865>
  }}
cr01927 commented 4 years ago

After testing a few more times, looks like the memory errors are always with the readyhandles vector. I've seen copy constructor and destructor failures. It looks like its always fine in the get_ready_handles() frame, but it enters std::vectors copy constructor/destructor sometime after that, where the vector is junk. The vector itself is copied (so it doesn't reference the member variable), so the readyhandles vector is fine. The copied vector is passed into ShadowNode::Execute by const reference.

Is something about the async methods in ShadowNode::Execute somehow corrupting the vector? I'm not sure how the async stuff works, but could it enter the async methods with a valid vector, but by the time it leaves the async methods, the vector has been deallocated?

minggangw commented 4 years ago

Thanks for logging the useful information and it seems that we get a different backtrace compared with the previous ones. I notice the following information about the first element of the vector:

    children_ = std::set with 51029744 elements<error reading variable: Cannot access memory at address 0xffb81274e5894865>

So, I think that may cause a crash because of memory corruption.

minggangw commented 4 years ago

rclnodejs has a new version v0.14.0, you could upgrade to the latest release to see if it's still reproducible, thanks!

cr01927 commented 4 years ago

Just to close this out - we wrapped up this project before I got to do more investigation on this. One additional thing I noticed was that this error rarely (but still occasionally) happened on a server with much better hardware than my development laptop. There very well could have been unoptimized web client code that was causing this app to use too many resources, or something.

Sorry we couldn't figure this out!

minggangw commented 4 years ago

That's fine, thanks for what you have done to help investigate the root cause of the problem. Actually, based on the call stack you offered, I found a potential race condition when writing/reading data happens on different threads, and a patch has landed.

Hopefully, we are going to have a patch release recently for the Node.js client (rclnodejs). You could have a try if you are still interested then :D Meanwhile, I want to reopen this problem because others may hit the same problem and they can give feedback through this thread or confirm if the issue is still reproducible. Thanks a lot for your time and effort!