SAP / node-rfc

Asynchronous, non-blocking SAP NW RFC SDK bindings for Node.js
Apache License 2.0

Segmentation fault (core dumped) #234

Closed: Rafael-006 closed this issue 2 years ago

Rafael-006 commented 2 years ago

Describe the bug

I am using a connection pool to acquire, call (with timeout), and release connections. On timeout, I receive the event sapnwrfc:clientCancel.

I get a Segmentation fault (core dumped) error and my Node.js application crashes, requiring a restart to work properly.

To Reproduce

Version of node-rfc: 2.5.1, installed locally.

To reproduce the issue:

  1. Use a connection pool with the following configuration:

poolOptions: { low: 2, high: 10 }

  2. Acquire a connection and invoke the API call with a timeout of 10 seconds.
  3. The initial call should work as expected (connections are acquired, invoked, and released successfully in the typical workflow).
  4. Next, disconnect from the internet and make two API calls using async/await.
  5. This should raise the event sapnwrfc:clientCancel for the first API call.
  6. For the second API call, it shows the error: Segmentation fault (core dumped)

Connection handle details are as follows:

bsrdjan commented 2 years ago

Just to check if my understanding of the flow is correct:

  • 139949697433968 & 139949965870272 are acquired and the initial call with 139949697433968 completed regularly
  • Network disconnected
  • Call with 139949965870272 ends with cancel timeout
  • Call with 139949697433968 ends with segfault?

Could you please confirm which of the two connection handles triggers the segfault?

And could you please record and attach here the detailed log, from initial call to segfault, by setting the trace level '3' in the DEFAULT section of sapnwrfc.ini. Example:

sapnwrfc.ini

DEFAULT
RFC_TRACE=3 # 'global' trace level

DEST=QM7
USER=demo
PASSWD=welcome
ASHOST=ldciqm7
SYSNR=00
CLIENT=005
LANG=EN
# RFC_TRACE=2 # 'local' trace level
Rafael-006 commented 2 years ago

And could you please record and attach here the detailed log, from initial call to segfault, by setting the trace level '3' in the DEFAULT section of sapnwrfc.ini.

Here is the log at trace level 3:

[Attached: two screenshots of the trace log, "Screenshot from 2021-09-06 22-46-27" and "Screenshot from 2021-09-06 22-46-36"]

bsrdjan commented 2 years ago

Could you please attach complete log files or email them to me?

Can you figure out whether the segfault occurs for the connection handle from the initial call or for the other one?

Rafael-006 commented 2 years ago

Could you please attach complete log files or email them to me?

Can you figure out whether the segfault occurs for the connection handle from the initial call or for the other one?

Here is a scenario where we try to cancel the client and get an error. Kindly find the log attached:

FATAL ERROR: HandleScope::HandleScope Entering the V8 API without proper locking in place
 1: 0xa222f0 node::Abort() [node]
 2: 0x96411f node::FatalError(char const*, char const*) [node]
 3: 0xb97c3a v8::Utils::ReportApiFailure(char const*, char const*) [node]
 4: 0xb9970c v8::EscapableHandleScope::EscapableHandleScope(v8::Isolate*) [node]
 5: 0x9dfeb6 napi_open_escapable_handle_scope [node]
 6: 0x7f134b3b9934 node_rfc::rfcSdkError(_RFC_ERROR_INFO*, node_rfc::_RfmErrorPath*) [/home/project/node_modules/node-rfc/lib/binding/sapnwrfc.node]
 7: 0x7f134b3d4cf4 node_rfc::Cancel(Napi::CallbackInfo const&) [/home/project/node_modules/node-rfc/lib/binding/sapnwrfc.node]
 8: 0x7f134b3d6bf2 Napi::details::CallbackData<Napi::Value (*)(Napi::CallbackInfo const&), Napi::Value>::Wrapper(napi_env__*, napi_callback_info__*) [/home/project/node_modules/node-rfc/lib/binding/sapnwrfc.node]
 9: 0x9d7e5f [node]
10: 0xc03d0b [node]
11: 0xc052b6 [node]
{"level":"info","time":"Wed Sep 08 2021 12:53:15 GMT+0000","pid":61122,"hostname":"admin","name":"Server","reqId":"req-2","res":{"url":"/v1/login","method":"POST","statusCode":200},"responseTime":14798.31078099832,"msg":"request completed"}
12: 0xc05936 v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) [node]
13: 0x1423359 [node]
Aborted (core dumped)

bsrdjan commented 2 years ago

Sorry, I meant the RFC trace log, captured in *.trc files.

bsrdjan commented 2 years ago

Could you please send the source script to reproduce the issue?

samuel-c-allan commented 2 years ago

Fun fact: I have the same issue, but wrapping the code inside addon.cc:Cancel in a mutex prevents this from happening (on the other hand, it then generates a segfault for a different reason).
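
A minimal sketch of that experiment (hypothetical names and a simplified handler signature; this is not node-rfc's actual code):

    // Hypothetical sketch: serialize all callers of the Cancel handler.
    #include <mutex>

    static std::mutex cancelMutex; // guards the whole Cancel body

    void CancelGuarded() {
        std::lock_guard<std::mutex> guard(cancelMutex); // one caller at a time
        // ... the original body of addon.cc:Cancel would run here ...
    }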

@bsrdjan quick question regarding this topic: is it expected behavior that RegisterModule is called more than once? With the help of gdb I have determined that when cancelling a client it is in fact called twice (if there are two cancellations).

samuel-c-allan commented 2 years ago

Upon further inspection, I'm either missing something or some very illegal behavior is happening. As far as I understand, a worker thread is created in terminate to spawn cancel.ts, which then requires the addon bindings. This is why there is a second call to RegisterModule.

But why do we need to spawn a worker thread to cancel something? Can't it be done on the same thread? Just calling cancel directly?

samuel-c-allan commented 2 years ago

Indeed, cancelling the client in the main thread with a simple call to noderfc_binding.cancel completely resolves this issue for me.

EDIT: Just to elaborate on why I called the behavior illegal: in node-rfc the environment is cached (node_rfc::__env), and the scope inside rfcSdkError, which is invoked during a call to Cancel, uses this cached environment. However, when called from a worker thread, the environment is not the main caller's environment, hence a segfault may occur, since you are trying to create an EscapableHandleScope with the environment node_rfc::__env in a worker thread's environment.

(As a side note, it is generally not recommended to cache the environment, though I can understand doing it in node-rfc, as libsapnwrfc generally does not operate well with multiple programs / independent threads using it.)
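
To illustrate, here is a minimal anti-pattern sketch of that failure mode (hypothetical and heavily simplified, not node-rfc's actual code; assumes node-addon-api):

    // Anti-pattern: cache the environment at module init, then use it from a
    // foreign thread. V8 aborts with "HandleScope::HandleScope Entering the
    // V8 API without proper locking in place", as in the log above.
    #include <napi.h>
    #include <thread>

    static napi_env cached_env = nullptr; // plays the role of node_rfc::__env

    Napi::Object RegisterModule(Napi::Env env, Napi::Object exports) {
        cached_env = env; // the main thread's environment
        return exports;
    }

    void crashFromForeignThread() {
        std::thread([] {
            // BUG: this thread holds no lock for cached_env's isolate
            Napi::Env env(cached_env);
            Napi::EscapableHandleScope scope(env);
        }).join();
    }

    NODE_API_MODULE(example, RegisterModule)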

bsrdjan commented 2 years ago

But why do we need to spawn a worker thread to cancel something? Can't it be done on the same thread? Just calling cancel directly?

From the SAP NW RFC SDK doxygen documentation of the RfcCancel method:

Cancels the RFC call which is currently being called over the given RFC connection and closes the connection. Can be used only on an RFC client connection and needs to be called from a different thread than the one currently executing the RFC call.

It is definitely a bug; we need to investigate a clean solution here.

samuel-c-allan commented 2 years ago

Ok, but then you can spawn a simple thread for that purpose alone, without incurring the entire separate environment of a worker thread, no? Just use regular pthreads (or in this case uv, since it fits nicely with Node stuff anyway). So during cancel you simply spawn a new C++ thread (no workers), then set up a condition variable (or whatever your favorite synchronization strategy is) and wait for the result.

Does that sound cleaner? I personally would certainly choose to go that route.
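
A minimal sketch of that pattern, with RfcCancel stubbed out (in the addon it would be the SDK's real RfcCancel from sapnwrfc.h, called with the connection handle and an RFC_ERROR_INFO pointer; the handle type here is a stand-in):

    // Sketch: cancel from a plain C++ thread (no Node.js worker, hence no
    // second environment) and block until the cancel result is available.
    #include <condition_variable>
    #include <mutex>
    #include <thread>

    using RFC_CONNECTION_HANDLE = void*; // stand-in for the SDK's handle type

    // Stub standing in for the SDK's RfcCancel(connection, &errorInfo).
    static int RfcCancel(RFC_CONNECTION_HANDLE) { return 0; }

    int cancelAndWait(RFC_CONNECTION_HANDLE connection) {
        std::mutex m;
        std::condition_variable cv;
        bool done = false;
        int rc = -1;

        // RfcCancel must run on a different thread than the RFC call itself.
        std::thread canceller([&] {
            const int result = RfcCancel(connection);
            {
                std::lock_guard<std::mutex> lock(m);
                rc = result;
                done = true;
            }
            cv.notify_one();
        });

        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return done; });
        lock.unlock();
        canceller.join();
        return rc; // 0 == RFC_OK in the real SDK
    }

    int main() {
        RFC_CONNECTION_HANDLE handle = nullptr; // would be a live connection
        return cancelAndWait(handle);
    }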

bsrdjan commented 2 years ago

@samuel-c-allan, yes, it sounds good and shall be tested. Would you like to create a PR?

samuel-c-allan commented 2 years ago

@bsrdjan I would love to :) Let me code up something, submit a PR, and then you can check whether it is presentable.

samuel-c-allan commented 2 years ago

@bsrdjan Quick question: are you fine if we go with std::thread instead? While I'm able to get an example working with uv, it involves C-style function pointers, and while I am very comfortable working with them, it kind of goes against the whole higher-level C++ style of the project. If you don't particularly care either way I'll go with the approach I'm using right now, but if you prefer more C++-style code I'll switch to std::thread.