SAP / node-rfc

Asynchronous, non-blocking SAP NW RFC SDK bindings for Node.js
Apache License 2.0

RFC_COMMUNICATION_FAILURE #147

Closed: clausreinke closed this issue 4 years ago

clausreinke commented 4 years ago

We keep seeing a noticeable number of RFC_COMMUNICATION_FAILURE errors. I mentioned this previously in #128, but I want to explore the possibility that it is a separate issue, occurring independently of the invalid handles.

There is a fairly stable pattern that looks somewhat like this:

2020-05-27 11:26:25 post acquire ----> 148 (1627757679712) true 0 {"created":1590571474584,"lastopen":1590571474584,"lastclose":0,"lastcall":0} {
2020-05-27 11:26:25 148 invoked
2020-05-27 11:26:25 post acquire ----> 149 (1627757679712) true 0 {"created":1590571476084,"lastopen":1590571476084,"lastclose":0,"lastcall":0} {
2020-05-27 11:26:25 149 invoked
2020-05-27 11:26:26 post release (error) ----> 149 (undefined) false 0 {"created":1590571476084,"lastopen":1590571476084,"lastclose":1590571586289,"lastcall":1590571585606} {}
2020-05-27 11:26:26 error: communication/network problem (149) {
  "alive": true,
  "name": "RfcLibError",
  "code": 1,
  "codeString": "RFC_COMMUNICATION_FAILURE",
  "key": "RFC_COMMUNICATION_FAILURE",
  "message": {
    "message(split at line breaks)": [
      "LOCATION    CPIC (TCP/IP) on local host with Unicode",
      "ERROR       program state check for conversation 73990548",
      "TIME        Wed May 27 11:26:25 2020",
      "RELEASE     753",
      "COMPONENT   CPIC (TCP/IP) with Unicode",
      "VERSION     3",
      "RC          471",
      "MODULE      D:/depot/bas/753_REL/src/krn/si/cpic/r3cpic.c",
      "LINE        6378",
      "DETAIL      called function STSEND in state state=BUFFER_DATA2",
      "COUNTER     36"
    ]
  }
}
2020-05-27 11:26:26 post release (success) ----> 148 (undefined) false 0 {"created":1590571474584,"lastopen":1590571474584,"lastclose":1590571586948,"lastcall":1590571585534} {}

In explanation, this looks as if two Clients ended up working on the same connection handle.

It took me a while to notice that this is just what we see from within the JavaScript callbacks, which get called from within the N-API callbacks, so there are gaps in this picture.

Could you please check whether the following scenario makes sense?

  1. 148 acquires handle X
  2. 148 invokes function
  3. 148 closes handle X
  4. 149 acquires handle X
  5. 148 is destroyed, calling RfcCloseConnection on handle X again, because RfcIsConnectionHandleValid reports handle X as valid again (due to step 4)
  6. 149 tries to invoke function using handle X, which is closed or in the process of being closed

In other words, two clients are working on the same handle, in spite of the semaphores.
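
For concreteness, here is a minimal sketch of the usage pattern under which we see this. The connection parameters are placeholders, and STFC_CONNECTION merely stands in for our real function modules; the point is only the acquire/invoke/release interleaving of two concurrent tasks:

```js
const rfc = require("node-rfc");

// Placeholder connection parameters.
const pool = new rfc.Pool({
  user: "demo", passwd: "***", ashost: "10.0.0.1", sysnr: "00", client: "100",
});

function oneTask(id) {
  return pool.acquire().then(
    (client) =>
      new Promise((resolve, reject) => {
        client.invoke("STFC_CONNECTION", { REQUTEXT: "task " + id }, (err, res) => {
          pool.release(client); // the handle goes back to the pool here
          err ? reject(err) : resolve(res);
        });
      })
  );
}

// Two tasks in flight at once: if a handle closed by one Client's destructor
// has meanwhile been handed to another Client, both end up on the same handle.
Promise.all([oneTask(148), oneTask(149)]).catch(console.error);
```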

bsrdjan commented 4 years ago

Yes, that is exactly the problem found with the current implementation of the pool. The threads spawned by express can get the same handle X from the pool. The pool must be protected against that, either by moving the pool functionality to C++ and synchronising with the existing mutex, or by introducing a new pool mutex; the existing one should be enough, I think. For the time being, better not to use the pool with express or similar multi-threaded frameworks.
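
A minimal sketch of such a pool-free pattern, with placeholder connection parameters: each task opens its own Client, so the 1:1 Client-to-handle relation cannot be broken by a neighbouring task.

```js
const rfc = require("node-rfc");

// Placeholder connection parameters.
const connParams = {
  user: "demo", passwd: "***", ashost: "10.0.0.1", sysnr: "00", client: "100",
};

function invokeWithOwnClient(rfmName, rfmParams) {
  const client = new rfc.Client(connParams);
  return new Promise((resolve, reject) => {
    client.connect((err) => {
      if (err) return reject(err);
      client.invoke(rfmName, rfmParams, (err, res) => {
        client.close(); // this Client is the handle's only user
        err ? reject(err) : resolve(res);
      });
    });
  });
}
```

The cost is one logon per task, but no two Clients can ever share a connection handle.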

PaulWieland commented 4 years ago

I believe I am experiencing the exact same issue with my node-red-contrib-saprfc node. I reported this before, but closed it because I thought I had solved it by using the connection pool. It seems I was wrong.

The way my code works: I set up a connection pool and create an async task queue. Then I add "tasks" to this queue, where each task is a call through node-rfc.

The queue has a processor that will execute each item in the queue. The number of parallel tasks that can be processed is configurable.

When pushing 1000 BAPI_MATERIAL_SAVEDATA tasks to the queue, with a parallel processing limit of 4, a seemingly random number of tasks will process successfully before I get a segmentation fault.
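
A self-contained sketch of that kind of queue, assuming a node-rfc pool and placeholder BAPI payloads: N workers drain a shared task list, each task being one BAPI_MATERIAL_SAVEDATA call.

```js
// pool: a node-rfc Pool; tasks: an array of BAPI_MATERIAL_SAVEDATA parameter
// objects; parallel: how many calls may be in flight at once.
async function runQueue(pool, tasks, parallel = 4) {
  const queue = tasks.slice();

  async function worker() {
    for (let task = queue.shift(); task !== undefined; task = queue.shift()) {
      const client = await pool.acquire();
      try {
        await new Promise((resolve, reject) =>
          client.invoke("BAPI_MATERIAL_SAVEDATA", task, (err, res) =>
            err ? reject(err) : resolve(res)
          )
        );
      } finally {
        pool.release(client);
      }
    }
  }

  // parallel = 4 reproduces the failing configuration; parallel = 1
  // corresponds to the slow-but-stable case described below.
  await Promise.all(Array.from({ length: parallel }, worker));
}
```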

When I reduce the parallel processing limit to 1, it seems that all tasks complete successfully (albeit extremely slowly). I say "seems", because I have also had RFC_INVALID_HANDLE errors pop up this way, and I have to do further testing to figure out what's going on.

I presently cannot think of any way to change the behavior of my program to prevent the segmentation faults.

Incidentally, I was trying to set up Node's diagnostic report to get more details on the segfault, but for some reason no report is generated. I'm starting node with --report-uncaught-exception --report-on-fatalerror --report-filename=./node-report.json --report-directory=/data
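
One caveat, assuming a Node version where the feature was still experimental (it only left experimental status around Node 13.12/14): the report machinery may additionally require --experimental-report, and a hard segfault inside a native addon can kill the process before any report is written. A quick way to check that report generation works at all is to trigger one by hand:

```js
// Sanity check: write a diagnostic report on demand (Node >= 11.8).
if (process.report && typeof process.report.writeReport === "function") {
  const file = process.report.writeReport(); // returns the report's filename
  console.log("diagnostic report written:", file);
} else {
  console.log("process.report unavailable; check Node version and flags");
}
```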

clausreinke commented 4 years ago

My theory is that the 1:1 relation between Clients and connectionHandles gets messed up (by an explicit or implicit close), at which point the Client mutexes no longer protect against multiple threads using the same connectionHandle.

To test this theory, we have a patched version of node-rfc where we:

  1. avoid closing a handle in the Client destructor if the Client may have lost exclusivity for that handle
  2. try to tighten the management of Client->alive to more accurately track the handle status (reduce the time windows where multiple threads might see incorrect values)

Good news first: in the two days we've tested that patched version, we've seen no more invalid handles (#128) or communication failures (#147), so there is something to the theory.

More details: we've only tested for two days so far, and while the patch may be overly paranoid in some places, it leaves gaps in other places. So I have not yet made a PR out of it, but if you want to test it early yourself, it is https://github.com/tangro/node-rfc/tree/trackHandleStatus . You'll have to make sure to bypass prebuild-install, or you'll not be running the patched version (OK, that alone is perhaps enough reason to make a PR: even if the patch is incomplete, it does seem to be an improvement).

Surprising news: in these two days, we've also not seen any more silent crashes, though I do not yet have a theory why that might be.

On our Windows servers, the crashes did not present as segmentation faults but as fast exits with code 0xC0000409. According to https://devblogs.microsoft.com/oldnewthing/20190108-00/?p=100655 , it is no longer certain that this code means a STATUS_STACK_BUFFER_OVERRUN. However, we've also been unable to catch or predict them, and there has not been enough information, let alone reproducibility, to open a separate issue, even though these crashes have been annoying.

What little I can say from the RFC traces (in the couple of cases where the client or server trace was on at the time and I've been able to locate the incident time) is that these crashes happen during client.invoke: the client successfully connects, sends parameters and even receives the first block of result data, then it disappears without further trace.
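
For anyone trying to capture such an incident, RFC-level tracing can be switched on per connection. This assumes the RFC SDK's TRACE connection parameter (levels 0 to 3) is simply forwarded by node-rfc along with the other, here placeholder, connection parameters; the SDK then writes rfc*.trc files into the working directory:

```js
const rfc = require("node-rfc");

// Placeholder connection parameters, plus full RFC tracing (assumed to be
// passed through to the SDK as the TRACE connection parameter).
const client = new rfc.Client({
  user: "demo", passwd: "***", ashost: "10.0.0.1", sysnr: "00", client: "100",
  trace: "3",
});
```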

PaulWieland commented 4 years ago

@clausreinke I got your PR to compile and tried testing it today. However, as soon as my program calls pool.acquire, it crashes.

EDIT: Ignore this. There was a bug in my own code causing the problem. I found and corrected it; running tests now...

PaulWieland commented 4 years ago

EDIT: Ignore this

Still having the same problem.

[sapRFC:call]  {
  alive: true,
  name: 'RfcLibError',
  code: 13,
  codeString: 'RFC_INVALID_HANDLE',
  key: 'RFC_INVALID_HANDLE',
  message: "An invalid handle 'RFC_CONNECTION_HANDLE' was passed to the API call"
}
double free or corruption (out)
Aborted
npm ERR! code ELIFECYCLE

PaulWieland commented 4 years ago

I have to retract my previous post.

I installed the PR version of node-rfc globally and then installed my app. For some reason, npm ignored the global version of node-rfc and instead installed a local version of 1.2.0 from the registry.

After manually deleting the local version, my app ran with the PR149 build and completed without errors.

Nice work @clausreinke.

@bsrdjan I would love to see this PR merged.

EDIT: For what it's worth, I called BAPI_MATERIAL_SAVEDATA 2000 times using an async queue with 4 parallel tasks running at a time.

bsrdjan commented 4 years ago

Merged, thanks a lot @clausreinke for this contribution.

PaulWieland commented 4 years ago

@bsrdjan When should we expect a new release with this patch included?

bsrdjan commented 4 years ago

Just published, feel free to test.

bsrdjan commented 4 years ago

Just checking the status here. Any updates in the meantime, any issues left?

bsrdjan commented 4 years ago

Let us close and reopen if needed.