RFC_INVALID_HANDLE - Githubissues

clausreinke commented 4 years ago

As much as we like the library, we're having a persistent problem in using it, namely

{
  "name": "RfcLibError",
  "code": 13,
  "key": "RFC_INVALID_HANDLE",
  "message": "An invalid handle was passed to the API call"   
}

We cannot reliably reproduce the error and as of yet, no pattern is apparent in what triggers the bug, but it happens more than once per day in light usage.

Lacking a reproducable example, we're looking for anything useful we could log to help narrow down the problem when it occurs (suggestions are welcome - currently, we log client.id, client.isAlive and client.connectionInfo) and we've found a first pattern. It does not cover all cases but sufficiently many to be interesting:

It appears that sometimes the connectionInfo goes missing in acquired connections (logged between const client = await connectionPool.acquire() and client.invoke(..)).

We have a global connection pool, constructed once, via new Pool(sapConnectionParameters, { min: 1, max: -42 }); (the minimum of 1 avoids obscuring the issue with multiple connections) and neither the pool nor the sapConnectionParameters are changed after construction (a shallow copy is made of the latter to ensure this). For every SAP call, we acquire a connection/client, check that it and its id are defined, then call client.invoke(..) and -in its callback- we end with connectionPool.release(client) (in both success and failure cases).

Sometimes, between const client = await connectionPool.acquire() and client.invoke(..), the client.connectionInfo logs as {}, followed by an RFC_INVALID_HANDLE error from the .invoke.

We're stumped as to how the connectionInfo could be empty at that point - does this give you any hint about what could be going wrong (perhaps a rogue deconstruction)?

You mentioned in https://github.com/SAP/node-rfc/issues/99#issuecomment-523369638 that

RFC_INVALID_HANDLE is usually caused by sharing the same node-rfc client connection instance among more than one thread.

We don't use any threads ourselves and since nodejs is single-threaded for the JavaScript part, the only threads in play should be the worker threads (Napi::AsyncWorker) used in the node-rfc implementation. Of course, multiple async calls can be interleaved.

OS: Windows 10 node: node-v12.14.1-win-x64 SAP NW RFC SDK: nwrfc750P_6-70002755 node-rfc: node-rfc@1.0.3

bsrdjan commented 4 years ago

I am aware of this longer standing issue, hardly reproducible in a controlled manner.

Finally found a simple way to reproduce (unit tests in locktest.js) but can't be sure exactly the same cause is happening in your environment.

Reading the code, the empty connectionInfo should never be possible, need to investigate that further.

Can you build from source the current master and test for some time if any improvements in your environment? I can also build a Windows binary or publish a (pre)release with this fix, if needed?

clausreinke commented 4 years ago

I've built from source (npm install, then npm prebuild) and will test next week.

From looking at the commit, the casts should already be implicit in mallocU/memsetU, so it is about an additional error check (always useful) and a lock in PrepareAsync, right?

I don't understand either how the connectionInfo gets emptied, since it is read directly from the sap nwrfc library and it looks like there should at least be empty fields..

Btw, the connectionInfo structure looks a bit different in nwrfc750P_6-70002755, _RFC_ATTRIBUTES in sapnwrfc.h: compare fields user, cpicConvId, progName, and partnerBytesPerChar. There are also some new fields.

bsrdjan commented 4 years ago

Beside casts, additional exclusive locks are added in Ping() and PrepareAsync() (see Lock/UnlockMutex())

With just pushed commit c5a443b, another exclusive lock is added in ConnectionInfo(), to prevent another client possibly interfere after connection handle check, invalidating connection handle before reading connectionInfo. But even in such case, an error should be returned, not empty connectionInfo.

It is hard to believe but SAP NWRFC SDK might also have a bug, will check with SDK team :)

clausreinke commented 4 years ago

)-: just had another one, this time with node-rfc e69b5b9c5a085b51d2a8c95727fe80ab199d5965

same basic effect: connection with empty connectionInfo straight out of connection pool acquire.

clausreinke commented 4 years ago

not sure whether this will help, but since we lack information: immediately before the bogus acquire result, there were two near simultaneous calls.

We log client.id, client.isAlive and client.connectionInfo, immediately after acquire, immediately before invoke, and immediately after release (both success and error case)

16:32:51 post acquire ----> 64 true connectionInfo
16:32:52 pre invoke ----> 64 true connectionInfo
16:32:52 post acquire ----> 65 true connectionInfo
16:32:52 pre invoke ----> 65 true connectionInfo
16:32:52 post release (success) ----> 65 false {}
16:32:53 post release (success) ----> 64 false {}

16:32:53 post acquire ----> 66 true {}
16:32:53 pre invoke ----> 66 true {}
16:32:53 post release (error) ----> 66 false {}
16:32:53 communication/network problem {
  "name": "RfcLibError",
  "code": 13,
  "key": "RFC_INVALID_HANDLE",
  "message": "An invalid handle was passed to the API call"
}

Note that client.connectionInfo always reports as empty immediately after release. Then it reports as empty immediately after acquire and the error is raised in invoke, leading to release and error log in the error case.

connectionInfo usually includes

{
  ..,
  "codepage": "4103",
  "partnerCodepage": "4103",
  "rfcRole": "C",
  "type": "E",
  "partnerType": "3",
  "rel": "753",
  "partnerRel": "740",
  "kernelRel": "749",
  "cpicConvId": "57521748",
  "progName": "SAPLSYST",
  "partnerBytesPerChar": "2",
  "partnerSystemCodepage": "4103",
  ..
}

connection pool options are { "min": 1, "max": -42 }

bsrdjan commented 4 years ago

After discussing with NWRFC SDK experts, two changes are implemented with f058683

The IsConnectionHandleValid check may return true, even after connection closed, if the connection handle in internal NWRFC SDK cache still looks valid. That check is in ConnectionInfo therefore replaced by simple connection alive flag check.

More restrictive logic is added to reset this flag, after certain error types raised on ABAP server side.

Could you please build from from master with this commit included and re-test?

The NWRFC SDK trace would be also helpful for analysis. The trace can be activated by setting the connection parameter trace to 3. If the resulting trace file is not too big and contains no confidential data, please attach, or we find another way to share.

clausreinke commented 4 years ago

Thanks - I've pulled, installed, prebuilt, and I'm going to test next week. Not sure about the trace yet, as it may run for a day between errors (for such cases, it would be useful if the SDK had a "rotating trace" option that only keeps the last hour of trace output - that could be copied once an error occured; for call-level tracing, we had to add our own trace variant anyway to get a useful amount of trace data).

Btw, while the previous fixes did not subjectively change the number of INVALID_HANDLEs, they did change their symptoms: all occurrences this week were of the kind "empty connectionInfo after acquire" (previously, there were other variations mixed in). So you might well be on the right track.

clausreinke commented 4 years ago

sadly, the INVALID_HANDLE reappeared quickly, though apparently as a side-effect of an "Unknown error in lower level library" (the latter had previously happened at a rate of about 1/week, at least once we were past the buggy SDK patch levels).

So the symptoms have changed, and the trace happened to be on at the time:

logging (note the interleaving and apparent interference of client ids 30 and 31, as well as the error info in place of connectionInfo):

18:16:52 post acquire----> 30 true {
"dest": "",
"host": "..",
"partnerHost": "..",
"sysNumber": "00",
"sysId": "..",
"client": "..",
"user": "..",
"language": "D",
"trace": "3",
"isoLanguage": "DE",
"codepage": "4103",
"partnerCodepage": "4103",
"rfcRole": "C",
"type": "E",
"partnerType": "3",
"rel": "753",
"partnerRel": "740",
"kernelRel": "749",
"cpicConvId": "23016689",
"progName": "SAPLSYST",
"partnerBytesPerChar": "2",
"partnerSystemCodepage": "4103",
"partnerIP": "..",
"partnerIPv6": ".."
}
18:16:52 .. called // invoke with client.id 30 happens here
// here we acquire client.id 31, claiming to be alive but with an error instead of connectionInfo
18:16:53 post acquire----> 31 true {
"name": "RfcLibError",
"code": 13,
"key": "RFC_INVALID_HANDLE",
"message": "An invalid handle was passed to the API call"
}
// here comes the callback for the failed invoke with client.id 30
18:16:53 post release (error) ----> 30 false {}
18:16:53 {
"name": "RfcLibError",
"code": 28,
"key": "",
"message": "Unknown error in lower level library"
}

trace (three seem to have been active at the time in question, one for each connection and one for admin/cleanup?) - note: both RfcOpenConnection return RFC_OK (in spite of the RfcLibError in connectionInfo above) and there is only one call to RfcGetConnectionAttributes, not two

rfc35832_30048.trc - client.id 30

2020-04-04 18:16:52.630909 [30048] >> RfcOpenConnection
>> RfcOptions
    Partner Char Size: 1
    Client: ..
    User: ..
    Alias: 
    Passwd: (- secret -)
    Language: D
    MySAPSSO2: 
    GetSSO2: 0
    X.509 Certificate: 
    ExtidData: 
    ExtidType: 
    SysNr: 00
    ASHost: ..
    ASService: 
    GWHost: 
    GWService: sapgw00
    R3Name: 
    MSHost: 
    MSService: 
    WSHost: 
    WSPort: 
    TCPHost: 
    TCPPort: 
    TLS Cryptolib: 
    TLS Client PSE: 
    TLS Server PSE: 
    TLS Trust All: Off
    Proxy Host: 
    Proxy Port: 
    Proxy User: 
    ProgramID: 
    SAPRouter: 
    SNC Mode: 0
    SNC QOP: 0
    SNC Lib: 
    SNC Myname: 
    SNC Partnername: 
    Codepage: 1100
    Trace: 3
    LogonCheck: 1
    Compression: 1
    Serialization: 0
    Ignore Char Conv Errors: 0
    Substitution Char: 35
    UseRoundtripOptimization: 0
    SO Keepalive: 0
    CPIC Keepalive: 0
    Delta Manager: 1
<< RfcOptions

2020-04-04 18:16:52.631194 [30048]  >> RfcOptions::determineType
determined type RFC_CLIENT
2020-04-04 18:16:52.631276 [30048]  << RfcOptions::determineType returned 0

2020-04-04 18:16:52.631325 [30048]  >> RfcConnection::openNew
2020-04-04 18:16:52.631381 [30048]  >> RfcConnection::connect
2020-04-04 18:16:52.631426 [30048]  >> CpicConnection::connectAsClient
2020-04-04 18:16:52.637434 [30048]  << CpicConnection::connectAsClient returned 0

>> HEXDUMP
2020-04-04 18:16:52.637600 [30048] Sending data for handle 2138511726400, convID 23016689, codepage 1100
..
<< HEXDUMP

>> HEXDUMP
2020-04-04 18:16:52.642371 [30048] Receiving data for handle 2138511726400, convID 23016689, codepage unknown
..
<< HEXDUMP

RfcConnection handle 2138511726400 established
2020-04-04 18:16:52.642790 [30048]  << RfcConnection::connect returned 0

>> RfcConnection name="none"
Handle = 2138511726400 ConvID = 23016689
>> RfcControl name=".."
    Program Name: SAPLSYST                                
    Codepage: 4103      Communication Codepage: 1100
    SysID: ..       Target: ..      Mandant: ..     SysNr: 00       Language: D     User: ..
    Conversation ID: 23016689   Role: C753
    Partner Rel: 740        Partner Kernel Rel: 749         Partner IP: ..      Partner IPv6: ..        Protocol version: 3595
    Exclusive: false        Get SSO2: false     Use basXML: 0
<< RfcControl

>> CpicConnection
    CPIC Handle: 0
    Gateway Host:       Gateway Service:        AS Host:        AS Service:         Program ID: 
    Protocol: I
    Conversation ID: 23016689
    Last IO Time: Sat Apr 04 18:16:52 2020

    Data Received: 3        Status Received: 2      All Received: 1
    Connected: true     Deallocated: false      Accepted: 0
    Server Mode: false      Registered Mode: false      Just Accepted: 0
<< CpicConnection

    Asynchron: 0
    Has Error: 0
<< RfcConnection

2020-04-04 18:16:52.643036 [30048]  << RfcConnection::openNew returned 0

2020-04-04 18:16:52.643065 [30048] << RfcOpenConnection returned RFC_OK

2020-04-04 18:16:52.762744 [30048] >> RfcInvoke
Invoking .. on handle 2138511726400

>> RfcFunction name=".."
..
<< RfcFunction

2020-04-04 18:16:52.764140 [30048]  >> RfcFunction::rfcCallReceive
>> HEXDUMP
2020-04-04 18:16:52.765782 [30048] Sending data for handle 2138511726400, convID 23016689, codepage 4103
..
<< HEXDUMP

2020-04-04 18:16:53.291129 [30048] >> Exception occurred
    [Handle: 2138511726400  ConvID: 0   SysID: ..   SysNr: 00   Target: ..  User: ..    Mandant: .. Language: D Codepage: 4103]
    Unknown error during RFC call

2020-04-04 18:16:53.291211 [30048]  << RfcFunction::rfcCallReceive returned 15

2020-04-04 18:16:53.294320 [30048] << RfcInvoke returned RFC_UNKNOWN_ERROR
    RFC_ERROR_INFO.message: Unknown error in lower level library

rfc35832_30680.trc - client.id 31

2020-04-04 18:16:53.249346 [30680] >> RfcOpenConnection
>> RfcOptions
    Partner Char Size: 1
    Client: ..
    User: ..
    Alias: 
    Passwd: (- secret -)
    Language: D
    MySAPSSO2: 
    GetSSO2: 0
    X.509 Certificate: 
    ExtidData: 
    ExtidType: 
    SysNr: 00
    ASHost: ..
    ASService: 
    GWHost: 
    GWService: sapgw00
    R3Name: 
    MSHost: 
    MSService: 
    WSHost: 
    WSPort: 
    TCPHost: 
    TCPPort: 
    TLS Cryptolib: 
    TLS Client PSE: 
    TLS Server PSE: 
    TLS Trust All: Off
    Proxy Host: 
    Proxy Port: 
    Proxy User: 
    ProgramID: 
    SAPRouter: 
    SNC Mode: 0
    SNC QOP: 0
    SNC Lib: 
    SNC Myname: 
    SNC Partnername: 
    Codepage: 1100
    Trace: 3
    LogonCheck: 1
    Compression: 1
    Serialization: 0
    Ignore Char Conv Errors: 0
    Substitution Char: 35
    UseRoundtripOptimization: 0
    SO Keepalive: 0
    CPIC Keepalive: 0
    Delta Manager: 1
<< RfcOptions

2020-04-04 18:16:53.249535 [30680]  >> RfcOptions::determineType
determined type RFC_CLIENT
2020-04-04 18:16:53.249584 [30680]  << RfcOptions::determineType returned 0

2020-04-04 18:16:53.249613 [30680]  >> RfcConnection::openNew
2020-04-04 18:16:53.249667 [30680]  >> RfcConnection::connect
2020-04-04 18:16:53.249695 [30680]  >> CpicConnection::connectAsClient
2020-04-04 18:16:53.251574 [30680]  << CpicConnection::connectAsClient returned 0

>> HEXDUMP
2020-04-04 18:16:53.251665 [30680] Sending data for handle 2138513959152, convID 23018303, codepage 1100
..
<< HEXDUMP

>> HEXDUMP
2020-04-04 18:16:53.256384 [30680] Receiving data for handle 2138513959152, convID 23018303, codepage unknown
..
<< HEXDUMP

RfcConnection handle 2138513959152 established
2020-04-04 18:16:53.257160 [30680]  << RfcConnection::connect returned 0

>> RfcConnection name="none"
Handle = 2138513959152 ConvID = 23018303
>> RfcControl name=".."
    Program Name: SAPLSYST                                
    Codepage: 4103      Communication Codepage: 1100
    SysID: ..       Target: ..      Mandant: ..     SysNr: 00       Language: D     User: ..
    Conversation ID: 23018303   Role: C753
    Partner Rel: 740        Partner Kernel Rel: 749         Partner IP: ..      Partner IPv6: ..        Protocol version: 3595
    Exclusive: false        Get SSO2: false     Use basXML: 0
<< RfcControl

>> CpicConnection
    CPIC Handle: 0
    Gateway Host:       Gateway Service:        AS Host:        AS Service:         Program ID: 
    Protocol: I
    Conversation ID: 23018303
    Last IO Time: Sat Apr 04 18:16:53 2020

    Data Received: 3        Status Received: 2      All Received: 1
    Connected: true     Deallocated: false      Accepted: 0
    Server Mode: false      Registered Mode: false      Just Accepted: 0
<< CpicConnection

    Asynchron: 0
    Has Error: 0
<< RfcConnection

2020-04-04 18:16:53.257545 [30680]  << RfcConnection::openNew returned 0

2020-04-04 18:16:53.257588 [30680] << RfcOpenConnection returned RFC_OK

rfc35832_12896.trc - admin/cleanup?

2020-04-04 18:16:52.643281 [12896] >> RfcGetConnectionAttributes
2020-04-04 18:16:52.643328 [12896] << RfcGetConnectionAttributes returned RFC_OK

2020-04-04 18:16:53.271555 [12896] >> RfcCloseConnection
2020-04-04 18:16:53.271647 [12896]  >> RfcConnection::closeHandle
>> RfcConnection name="none"
Handle = 2138511726400 ConvID = 23016689
>> RfcControl name=".."
    Program Name: SAPLSYST                                
    Codepage: 4103      Communication Codepage: 1100
    SysID: ..       Target: ..      Mandant: ..     SysNr: 00       Language: D     User: ..
    Conversation ID: 23016689   Role: C753
    Partner Rel: 740        Partner Kernel Rel: 749         Partner IP: ..      Partner IPv6: ..        Protocol version: 3595
    Exclusive: false        Get SSO2: false     Use basXML: 0
<< RfcControl

>> CpicConnection
    CPIC Handle: 0
    Gateway Host:       Gateway Service:        AS Host:        AS Service:         Program ID: 
    Protocol: I
    Conversation ID: 23016689
    Last IO Time: Sat Apr 04 18:16:52 2020

    Data Received: 0        Status Received: 0      All Received: 1
    Connected: true     Deallocated: false      Accepted: 0
    Server Mode: false      Registered Mode: false      Just Accepted: 0
<< CpicConnection

    Asynchron: 0
    Has Error: 0
<< RfcConnection

2020-04-04 18:16:53.272157 [12896]  >> RfcConnection::disconnect
2020-04-04 18:16:53.272204 [12896]  >> CpicConnection::close
2020-04-04 18:16:53.283041 [12896]  << CpicConnection::close returned 0

2020-04-04 18:16:53.283110 [12896]  << RfcConnection::disconnect returned 0

2020-04-04 18:16:53.283157 [12896]  << RfcConnection::closeHandle returned 0

2020-04-04 18:16:53.283184 [12896] << RfcCloseConnection returned RFC_OK

2020-04-04 18:16:53.283285 [12896] >> RfcCloseConnection
2020-04-04 18:16:53.283310 [12896]  >> RfcConnection::closeHandle
>> RfcConnection name="none"
Handle = 2138513959152 ConvID = 23018303
>> RfcControl name=".."
    Program Name: SAPLSYST                                
    Codepage: 4103      Communication Codepage: 1100
    SysID: ..       Target: ..      Mandant: ..     SysNr: 00       Language: D     User: ..
    Conversation ID: 23018303   Role: C753
    Partner Rel: 740        Partner Kernel Rel: 749         Partner IP: ..      Partner IPv6: ..        Protocol version: 3595
    Exclusive: false        Get SSO2: false     Use basXML: 0
<< RfcControl

>> CpicConnection
    CPIC Handle: 0
    Gateway Host:       Gateway Service:        AS Host:        AS Service:         Program ID: 
    Protocol: I
    Conversation ID: 23018303
    Last IO Time: Sat Apr 04 18:16:53 2020

    Data Received: 3        Status Received: 2      All Received: 1
    Connected: true     Deallocated: false      Accepted: 0
    Server Mode: false      Registered Mode: false      Just Accepted: 0
<< CpicConnection

    Asynchron: 0
    Has Error: 0
<< RfcConnection

2020-04-04 18:16:53.283547 [12896]  >> RfcConnection::disconnect
2020-04-04 18:16:53.283572 [12896]  >> CpicConnection::close
2020-04-04 18:16:53.293954 [12896]  << CpicConnection::close returned 0

2020-04-04 18:16:53.294004 [12896]  << RfcConnection::disconnect returned 0

2020-04-04 18:16:53.294040 [12896]  << RfcConnection::closeHandle returned 0

2020-04-04 18:16:53.294069 [12896] << RfcCloseConnection returned RFC_OK

2020-04-04 18:16:53.301973 [12896] >> RfcDestroyFunction
Deleting function ..
2020-04-04 18:16:53.302058 [12896] << RfcDestroyFunction returned RFC_OK

bsrdjan commented 4 years ago

Checked together with NWRFC SDK experts and RFC_UNKNOWN_ERROR, coming from levels below the SDK, makes it hard to guess what could be wrong here.

I found two critical node-rfc gaps which might have impact here:

Closing the connection possible while RFC call ongoing
Pool reopen() method, used when releasing the connection back to Pool, resolves to void/undefined. It should be the Client instance, just like the Pool open() method

These gaps, leading to issues like #132 and #133, are now fixed in master.

Client locking is also checked in detail and more unit tests added for Pool.

All these fixes will be shipped with next release, as soon the #122 fixed, waiting for https://github.com/prebuild/prebuild/issues/267.

If waiting for prebuild issue takes longer, please feel free to build from current master: 5837df5

These new added features might be helpful in troubleshooting and custom logging:

Number of RFC calls currently running via the client instance: https://github.com/SAP/node-rfc/blob/master/test/locking/callback.spec.js#L41
Throughput of one or more client instances: https://github.com/SAP/node-rfc/blob/master/test/throughput.spec.js
Client instance created, last open/call/close timestamps: https://github.com/SAP/node-rfc/blob/master/test/throughput.spec.js

Hope this helps :)

bsrdjan commented 4 years ago

Changes are published in v1.1.0

clausreinke commented 4 years ago

Thanks, it is good to see you're active on these issues (and in touch with the SDK folks) and that there is visible progress. Though the "closing while active" issue should not affect us.

I did not have as much opportunity to test last week as before and did not keep the trace on all the time but -subjectively- the number of incidents seems to have gone down (from at least once per day to less than once per day, over a week). Will update and check whether the additional info helps with logging.

And having the existing fixes published prompts me to update our other servers (I kept them running on 1.0.3 for comparison, and they're still having more frequent issues, including random shutdowns).

The thing that confused me about the last logs I provided was that one call failed and another, interleaved, call got the INVALID_HANDLE - was that a coincidence of two calls failing one after the other with different errors or was there some interaction between one call failing and the other ones error message?

Btw, when browsing the trace files, I see that the "Sending data" entries vary between code pages 4103 and 1100 while the "Receiving data" entries vary between 4103 and unknown - is that normal (I would expect sending/receiving to stick to one known code page each)?

clausreinke commented 4 years ago

I'm currently a bit confused about the extra outputs, in particular .status timestamps and ._connectionHandles. I am also starting to wonder where the client .id comes from and how it is expected to behave (as it is used to index the pool).

First, I noticed that some freshly acquired Clients had lastcall stamps that were smaller than their lastopened stamps, then I followed one ._connectionHandle and saw it reappear with different client ids And the client ids are not monotonically increasing, either.

Slice of log for specific ._connectionHandle and the .ids associated with it (in order, each line has client .id, ._connectionHandle, .isAlive, .runningRFCCalls, .status ):

12:54:18  post acquire           ----> 3 (1506563273056) true 0 {"created":1586861658889,"lastopen":1586861658889,"lastclose":0,"lastcall":0} 
12:54:20  post release (success) ----> 3 (undefined) false 0 {"created":1586861658889,"lastopen":1586861658889,"lastclose":1586861660586,"lastcall":1586861660114} 
14:22:48  post acquire           ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866961499,"lastclose":0,"lastcall":1586866961390} 
14:22:48  post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866968976,"lastclose":0,"lastcall":1586866968887} 
14:22:49  post acquire           ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866968976,"lastclose":0,"lastcall":1586866968887} 
14:22:49  post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866969261,"lastclose":0,"lastcall":1586866969166} 
14:22:49  post acquire           ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866969261,"lastclose":0,"lastcall":1586866969166} 
14:22:49  post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866969861,"lastclose":0,"lastcall":1586866969830} 
14:22:50  post acquire           ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866969861,"lastclose":0,"lastcall":1586866969830} 
14:23:00  post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866969861,"lastclose":1586866980632,"lastcall":1586866970450}

clausreinke commented 4 years ago

Here is an INVALID_HANDLE with the latest versions (again, trace wasn't on, but some of the new logging options were)

versions:
{"major":7500,"minor":0,"patchLevel":6,"binding":"1.1.0"}

connectionInfo:
{
  "dest": "",
  "host": "-",
  "partnerHost": "-",
  "sysNumber": "00",
  "sysId": "-",
  "client": "-",
  "user": "-",
  "language": "D",
  "trace": "0",
  "isoLanguage": "DE",
  "codepage": "4103",
  "partnerCodepage": "4103",
  "rfcRole": "C",
  "type": "E",
  "partnerType": "3",
  "rel": "753",
  "partnerRel": "740",
  "kernelRel": "749",
  "cpicConvId": "89864650",
  "progName": "SAPLSYST",
  "partnerBytesPerChar": "2",
  "partnerSystemCodepage": "4103",
  "partnerIP": "-",
  "partnerIPv6": "-"
}

given that previous traces showed code pages 4103 and 1100, is it odd that there is only 4103 here?

As the connectionHandle sees some reuse, here is a log slice wrt that handle, ending in an INVALID_HANDLE directly on acquire

most of the time, the handle remains defined and the client valid even after release
except for the last acquire, the same handle ist acquired with different ids

except for the last acquire, the timestamps change even if the handle does not


18:31:23 connection, post acquire----> 3 (3069229760224) true 0 {"created":1586881883129,"lastopen":1586881883129,"lastclose":0,"lastcall":0} {connectionInfo}
18:31:23 RFC F1 called
18:31:25 RFC F1 (after 2.821s) returns
18:31:26 connection, post release (success) ----> 3 (3069229760224) true 0 {"created":1586881883129,"lastopen":1586881883129,"lastclose":1586881886000,"lastcall":1586881883176} {}

19:12:24 connection, post acquire----> 4 (3069229760224) true 0 {"created":1586884344466,"lastopen":1586884344466,"lastclose":0,"lastcall":0} {connectionInfo} 19:12:24 RFC F1 called 19:12:26 RFC F1 (after 2.269s) returns 19:12:26 connection, post release (success) ----> 4 (undefined) false 0 {"created":1586884344466,"lastopen":1586884344466,"lastclose":1586884346760,"lastcall":1586884344489} {}

19:12:29 connection, post acquire----> 5 (3069229760224) true 0 {"created":1586884349515,"lastopen":1586884349516,"lastclose":0,"lastcall":0} {connectionInfo} 19:12:29 RFC F2 called 19:12:30 RFC F2 (after 1.058s) returns 19:12:30 connection, post release (success) ----> 5 (3069229760224) true 0 {"created":1586884349515,"lastopen":1586884350590,"lastclose":0,"lastcall":1586884349531} {}

19:12:30 connection, post acquire----> 5 (3069229760224) true 0 {"created":1586884349515,"lastopen":1586884350590,"lastclose":0,"lastcall":1586884349531} { "name": "RfcLibError", "code": 13, "key": "RFC_INVALID_HANDLE", "message": "An invalid handle 'RFC_CONNECTION_HANDLE' was passed to the API call" }

bsrdjan commented 4 years ago

Thanks a lot for the very valuable input.

Btw, when browsing the trace files, I see that the "Sending data" entries vary between code pages 4103 and 1100 while the "Receiving data" entries vary between 4103 and unknown - is that normal (I would expect sending/receiving to stick to one known code page each)?

Yes, that’s normal. No need to worry, the codepages are handled automatically between NWRFC SDK and RFC server system.

I am also starting to wonder where the client .id comes from and how it is expected to behave (as it is used to index the pool).

The client id comes from N-API/C wrapper and incremented with each new Client instance. After opening the connection, the client gets the connection handle. The handle can change after re-opening the connection, id stays the same. The reopen happens after the client released back to pool and acquired for new task, for example.

The thing that confused me about the last logs I provided was that one call failed and another, interleaved, call got the INVALID_HANDLE - was that a coincidence of two calls failing one after the other with different errors or was there some interaction between one call failing and the other ones error message?

This could be the actually the cause of getting RFC_INVALID_HANDLE. After certain errors, the connection is automatically closed and cleaned-up by NWRFC SDK. The node-rfc was so far in this regard transparent, only updating isAlive flag. Checking this flag before each call in NodeJS is not enough because one client can start multiple client.call() calls in parallel and all would see alive === true. After one call fails with critical error, the connection is broken, not anymore useful for other pending calls. When the next call comes for execution, the RFC_INVALID_HANDLE is raised.

With the latest commit 5794944, the node-rfc automatically re-opens the connection, after critical errors. This should eliminate or at least considerably reduce RFC_INVALID_HANDLE errors.

If the problem persists, I would try to reproduce by using the same stack (any other components, frameworks on top of node-rfc?) and calling RFC_PING_AND_WAIT and RFC_RAISE_ERROR ABAP modules, to simulate some real ABAP processing.

clausreinke commented 4 years ago

After a full day of testing 1.1.0 on two servers, that version definitely generates a lot more RFC_INVALID_HANDLEs, but almost all of them come immediately after the acquire, the rest from reused handles. So that would support your theory. Will update to the current HEAD and test more.

I do find it odd that acquire would simply return something that looks like a client connection but is not valid. Reopening connections that are likely invalid should avoid this, but shouldn't acquire have returned or thrown a clear error condition instead of returning an invalid handle with an error hidden in the connectionInfo?

clausreinke commented 4 years ago

Still getting a lot of RFC_INVALID_HANDLEs with e7c1fba9e15d4492af2335c381788dee6dc03db8 , mostly after acquire:-(

clausreinke commented 4 years ago

Could there be an option for not reusing connections (ie not putting the reopened client back into the pool after release but just closing it), so that acquire always returns fresh clients?

bsrdjan commented 4 years ago

The reason for empty connectionInfo is still open and discussed with NWRFC SDK experts but hard to guess without reproducible example. Which components/frameworks are used on servers, on top of Pool? How the typical application looks like? I would try to make more stress tests, with similar setup.

bsrdjan commented 4 years ago

You may test also the Pool without reuse of released connections 9833196.

In parallel I will check if Pool functionality could be moved to C, making it less dependable on Promises implementation.

bsrdjan commented 4 years ago

I observed some unit test behave differently when running from the command line and via npm script, executing exactly the same command line. Sometimes, via npm script, the tests were failing because the client.id could not be referenced because the client is undefined. Isolated tests and all tests via command line were always passing and failing only when running all tests via npm script.

Suspecting it could be somehow related to Promise used in Pool internal active/ready collections, I replaced them with Client instances, reducing the overhead.

The npm testing issues are now gone and perhaps it could improve the situation in your environment: a6ec7ef

If not, the Pool shall be moved to C level, becoming independent of Promise implementations on different platforms.

clausreinke commented 4 years ago

I'm still observing acquire returning a client where getting the connectionInfo will return an INVALID_HANDLE error and where the _connectionHandle has been seen before in the same server run (even though lastcall is 0 and the id is different):

21:10:36 connection, post acquire ----> 54 (2783525369200) true 0 {"created":1587582636762,"lastopen":1587582636762,"lastclose":0,"lastcall":0} {"major":7500,"minor":0,"patchLevel":6,"binding":"1.2.0"} {
  "alive": true,
  "name": "RfcLibError",
  "code": 13,
  "codeString": "RFC_INVALID_HANDLE",
  "key": "RFC_INVALID_HANDLE",
  "message": "An invalid handle 'RFC_CONNECTION_HANDLE' was passed to the API call"
}

As before, the INVALID_HANDLE in getting the connectionInfo predicts/precedes the same error on the actual invoke that follows.

clausreinke commented 4 years ago

I've collected 2 somewhat longer log slices, hoping to see some connections.

As before, the individual calls proceed as follows:

acquire
invoke
then either
- return info + release (success)
- release (error) + error message + return info

acquire and release have an await, invoke a callback (the release calls are in this callback).

post acquire and release, we log client .id , ._connectionHandle, .isAlive, .runningRFCCalls, .status, and .connectionInfo. The client.id is used to track the other events through interleaved calls.

The most common failure pattern is still:

P1.

post acquire, .connectionInfo gives an INVALID_HANDLE error,
the following invoke raises the same error

However, a couple of other patterns can also be observed:

P2. 2020-04-24 11:51:01

post acquire for id 27, we have a client with INVALID_HANDLE error in .connectionInfo
a second acquire for id 28 returns the very same ._connectionHandle (27 has not been released at this point)
the invoke for 28 results in a communication failure

{
  "alive": true,
  "name": "RfcLibError",
  "code": 1,
  "codeString": "RFC_COMMUNICATION_FAILURE",
  "key": "RFC_COMMUNICATION_FAILURE",
  "message": {
    "message(split at line breaks)": [
      "LOCATION    CPIC (TCP/IP) on local host with Unicode",
      "ERROR       program state check for conversation 33641697",
      "TIME        Fri Apr 24 13:35:57 2020",
      "RELEASE     753",
      "COMPONENT   CPIC (TCP/IP) with Unicode",
      "VERSION     3",
      "RC          471",
      "MODULE      D:/depot/bas/753_REL/src/krn/si/cpic/r3cpic.c",
      "LINE        6378",
      "DETAIL      called function STSEND in state state=BUFFER_DATA2",
      "COUNTER     8"
    ]
  }
}

P3. 2020-04-24 13:00:37

post acquire for id 154, we have a client with INVALID_HANDLE error in .connectionInfo
however, the following invoke for 154 succeeds!

Pattern 3 repeats 2020-04-24 13:30:27, id 16, and 2020-04-24 13:30:28, id 17, 2020-04-24 13:33:03, id 19, 2020-04-24 13:33:05, id 20, all with the same ._connectionHandle.

Then, we get another instance of pattern 2 (2020-04-24 13:33:12, identical ._connectionHandle concurrently used for 2 client.ids, 22 and 23, leading to RFC_COMMUNICATION_FAILURE.

Back to pattern 3 for a while, then another pattern 2, this time with 3 ids that share the same ._connectionHandle (2020-04-24 13:35:54, ids 32, 33, and 34). The acquire for 34 does not come with an INVALID_HANDLE error but the invoke for 34 results in RFC_COMMUNICATION_FAILURE.

One instance of pattern 3 and then we're back to pattern 1 again.

Todays log has the same 3 failure patterns. Pattern 3 means that an INVALID_HANDLE error for .connectionInfo after acquire is not a certain indicator of invoke failure, but currently, I'm more curious how several consecutive and even concurrent acquires can end up with the same ._connectionHandle, with differing .connectionInfos.

The frequent reuse of ._connectionHandles suggests that the .status might better be associated with handles, not with clients.

I cannot see any obvious culprit in node-rfc (I had several theories, none of which survived further inspection;-), so either there is a non-obvious issue (perhaps via node-api, or your suspected interaction of Promise implementations) or the issue may be in the SDK. This server is build via CI and does not yet include your Promise-avoidance commit but when I log Promise, I get [native code]. logslice-2020-04-24.txt logslice-2020-04-27.txt

clausreinke commented 4 years ago

Hi again,

I've finally managed to produce a tiny example capable of reproducing the issue without involving our project code. It is not completely automated, as the issue is timing-related and -so far- needs a human touch to trigger.

invalid-handle.zip

The example (index.js + package.json) is a small express server with only a single route, which -via node-rfc- calls a standard ABAP function FUNCTION_IMPORT_DOKU to describe itself

unzip, npm install
add a ./sapnwrfc.ini with DEST=DEV
start the server: RFC_INI=./sapnwrfc.ini node index.js
visit the route in your browser: http://localhost:3456/doc/function
the non-trivial result is returned to the HTTP client and logged to console
now repeatedly reload, not neccessarily giving the server time to complete each request
after a few quick (<20, depending on your timing) repeats, the INVALID_HANDLE appears

(tried with node v14.1.0 on windows 10, using node-rfc@1.2.0)

It seems important not only that the RFC call returns a non-trivial result, but that the JS thread is blocked for a while logging said result while new HTTP requests keep coming in.

clausreinke commented 4 years ago

Hi @bsrdjan It has been quiet here these last 3 weeks - have you been able to reproduce the issue using the mini-project I provided last week? Any ideas/suggestions on how to proceed?

bsrdjan commented 4 years ago

Hi @clausreinke and sorry for the longer waiting. I just did a brief test and can investigate in detail next week. Thank you very much for this example, it helps reproducing on Linux. Did not try on Windows but expect the same result. I slightly modified the test setup as follows:

start the server in background mode, redirecting the output to log file:
```
$ RFC_INI=./sapnwrfc.ini node index.js > log.txt &
```

repeat reload by:

$ for i in {1..10}; do curl localhost:3456/doc/function; done

Please find logs with RFC_INVALID_HANDLE attached, need to investigate in detail: logs.zip

clausreinke commented 4 years ago

Hi again; I had another look through the node-rfc code and could still not find any obvious trouble spots. So I was wondering whether the code might tell a higher-level story that somehow could break down at a lower level (so we'd only think everything is fine).

So far, I've only found one potential issue along those lines. Perhaps you've covered it already but as we're stuck, I thought I'd mention it.

Background: according to the programming guide, the Sap NW RFC library is thread-safe, but for communication and table handles. node-rfc covers table handles by copying the data and communication handles by using a semaphore invocationMutex.

potential issue: each invocationMutex is associated with a node-rfc Client, not with the underlying communication handle (should there be a map from handles to locks instead?)
- now, if the communication handle is closed behind the scenes, because of an error, it could be reused for another client, resulting in two clients, two semaphores, two threads and one handle
- for RfcInvoke, the code already protects against this case by opening an new connection if the old one was closed due to error
- the API docs only mention the closing of handles in error case for RfcInvoke but the error codes are identical for other API-calls that involve handles - do they need the same protection? For instance, what happens if RfcGetFunctionDesc runs into an error that closes the connection handle?
question: client.connectionInfo calls RfcGetConnectionAttributes without locking; it is read-only, so this seems reasonable but could it be possible for other threads to change the connection handle while RfcGetConnectionAttributes is running? Could that cause issues?

clausreinke commented 4 years ago

Yay!-)

Following the theory from #147, I've instrumented node-rfc to log, in the Client deconstructor, which client id closes which handle; I've also activated the LOCK logging, add client id and handle there, too; then I've re-run the reproduction example:

In many cases, at least, this does indeed seem to be the cause of trouble, because a client that previously owned a handle closes it again when it has already been reopened by another client - that other client then fails because the handle has disappeared.

I suggest to keep track of whether a client has lost control of its handle

by closing it explicitly
by running into an error that closed it implicitly

As the latter will happen in Execute, this is probably where the checks will have to be (don't call anything if handle control has been lost, after call, check whether handle control has been lost due to error). If neccessary/helpful, OnOK can reestablish control by opening a new handle. And the deconstructor must not try to close a handle it has lost control of.

bsrdjan commented 4 years ago

Hi @clausreinke,

thank you for the thoughtful analysis and the patience with this issue.

Based on you example, the test setup with express is in invalid_handle folder and the issue is reproducible with run.sh script.

The test is based on your example, with following added:

1) The node-rfc dev version used, for faster testing

2) The client http requests simulated by curl/wget/...

3) The client http request number can be optionally passed to express and used in tracing of Pool states during requests.

The test with let say 20 requests, creates 5 RFC trace files, indicating 5 threads consuming the Pool. The current implementation of the Pool can't handle that safely because there is no lock/mutex to synchronise taking one connection from pool and giving it back. That must be protected.

Inspecting RFC traces, I saw connection handles occurring after closing the connection and learned that handles are not necessarily unique during the session.

The handle is just the pointer to the C ++ object, and after a free / delete the C-Runtime can re-use the same address again in a subsequent malloc / new. It happens very often (especially on Windows), that ever smarter memory management system “notices” the block of memory is just the right size and the block is re-used, instead of allocating a new one, causing unnecessary fragmentation.

The synchronised consumption of connections fails normally in a following way. The Thread A opens connection x and does something with it. The call ends in SysFailure / CommFailure, so that the connection breaks (and is deleted again). The application does not pay attention to the return code and tries to continue and use actually broken connection. This normally results in INVALID_HANDLE.

In current Pool implementation, the connection is closed when released back to Pool, as attempt to prevent reusing eventually broken connection. But even that is not possible with synchronised threads. The close in release shall be definitely removed and connections in release closed only if the total > Pool.max.

The traces of the current Pool implementation show something like: Before Thread A "uses" handle x, Thread B comes along and also opens a connection. By chance, the C-Runtime also takes the address x again. So now the deal is "valid" again. Now thread A continues, just using the connection that thread B just opened for itself ... (It took a while until debugged and understood ...)

The lower the frequency with which connections are opened and closed again, the greater is statistical probability that this will happen.

I would therefore move the Pool to C, synchronise with existing Client mutex, in next couple of days and re-test :)

clausreinke commented 4 years ago

I'm not sure I understand you correctly, in particular, how moving the connection pool from js to c would help with the multiple threads. There should be only a single thread evaluating js callbacks, the other threads should be from a libuv thread pool, accessed via napi async workers; according to docs, only the Execute method actually runs outside the main thread (main thread being the event loop running js code), on one of the libuv worker threads.

The client semaphore only works for the connection as long as the latter is open - once the connection has been closed, be it by the client or by the backend or by the rfc library, the client must not use it again, semaphore or not; in particular, the close in the client destructor (while necessary) must be guarded, not by whether the connection is currently valid, but by whether it has already been closed since this client opened it.

Even if a client successfully closes its connection, it must not try to do so again in its destructor, because another client (using a different semaphore) may have opened the same connection in the meantime.

bsrdjan commented 4 years ago

It took more time and here the re-designed version: v2-pre branch.

Please feel free to test, review and share the feedback. The most important changes are in the changelog and in documentation: direct-and-managed-clients

bsrdjan commented 4 years ago

The version 2.0 published, should fix the issue.

bsrdjan commented 4 years ago

Just checking the status here. Any updates in the meantime, any issues left ?

bsrdjan commented 4 years ago

Let close this one and open new issue of needed.

olsonperrensen commented 2 years ago

Can someone provide me a mirror link for nwrfc750P ?

bsrdjan commented 2 years ago

Can someone provide me a mirror link for nwrfc750P ?

There are no mirrors, SAP Support Portal is the only distribution channel for SAP NW RFC SDK

https://support.sap.com/en/product/connectors/nwrfcsdk.html

Information on where to download the 7.50 version of the SAP NetWeaver RFC SDK and information about currently supported platforms can be found in SAP Note 2573790.

SAP / node-rfc

RFC_INVALID_HANDLE #128