Closed: clausreinke closed this issue 4 years ago
I am aware of this long-standing issue, which is hardly reproducible in a controlled manner.
I finally found a simple way to reproduce it (unit tests in `locktest.js`), but I can't be sure exactly the same cause is at work in your environment.
Reading the code, an empty `connectionInfo` should never be possible; I need to investigate that further.
Could you build the current master from source and test for some time whether it improves things in your environment? I can also build a Windows binary or publish a (pre)release with this fix, if needed.
I've built from source (`npm install`, then `npm prebuild`) and will test next week.
From looking at the commit, the casts should already be implicit in `mallocU`/`memsetU`, so it is about an additional error check (always useful) and a lock in `PrepareAsync`, right?
I don't understand either how the `connectionInfo` gets emptied, since it is read directly from the SAP NWRFC library, and it looks like there should at least be empty fields.
Btw, the connectionInfo structure looks a bit different in nwrfc750P_6-70002755 (`_RFC_ATTRIBUTES` in `sapnwrfc.h`): compare the fields `user`, `cpicConvId`, `progName`, and `partnerBytesPerChar`. There are also some new fields.
Besides casts, additional exclusive locks are added in `Ping()` and `PrepareAsync()` (see `Lock/UnlockMutex()`).
With the just-pushed commit c5a443b, another exclusive lock is added in `ConnectionInfo()`, to prevent another client from possibly interfering after the connection handle check, i.e. invalidating the connection handle before `connectionInfo` is read. But even in that case, an error should be returned, not an empty `connectionInfo`.
It is hard to believe, but the SAP NWRFC SDK might also have a bug; I will check with the SDK team :)
)-: just had another one, this time with node-rfc e69b5b9c5a085b51d2a8c95727fe80ab199d5965
Same basic effect: a connection with empty `connectionInfo` straight out of the connection pool `acquire`.
Not sure whether this will help, but since we lack information: immediately before the bogus `acquire` result, there were two near-simultaneous calls.
We log `client.id`, `client.isAlive`, and `client.connectionInfo` immediately after `acquire`, immediately before `invoke`, and immediately after `release` (both success and error cases):
```
16:32:51 post acquire ----> 64 true connectionInfo
16:32:52 pre invoke ----> 64 true connectionInfo
16:32:52 post acquire ----> 65 true connectionInfo
16:32:52 pre invoke ----> 65 true connectionInfo
16:32:52 post release (success) ----> 65 false {}
16:32:53 post release (success) ----> 64 false {}
16:32:53 post acquire ----> 66 true {}
16:32:53 pre invoke ----> 66 true {}
16:32:53 post release (error) ----> 66 false {}
16:32:53 communication/network problem {
  "name": "RfcLibError",
  "code": 13,
  "key": "RFC_INVALID_HANDLE",
  "message": "An invalid handle was passed to the API call"
}
```
Note that `client.connectionInfo` always reports as empty immediately after `release`. Then it reports as empty immediately after `acquire`, the error is raised in `invoke`, leading to `release` and the error log in the error case.
`connectionInfo` usually includes:

```
{
  ..,
  "codepage": "4103",
  "partnerCodepage": "4103",
  "rfcRole": "C",
  "type": "E",
  "partnerType": "3",
  "rel": "753",
  "partnerRel": "740",
  "kernelRel": "749",
  "cpicConvId": "57521748",
  "progName": "SAPLSYST",
  "partnerBytesPerChar": "2",
  "partnerSystemCodepage": "4103",
  ..
}
```
Connection pool options are `{ "min": 1, "max": -42 }`.
After discussing with NWRFC SDK experts, two changes are implemented with f058683:
- The `IsConnectionHandleValid` check may return true even after the connection is closed, if the connection handle in the internal NWRFC SDK cache still looks valid. In `ConnectionInfo`, that check is therefore replaced by a simple connection `alive` flag check.
- More restrictive logic is added to reset this flag after certain error types raised on the ABAP server side.
Could you please build from master with this commit included and re-test?
The NWRFC SDK trace would also be helpful for analysis. It can be activated by setting the connection parameter `trace` to `3`. If the resulting trace file is not too big and contains no confidential data, please attach it, or we can find another way to share it.
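For reference, a minimal `sapnwrfc.ini` destination with tracing switched on might look like this (parameter names per the NWRFC SDK documentation; host and credential values here are placeholders, adjust to your landscape):

```ini
DEST=DEV
ASHOST=myhost.example.com
SYSNR=00
CLIENT=100
USER=myuser
PASSWD=mypassword
TRACE=3
```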
Thanks - I've pulled, installed, prebuilt, and I'm going to test next week. Not sure about the trace yet, as it may run for a day between errors (for such cases, it would be useful if the SDK had a "rotating trace" option that only keeps the last hour of trace output - that could be copied once an error occurred; for call-level tracing, we had to add our own trace variant anyway to get a useful amount of trace data).
Btw, while the previous fixes did not subjectively change the number of `INVALID_HANDLE`s, they did change their symptoms: all occurrences this week were of the kind "empty connectionInfo after acquire" (previously, there were other variations mixed in). So you might well be on the right track.
sadly, the `INVALID_HANDLE` reappeared quickly, though apparently as a side-effect of an "Unknown error in lower level library" (the latter had previously happened at a rate of about once per week, at least once we were past the buggy SDK patch levels).
So the symptoms have changed, and the trace happened to be on at the time:
logging (note the interleaving and apparent interference of client ids 30 and 31, as well as the error info in place of `connectionInfo`):
```
18:16:52 post acquire----> 30 true {
"dest": "",
"host": "..",
"partnerHost": "..",
"sysNumber": "00",
"sysId": "..",
"client": "..",
"user": "..",
"language": "D",
"trace": "3",
"isoLanguage": "DE",
"codepage": "4103",
"partnerCodepage": "4103",
"rfcRole": "C",
"type": "E",
"partnerType": "3",
"rel": "753",
"partnerRel": "740",
"kernelRel": "749",
"cpicConvId": "23016689",
"progName": "SAPLSYST",
"partnerBytesPerChar": "2",
"partnerSystemCodepage": "4103",
"partnerIP": "..",
"partnerIPv6": ".."
}
18:16:52 .. called // invoke with client.id 30 happens here
// here we acquire client.id 31, claiming to be alive but with an error instead of connectionInfo
18:16:53 post acquire----> 31 true {
"name": "RfcLibError",
"code": 13,
"key": "RFC_INVALID_HANDLE",
"message": "An invalid handle was passed to the API call"
}
// here comes the callback for the failed invoke with client.id 30
18:16:53 post release (error) ----> 30 false {}
18:16:53 {
"name": "RfcLibError",
"code": 28,
"key": "",
"message": "Unknown error in lower level library"
}
```
trace (three trace files seem to have been active at the time in question, one for each connection and one for admin/cleanup?). Note: both `RfcOpenConnection` calls return `RFC_OK` (in spite of the `RfcLibError` in `connectionInfo` above), and there is only one call to `RfcGetConnectionAttributes`, not two.
`rfc35832_30048.trc` (`client.id` 30):
```
2020-04-04 18:16:52.630909 [30048] >> RfcOpenConnection
>> RfcOptions
Partner Char Size: 1
Client: ..
User: ..
Alias:
Passwd: (- secret -)
Language: D
MySAPSSO2:
GetSSO2: 0
X.509 Certificate:
ExtidData:
ExtidType:
SysNr: 00
ASHost: ..
ASService:
GWHost:
GWService: sapgw00
R3Name:
MSHost:
MSService:
WSHost:
WSPort:
TCPHost:
TCPPort:
TLS Cryptolib:
TLS Client PSE:
TLS Server PSE:
TLS Trust All: Off
Proxy Host:
Proxy Port:
Proxy User:
ProgramID:
SAPRouter:
SNC Mode: 0
SNC QOP: 0
SNC Lib:
SNC Myname:
SNC Partnername:
Codepage: 1100
Trace: 3
LogonCheck: 1
Compression: 1
Serialization: 0
Ignore Char Conv Errors: 0
Substitution Char: 35
UseRoundtripOptimization: 0
SO Keepalive: 0
CPIC Keepalive: 0
Delta Manager: 1
<< RfcOptions
2020-04-04 18:16:52.631194 [30048] >> RfcOptions::determineType
determined type RFC_CLIENT
2020-04-04 18:16:52.631276 [30048] << RfcOptions::determineType returned 0
2020-04-04 18:16:52.631325 [30048] >> RfcConnection::openNew
2020-04-04 18:16:52.631381 [30048] >> RfcConnection::connect
2020-04-04 18:16:52.631426 [30048] >> CpicConnection::connectAsClient
2020-04-04 18:16:52.637434 [30048] << CpicConnection::connectAsClient returned 0
>> HEXDUMP
2020-04-04 18:16:52.637600 [30048] Sending data for handle 2138511726400, convID 23016689, codepage 1100
..
<< HEXDUMP
>> HEXDUMP
2020-04-04 18:16:52.642371 [30048] Receiving data for handle 2138511726400, convID 23016689, codepage unknown
..
<< HEXDUMP
RfcConnection handle 2138511726400 established
2020-04-04 18:16:52.642790 [30048] << RfcConnection::connect returned 0
>> RfcConnection name="none"
Handle = 2138511726400 ConvID = 23016689
>> RfcControl name=".."
Program Name: SAPLSYST
Codepage: 4103 Communication Codepage: 1100
SysID: .. Target: .. Mandant: .. SysNr: 00 Language: D User: ..
Conversation ID: 23016689 Role: C753
Partner Rel: 740 Partner Kernel Rel: 749 Partner IP: .. Partner IPv6: .. Protocol version: 3595
Exclusive: false Get SSO2: false Use basXML: 0
<< RfcControl
>> CpicConnection
CPIC Handle: 0
Gateway Host: Gateway Service: AS Host: AS Service: Program ID:
Protocol: I
Conversation ID: 23016689
Last IO Time: Sat Apr 04 18:16:52 2020
Data Received: 3 Status Received: 2 All Received: 1
Connected: true Deallocated: false Accepted: 0
Server Mode: false Registered Mode: false Just Accepted: 0
<< CpicConnection
Asynchron: 0
Has Error: 0
<< RfcConnection
2020-04-04 18:16:52.643036 [30048] << RfcConnection::openNew returned 0
2020-04-04 18:16:52.643065 [30048] << RfcOpenConnection returned RFC_OK
2020-04-04 18:16:52.762744 [30048] >> RfcInvoke
Invoking .. on handle 2138511726400
>> RfcFunction name=".."
..
<< RfcFunction
2020-04-04 18:16:52.764140 [30048] >> RfcFunction::rfcCallReceive
>> HEXDUMP
2020-04-04 18:16:52.765782 [30048] Sending data for handle 2138511726400, convID 23016689, codepage 4103
..
<< HEXDUMP
2020-04-04 18:16:53.291129 [30048] >> Exception occurred
[Handle: 2138511726400 ConvID: 0 SysID: .. SysNr: 00 Target: .. User: .. Mandant: .. Language: D Codepage: 4103]
Unknown error during RFC call
2020-04-04 18:16:53.291211 [30048] << RfcFunction::rfcCallReceive returned 15
2020-04-04 18:16:53.294320 [30048] << RfcInvoke returned RFC_UNKNOWN_ERROR
RFC_ERROR_INFO.message: Unknown error in lower level library
```
`rfc35832_30680.trc` (`client.id` 31):
```
2020-04-04 18:16:53.249346 [30680] >> RfcOpenConnection
>> RfcOptions
Partner Char Size: 1
Client: ..
User: ..
Alias:
Passwd: (- secret -)
Language: D
MySAPSSO2:
GetSSO2: 0
X.509 Certificate:
ExtidData:
ExtidType:
SysNr: 00
ASHost: ..
ASService:
GWHost:
GWService: sapgw00
R3Name:
MSHost:
MSService:
WSHost:
WSPort:
TCPHost:
TCPPort:
TLS Cryptolib:
TLS Client PSE:
TLS Server PSE:
TLS Trust All: Off
Proxy Host:
Proxy Port:
Proxy User:
ProgramID:
SAPRouter:
SNC Mode: 0
SNC QOP: 0
SNC Lib:
SNC Myname:
SNC Partnername:
Codepage: 1100
Trace: 3
LogonCheck: 1
Compression: 1
Serialization: 0
Ignore Char Conv Errors: 0
Substitution Char: 35
UseRoundtripOptimization: 0
SO Keepalive: 0
CPIC Keepalive: 0
Delta Manager: 1
<< RfcOptions
2020-04-04 18:16:53.249535 [30680] >> RfcOptions::determineType
determined type RFC_CLIENT
2020-04-04 18:16:53.249584 [30680] << RfcOptions::determineType returned 0
2020-04-04 18:16:53.249613 [30680] >> RfcConnection::openNew
2020-04-04 18:16:53.249667 [30680] >> RfcConnection::connect
2020-04-04 18:16:53.249695 [30680] >> CpicConnection::connectAsClient
2020-04-04 18:16:53.251574 [30680] << CpicConnection::connectAsClient returned 0
>> HEXDUMP
2020-04-04 18:16:53.251665 [30680] Sending data for handle 2138513959152, convID 23018303, codepage 1100
..
<< HEXDUMP
>> HEXDUMP
2020-04-04 18:16:53.256384 [30680] Receiving data for handle 2138513959152, convID 23018303, codepage unknown
..
<< HEXDUMP
RfcConnection handle 2138513959152 established
2020-04-04 18:16:53.257160 [30680] << RfcConnection::connect returned 0
>> RfcConnection name="none"
Handle = 2138513959152 ConvID = 23018303
>> RfcControl name=".."
Program Name: SAPLSYST
Codepage: 4103 Communication Codepage: 1100
SysID: .. Target: .. Mandant: .. SysNr: 00 Language: D User: ..
Conversation ID: 23018303 Role: C753
Partner Rel: 740 Partner Kernel Rel: 749 Partner IP: .. Partner IPv6: .. Protocol version: 3595
Exclusive: false Get SSO2: false Use basXML: 0
<< RfcControl
>> CpicConnection
CPIC Handle: 0
Gateway Host: Gateway Service: AS Host: AS Service: Program ID:
Protocol: I
Conversation ID: 23018303
Last IO Time: Sat Apr 04 18:16:53 2020
Data Received: 3 Status Received: 2 All Received: 1
Connected: true Deallocated: false Accepted: 0
Server Mode: false Registered Mode: false Just Accepted: 0
<< CpicConnection
Asynchron: 0
Has Error: 0
<< RfcConnection
2020-04-04 18:16:53.257545 [30680] << RfcConnection::openNew returned 0
2020-04-04 18:16:53.257588 [30680] << RfcOpenConnection returned RFC_OK
```
`rfc35832_12896.trc` (admin/cleanup?):
```
2020-04-04 18:16:52.643281 [12896] >> RfcGetConnectionAttributes
2020-04-04 18:16:52.643328 [12896] << RfcGetConnectionAttributes returned RFC_OK
2020-04-04 18:16:53.271555 [12896] >> RfcCloseConnection
2020-04-04 18:16:53.271647 [12896] >> RfcConnection::closeHandle
>> RfcConnection name="none"
Handle = 2138511726400 ConvID = 23016689
>> RfcControl name=".."
Program Name: SAPLSYST
Codepage: 4103 Communication Codepage: 1100
SysID: .. Target: .. Mandant: .. SysNr: 00 Language: D User: ..
Conversation ID: 23016689 Role: C753
Partner Rel: 740 Partner Kernel Rel: 749 Partner IP: .. Partner IPv6: .. Protocol version: 3595
Exclusive: false Get SSO2: false Use basXML: 0
<< RfcControl
>> CpicConnection
CPIC Handle: 0
Gateway Host: Gateway Service: AS Host: AS Service: Program ID:
Protocol: I
Conversation ID: 23016689
Last IO Time: Sat Apr 04 18:16:52 2020
Data Received: 0 Status Received: 0 All Received: 1
Connected: true Deallocated: false Accepted: 0
Server Mode: false Registered Mode: false Just Accepted: 0
<< CpicConnection
Asynchron: 0
Has Error: 0
<< RfcConnection
2020-04-04 18:16:53.272157 [12896] >> RfcConnection::disconnect
2020-04-04 18:16:53.272204 [12896] >> CpicConnection::close
2020-04-04 18:16:53.283041 [12896] << CpicConnection::close returned 0
2020-04-04 18:16:53.283110 [12896] << RfcConnection::disconnect returned 0
2020-04-04 18:16:53.283157 [12896] << RfcConnection::closeHandle returned 0
2020-04-04 18:16:53.283184 [12896] << RfcCloseConnection returned RFC_OK
2020-04-04 18:16:53.283285 [12896] >> RfcCloseConnection
2020-04-04 18:16:53.283310 [12896] >> RfcConnection::closeHandle
>> RfcConnection name="none"
Handle = 2138513959152 ConvID = 23018303
>> RfcControl name=".."
Program Name: SAPLSYST
Codepage: 4103 Communication Codepage: 1100
SysID: .. Target: .. Mandant: .. SysNr: 00 Language: D User: ..
Conversation ID: 23018303 Role: C753
Partner Rel: 740 Partner Kernel Rel: 749 Partner IP: .. Partner IPv6: .. Protocol version: 3595
Exclusive: false Get SSO2: false Use basXML: 0
<< RfcControl
>> CpicConnection
CPIC Handle: 0
Gateway Host: Gateway Service: AS Host: AS Service: Program ID:
Protocol: I
Conversation ID: 23018303
Last IO Time: Sat Apr 04 18:16:53 2020
Data Received: 3 Status Received: 2 All Received: 1
Connected: true Deallocated: false Accepted: 0
Server Mode: false Registered Mode: false Just Accepted: 0
<< CpicConnection
Asynchron: 0
Has Error: 0
<< RfcConnection
2020-04-04 18:16:53.283547 [12896] >> RfcConnection::disconnect
2020-04-04 18:16:53.283572 [12896] >> CpicConnection::close
2020-04-04 18:16:53.293954 [12896] << CpicConnection::close returned 0
2020-04-04 18:16:53.294004 [12896] << RfcConnection::disconnect returned 0
2020-04-04 18:16:53.294040 [12896] << RfcConnection::closeHandle returned 0
2020-04-04 18:16:53.294069 [12896] << RfcCloseConnection returned RFC_OK
2020-04-04 18:16:53.301973 [12896] >> RfcDestroyFunction
Deleting function ..
2020-04-04 18:16:53.302058 [12896] << RfcDestroyFunction returned RFC_OK
```
Checked together with NWRFC SDK experts: `RFC_UNKNOWN_ERROR`, coming from levels below the SDK, makes it hard to guess what could be wrong here.
I found two critical `node-rfc` gaps which might have an impact here:
1. Closing the connection is possible while an RFC call is ongoing.
2. The Pool `reopen()` method, used when releasing the connection back to the Pool, resolves to void/undefined. It should resolve to the Client instance, just like the Pool `open()` method.
These gaps, leading to issues like #132 and #133, are now fixed in master.
Client locking is also checked in detail, and more unit tests were added for the Pool.
All these fixes will be shipped with the next release, as soon as #122 is fixed; we are waiting for https://github.com/prebuild/prebuild/issues/267.
If the wait for the `prebuild` issue takes longer, please feel free to build from the current master: 5837df5
These newly added features might be helpful for troubleshooting and custom logging:
- Number of RFC calls currently running via the client instance: https://github.com/SAP/node-rfc/blob/master/test/locking/callback.spec.js#L41
- Throughput of one or more client instances: https://github.com/SAP/node-rfc/blob/master/test/throughput.spec.js
- Client instance creation and last open/call/close timestamps: https://github.com/SAP/node-rfc/blob/master/test/throughput.spec.js
Hope this helps :)
Changes are published in v1.1.0
Thanks, it is good to see you're active on these issues (and in touch with the SDK folks) and that there is visible progress. Though the "closing while active" issue should not affect us.
I did not have as much opportunity to test last week as before and did not keep the trace on all the time but -subjectively- the number of incidents seems to have gone down (from at least once per day to less than once per day, over a week). Will update and check whether the additional info helps with logging.
And having the existing fixes published prompts me to update our other servers (I kept them running on 1.0.3 for comparison, and they're still having more frequent issues, including random shutdowns).
The thing that confused me about the last logs I provided was that one call failed and another, interleaved, call got the INVALID_HANDLE - was that a coincidence of two calls failing one after the other with different errors, or was there some interaction between one call failing and the other one's error message?
Btw, when browsing the trace files, I see that the "Sending data" entries vary between code pages 4103 and 1100 while the "Receiving data" entries vary between 4103 and unknown - is that normal (I would expect sending/receiving to stick to one known code page each)?
I'm currently a bit confused about the extra outputs, in particular the `.status` timestamps and `._connectionHandle`s. I am also starting to wonder where the client `.id` comes from and how it is expected to behave (as it is used to index the pool).
First, I noticed that some freshly acquired `Client`s had `lastcall` stamps that were smaller than their `lastopen` stamps; then I followed one `._connectionHandle` and saw it reappear with different client ids. And the client ids are not monotonically increasing, either.
Slice of log for a specific `._connectionHandle` and the `.id`s associated with it (in order, each line has client `.id`, `._connectionHandle`, `.isAlive`, `.runningRFCCalls`, `.status`):
```
12:54:18 post acquire ----> 3 (1506563273056) true 0 {"created":1586861658889,"lastopen":1586861658889,"lastclose":0,"lastcall":0}
12:54:20 post release (success) ----> 3 (undefined) false 0 {"created":1586861658889,"lastopen":1586861658889,"lastclose":1586861660586,"lastcall":1586861660114}
14:22:48 post acquire ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866961499,"lastclose":0,"lastcall":1586866961390}
14:22:48 post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866968976,"lastclose":0,"lastcall":1586866968887}
14:22:49 post acquire ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866968976,"lastclose":0,"lastcall":1586866968887}
14:22:49 post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866969261,"lastclose":0,"lastcall":1586866969166}
14:22:49 post acquire ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866969261,"lastclose":0,"lastcall":1586866969166}
14:22:49 post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866969861,"lastclose":0,"lastcall":1586866969830}
14:22:50 post acquire ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866969861,"lastclose":0,"lastcall":1586866969830}
14:23:00 post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866969861,"lastclose":1586866980632,"lastcall":1586866970450}
```
Here is an INVALID_HANDLE with the latest versions (again, trace wasn't on, but some of the new logging options were)
versions: `{"major":7500,"minor":0,"patchLevel":6,"binding":"1.1.0"}`
`connectionInfo`:

```
{
  "dest": "",
  "host": "-",
  "partnerHost": "-",
  "sysNumber": "00",
  "sysId": "-",
  "client": "-",
  "user": "-",
  "language": "D",
  "trace": "0",
  "isoLanguage": "DE",
  "codepage": "4103",
  "partnerCodepage": "4103",
  "rfcRole": "C",
  "type": "E",
  "partnerType": "3",
  "rel": "753",
  "partnerRel": "740",
  "kernelRel": "749",
  "cpicConvId": "89864650",
  "progName": "SAPLSYST",
  "partnerBytesPerChar": "2",
  "partnerSystemCodepage": "4103",
  "partnerIP": "-",
  "partnerIPv6": "-"
}
```
given that previous traces showed code pages 4103 and 1100, is it odd that there is only 4103 here?
As the connectionHandle sees some reuse, here is a log slice wrt that handle, ending in an INVALID_HANDLE directly on `acquire`. Note that after `release`/`acquire`, the same handle is `acquire`d with different ids, and on `acquire`, the timestamps change even if the handle does not:
```
18:31:23 connection, post acquire----> 3 (3069229760224) true 0 {"created":1586881883129,"lastopen":1586881883129,"lastclose":0,"lastcall":0} {connectionInfo}
18:31:23 RFC F1 called
18:31:25 RFC F1 (after 2.821s) returns
18:31:26 connection, post release (success) ----> 3 (3069229760224) true 0 {"created":1586881883129,"lastopen":1586881883129,"lastclose":1586881886000,"lastcall":1586881883176} {}
19:12:24 connection, post acquire----> 4 (3069229760224) true 0 {"created":1586884344466,"lastopen":1586884344466,"lastclose":0,"lastcall":0} {connectionInfo}
19:12:24 RFC F1 called
19:12:26 RFC F1 (after 2.269s) returns
19:12:26 connection, post release (success) ----> 4 (undefined) false 0 {"created":1586884344466,"lastopen":1586884344466,"lastclose":1586884346760,"lastcall":1586884344489} {}
19:12:29 connection, post acquire----> 5 (3069229760224) true 0 {"created":1586884349515,"lastopen":1586884349516,"lastclose":0,"lastcall":0} {connectionInfo}
19:12:29 RFC F2 called
19:12:30 RFC F2 (after 1.058s) returns
19:12:30 connection, post release (success) ----> 5 (3069229760224) true 0 {"created":1586884349515,"lastopen":1586884350590,"lastclose":0,"lastcall":1586884349531} {}
19:12:30 connection, post acquire----> 5 (3069229760224) true 0 {"created":1586884349515,"lastopen":1586884350590,"lastclose":0,"lastcall":1586884349531} { "name": "RfcLibError", "code": 13, "key": "RFC_INVALID_HANDLE", "message": "An invalid handle 'RFC_CONNECTION_HANDLE' was passed to the API call" }
```
Thanks a lot for the very valuable input.
> Btw, when browsing the trace files, I see that the "Sending data" entries vary between code pages 4103 and 1100 while the "Receiving data" entries vary between 4103 and unknown - is that normal (I would expect sending/receiving to stick to one known code page each)?
Yes, that’s normal. No need to worry, the codepages are handled automatically between NWRFC SDK and RFC server system.
> I am also starting to wonder where the client `.id` comes from and how it is expected to behave (as it is used to index the pool).
The client id comes from the N-API/C wrapper and is incremented with each new Client instance. After opening the connection, the client gets the connection handle. The handle can change after re-opening the connection; the id stays the same. The reopen happens, for example, after the client is released back to the pool and acquired for a new task.
> The thing that confused me about the last logs I provided was that one call failed and another, interleaved, call got the INVALID_HANDLE - was that a coincidence of two calls failing one after the other with different errors or was there some interaction between one call failing and the other ones error message?
This could actually be the cause of getting RFC_INVALID_HANDLE. After certain errors, the connection is automatically closed and cleaned up by the NWRFC SDK. node-rfc was so far transparent in this regard, only updating the `isAlive` flag. Checking this flag before each call in NodeJS is not enough, because one client can start multiple `client.call()` calls in parallel and all would see `alive === true`. After one call fails with a critical error, the connection is broken and no longer useful for other pending calls. When the next call comes up for execution, RFC_INVALID_HANDLE is raised.
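This race can be simulated in plain Node, without node-rfc (all names here are illustrative, not the real API): two parallel calls both pass the alive check, the first fails with a critical error that invalidates the shared handle, and the second then fails with the equivalent of RFC_INVALID_HANDLE.

```javascript
class FakeClient {
  constructor() {
    this.alive = true;        // what the node-rfc isAlive flag models
    this.handleValid = true;  // what the SDK-side connection handle models
  }
  async call(shouldFail, delayMs) {
    if (!this.alive) throw new Error("connection closed");
    // simulate the RFC round trip
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    if (shouldFail) {
      // critical error: the SDK closes and cleans up the connection
      this.handleValid = false;
      this.alive = false;
      throw new Error("RFC_COMMUNICATION_FAILURE");
    }
    if (!this.handleValid) throw new Error("RFC_INVALID_HANDLE");
    return "ok";
  }
}

async function demo() {
  const client = new FakeClient();
  // both calls start while alive === true, so the flag check passes for both
  const results = await Promise.allSettled([
    client.call(true, 10),  // fails first and invalidates the shared handle
    client.call(false, 20), // passed the alive check, but the handle dies underneath
  ]);
  return results.map((r) => (r.status === "rejected" ? r.reason.message : r.value));
}
```

Per-call flag checks in JavaScript cannot close this window; only serializing or re-validating at the native layer can.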
With the latest commit 5794944, node-rfc automatically re-opens the connection after critical errors. This should eliminate, or at least considerably reduce, RFC_INVALID_HANDLE errors.
If the problem persists, I would try to reproduce by using the same stack (any other components or frameworks on top of node-rfc?) and calling the `RFC_PING_AND_WAIT` and `RFC_RAISE_ERROR` ABAP modules, to simulate some real ABAP processing.
After a full day of testing 1.1.0 on two servers, that version definitely generates a lot more RFC_INVALID_HANDLEs, but almost all of them come immediately after the `acquire`, the rest from reused handles. So that would support your theory. Will update to the current HEAD and test more.
I do find it odd that `acquire` would simply return something that looks like a client connection but is not valid. Reopening connections that are likely invalid should avoid this, but shouldn't `acquire` have returned or thrown a clear error condition instead of returning an invalid handle with an error hidden in the `connectionInfo`?
Still getting a lot of RFC_INVALID_HANDLEs with e7c1fba9e15d4492af2335c381788dee6dc03db8, mostly after `acquire` :-(
Could there be an option for not reusing connections (i.e. not putting the `reopen`ed client back into the pool after `release` but just closing it), so that `acquire` always returns fresh clients?
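The behavior being asked for could be sketched like this (a hypothetical design for illustration, not the node-rfc Pool API): `acquire()` always opens a fresh client, and `release()` closes it for good, so a stale handle can never be handed out twice.

```javascript
// "No reuse" pool sketch: nothing is ever returned to a ready list.
class FreshPool {
  constructor(openClient) {
    this.openClient = openClient; // factory that opens a new connection
    this.nextId = 1;
  }
  async acquire() {
    return this.openClient(this.nextId++); // always a brand-new client
  }
  async release(client) {
    await client.close(); // close instead of recycling
  }
}
```

The trade-off would be connection setup cost on every `acquire`, in exchange for ruling out any handle-reuse races.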
The reason for the empty connectionInfo is still open and being discussed with NWRFC SDK experts, but it is hard to guess without a reproducible example. Which components/frameworks are used on the servers, on top of the Pool? What does a typical application look like? I would try to run more stress tests with a similar setup.
You may also test the Pool without reuse of released connections: 9833196.
In parallel, I will check if the Pool functionality could be moved to C, making it less dependent on the Promises implementation.
I observed some unit tests behaving differently when run from the command line and via npm script, executing exactly the same command line. Sometimes, via npm script, the tests failed because `client.id` could not be referenced, the client being undefined. Isolated tests, and all tests run via the command line, were always passing; they failed only when running all tests via npm script.
Suspecting it could be somehow related to the Promises used in the Pool's internal active/ready collections, I replaced them with Client instances, reducing the overhead.
The npm testing issues are now gone, and perhaps this could also improve the situation in your environment: a6ec7ef
If not, the Pool shall be moved to the C level, becoming independent of Promise implementations on different platforms.
I'm still observing `acquire` returning a `client` where getting the `connectionInfo` will return an INVALID_HANDLE error and where the `_connectionHandle` has been seen before in the same server run (even though `lastcall` is `0` and the `id` is different):
```
21:10:36 connection, post acquire ----> 54 (2783525369200) true 0 {"created":1587582636762,"lastopen":1587582636762,"lastclose":0,"lastcall":0} {"major":7500,"minor":0,"patchLevel":6,"binding":"1.2.0"} {
  "alive": true,
  "name": "RfcLibError",
  "code": 13,
  "codeString": "RFC_INVALID_HANDLE",
  "key": "RFC_INVALID_HANDLE",
  "message": "An invalid handle 'RFC_CONNECTION_HANDLE' was passed to the API call"
}
```
As before, the INVALID_HANDLE in getting the `connectionInfo` predicts/precedes the same error on the actual `invoke` that follows.
I've collected 2 somewhat longer log slices, hoping to see some connections.
As before, the individual calls proceed as follows: `acquire`, `invoke`, then `return` info + `release` (success), or `release` (error) + error message + `return` info.
`acquire` and `release` have an `await`, `invoke` a callback (the `release` calls are in this callback).
Post `acquire` and `release`, we log the client `.id`, `._connectionHandle`, `.isAlive`, `.runningRFCCalls`, `.status`, and `.connectionInfo`. The `client.id` is used to track the other events through interleaved calls.
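For reference, that instrumentation looks roughly like this (our own logging wrapper with hypothetical helper names, not part of node-rfc):

```javascript
// Log the client fields described above at a given phase of the call.
function logState(phase, client) {
  console.log(
    phase, client.id, client._connectionHandle, client.isAlive,
    client.runningRFCCalls, JSON.stringify(client.status),
    JSON.stringify(client.connectionInfo)
  );
}

// acquire/release are awaited; invoke takes a callback, and release
// happens inside that callback for both the success and the error case.
async function tracedCall(pool, fname, params) {
  const client = await pool.acquire();
  logState("post acquire", client);
  return new Promise((resolve, reject) => {
    client.invoke(fname, params, (err, result) => {
      pool.release(client).then(() => {
        logState(err ? "post release (error)" : "post release (success)", client);
        err ? reject(err) : resolve(result);
      });
    });
  });
}
```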
The most common failure pattern is still:

P1. After `acquire`, `.connectionInfo` gives an INVALID_HANDLE error, and `invoke` raises the same error.

However, a couple of other patterns can also be observed:
P2. 2020-04-24 11:51:01: after `acquire` for id 27, we have a client with an INVALID_HANDLE error in `.connectionInfo`; `acquire` for id 28 returns the very same `._connectionHandle` (27 has not been released at this point); `invoke` for 28 results in a communication failure:

```
{
  "alive": true,
  "name": "RfcLibError",
  "code": 1,
  "codeString": "RFC_COMMUNICATION_FAILURE",
  "key": "RFC_COMMUNICATION_FAILURE",
  "message": {
    "message(split at line breaks)": [
      "LOCATION CPIC (TCP/IP) on local host with Unicode",
      "ERROR program state check for conversation 33641697",
      "TIME Fri Apr 24 13:35:57 2020",
      "RELEASE 753",
      "COMPONENT CPIC (TCP/IP) with Unicode",
      "VERSION 3",
      "RC 471",
      "MODULE D:/depot/bas/753_REL/src/krn/si/cpic/r3cpic.c",
      "LINE 6378",
      "DETAIL called function STSEND in state state=BUFFER_DATA2",
      "COUNTER 8"
    ]
  }
}
```
P3. 2020-04-24 13:00:37: after `acquire` for id 154, we have a client with an INVALID_HANDLE error in `.connectionInfo`, but `invoke` for 154 succeeds!

Pattern 3 repeats at 2020-04-24 13:30:27, id 16; 2020-04-24 13:30:28, id 17; 2020-04-24 13:33:03, id 19; and 2020-04-24 13:33:05, id 20, all with the same `._connectionHandle`.
Then, we get another instance of pattern 2 (2020-04-24 13:33:12, identical `._connectionHandle` concurrently used for 2 client ids, 22 and 23, leading to RFC_COMMUNICATION_FAILURE).
Back to pattern 3 for a while, then another pattern 2, this time with 3 ids that share the same `._connectionHandle` (2020-04-24 13:35:54, ids 32, 33, and 34). The `acquire` for 34 does not come with an INVALID_HANDLE error, but the `invoke` for 34 results in RFC_COMMUNICATION_FAILURE.
One instance of pattern 3 and then we're back to pattern 1 again.
Today's log has the same 3 failure patterns. Pattern 3 means that an INVALID_HANDLE error for `.connectionInfo` after `acquire` is not a certain indicator of `invoke` failure, but currently I'm more curious how several consecutive, and even concurrent, `acquire`s can end up with the same `._connectionHandle`, with differing `.connectionInfo`s.
The frequent reuse of `._connectionHandle`s suggests that the `.status` might better be associated with handles, not with clients.
I cannot see any obvious culprit in node-rfc (I had several theories, none of which survived further inspection ;-), so either there is a non-obvious issue (perhaps via node-api, or your suspected interaction of Promise implementations) or the issue may be in the SDK. This server is built via CI and does not yet include your Promise-avoidance commit, but when I log `Promise`, I get `[native code]`.
logslice-2020-04-24.txt
logslice-2020-04-27.txt
Hi again,
I've finally managed to produce a tiny example capable of reproducing the issue without involving our project code. It is not completely automated, as the issue is timing-related and -so far- needs a human touch to trigger.
The example (`index.js` + `package.json`) is a small express server with only a single route, which, via node-rfc, calls a standard ABAP function `FUNCTION_IMPORT_DOKU` to describe itself.

To run:
- `npm install`
- provide `./sapnwrfc.ini` with `DEST=DEV`
- `RFC_INI=./sapnwrfc.ini node index.js`

(tried with node v14.1.0 on Windows 10, using node-rfc@1.2.0)
It seems important not only that the RFC call returns a non-trivial result, but that the JS thread is blocked for a while logging said result while new HTTP requests keep coming in.
Hi @bsrdjan It has been quiet here these last 3 weeks - have you been able to reproduce the issue using the mini-project I provided last week? Any ideas/suggestions on how to proceed?
Hi @clausreinke and sorry for the long wait. I just did a brief test and can investigate in detail next week. Thank you very much for this example, it helps to reproduce the issue on Linux. I did not try on Windows but expect the same result. I slightly modified the test setup as follows:
```sh
$ RFC_INI=./sapnwrfc.ini node index.js > log.txt &
$ for i in {1..10}; do curl localhost:3456/doc/function; done
```
Please find logs with RFC_INVALID_HANDLE attached, need to investigate in detail: logs.zip
Hi again; I had another look through the `node-rfc` code and could still not find any obvious trouble spots. So I was wondering whether the code might tell a higher-level story that somehow breaks down at a lower level (so we'd only think everything is fine). So far, I've only found one potential issue along those lines. Perhaps you've covered it already, but as we're stuck, I thought I'd mention it.

Background: according to the programming guide, the SAP NW RFC library is thread-safe, except for communication and table handles. `node-rfc` covers table handles by copying the data, and communication handles by using a semaphore, `invocationMutex`.
- potential issue: each `invocationMutex` is associated with a `node-rfc` `Client`, not with the underlying communication handle (should there be a map from handles to locks instead?)
- in `RfcInvoke`, the code already protects against this case by opening a new connection if the old one was closed due to an error in `RfcInvoke`, but the error codes are identical for other API calls that involve handles - do they need the same protection? For instance, what happens if `RfcGetFunctionDesc` runs into an error that closes the connection handle?
- question: `client.connectionInfo` calls `RfcGetConnectionAttributes` without locking; it is read-only, so this seems reasonable, but could it be possible for other threads to change the connection handle while `RfcGetConnectionAttributes` is running? Could that cause issues?
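The per-handle locking idea from the first bullet could be sketched in plain JavaScript with a promise-chain mutex per handle. The names `handleLocks` and `withHandleLock` are invented for illustration; this is not node-rfc's actual API:

```javascript
// Sketch: one promise-chain mutex per connection handle, instead of one
// per Client. Two clients sharing the same handle then serialize on the
// same lock, even though they are different Client objects.
const handleLocks = new Map();

function withHandleLock(handle, criticalSection) {
  const tail = handleLocks.get(handle) || Promise.resolve();
  // Chain the critical section behind whatever currently holds the lock.
  const run = tail.then(() => criticalSection());
  // Keep the chain alive even if the critical section rejects.
  handleLocks.set(handle, run.catch(() => {}));
  return run;
}

// Demo: two "clients" sharing handle 42 must not interleave.
async function demo() {
  const order = [];
  const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
  await Promise.all([
    withHandleLock(42, async () => {
      order.push("A:start");
      await sleep(20); // even while A is suspended, B may not enter
      order.push("A:end");
    }),
    withHandleLock(42, async () => {
      order.push("B:start");
      order.push("B:end");
    }),
  ]);
  return order;
}

demo().then((order) => console.log(order.join(",")));
```

With a per-`Client` mutex, the same two sections could run against one shared handle concurrently; keying the lock on the handle closes that gap.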
Yay!-)
Following the theory from #147, I've instrumented `node-rfc` to log, in the `Client` destructor, which client id closes which handle; I've also activated the LOCK logging, adding client id and handle there, too; then I've re-run the reproduction example:

In many cases, at least, this does indeed seem to be the cause of the trouble: a client that previously owned a handle closes it again when it has already been reopened by another client - that other client then fails because the handle has disappeared.
I suggest keeping track of whether a client has lost control of its handle. As the latter will happen in `Execute`, this is probably where the checks will have to be (don't call anything if handle control has been lost; after the call, check whether handle control has been lost due to an error). If necessary/helpful, `OnOK` can reestablish control by opening a new handle. And the destructor must not try to close a handle it has lost control of.
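The suggested ownership rule could look roughly like this (a plain-JavaScript sketch; the real fix would live in the C++ `Execute`/`OnOK`/destructor code, and all names here - `FakeSdk`, `ownsHandle` - are invented):

```javascript
// Sketch: a client remembers whether it still controls its handle, so it
// never uses, or closes in its destructor, a handle it has already lost.
const FakeSdk = {
  next: 1,
  open() { return this.next++; },
  close(h) { /* release handle h */ },
};

class Client {
  constructor() {
    this.handle = FakeSdk.open();
    this.ownsHandle = true;
  }
  execute(callFails) {
    // Don't call anything if handle control has been lost.
    if (!this.ownsHandle) throw new Error("handle control lost");
    if (callFails) {
      // The SDK closed the handle on SysFailure/CommFailure: give up
      // ownership so neither execute() nor destroy() touches it again.
      this.ownsHandle = false;
      throw new Error("RFC_COMMUNICATION_FAILURE");
    }
    return "ok";
  }
  reopen() {
    // What OnOK might do to reestablish control: open a fresh handle.
    this.handle = FakeSdk.open();
    this.ownsHandle = true;
  }
  destroy() {
    // Close only a handle we still control; a lost handle may already
    // belong to another client.
    if (this.ownsHandle) {
      FakeSdk.close(this.handle);
      this.ownsHandle = false;
    }
  }
}
```

With this guard, a destructor running late cannot close a handle that has since been handed to another client.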
Hi @clausreinke,
thank you for the thoughtful analysis and the patience with this issue.
Based on your example, the test setup with express is in the `invalid_handle` folder and the issue is reproducible with the `run.sh` script.

The test is based on your example, with the following added:

1) The `node-rfc` dev version is used, for faster testing
2) The client http requests are simulated by curl/wget/...
3) The client http request number can optionally be passed to express and used in tracing Pool states during requests.
The test with, let's say, 20 requests creates 5 RFC trace files, indicating 5 threads consuming the Pool. The current implementation of the Pool can't handle that safely, because there is no lock/mutex to synchronise taking one connection from the pool and giving it back. That must be protected.
Inspecting RFC traces, I saw connection handles occurring after closing the connection and learned that handles are not necessarily unique during the session. The handle is just a pointer to a C++ object, and after a free/delete the C runtime can re-use the same address in a subsequent malloc/new. It happens very often (especially on Windows) that the ever smarter memory management system "notices" that a just-freed block of memory is the right size and re-uses that block, instead of allocating a new one and causing unnecessary fragmentation.
The unsynchronised consumption of connections normally fails in the following way: thread A opens connection x and does something with it. The call ends in SysFailure/CommFailure, so the connection breaks (and is deleted again). The application does not pay attention to the return code, tries to continue, and uses the actually broken connection. This normally results in INVALID_HANDLE.
In the current Pool implementation, the connection is closed when released back to the Pool, as an attempt to prevent re-using an eventually broken connection. But even that is not safe with unsynchronised threads. The close in release should definitely be removed, and connections in release closed only if the total > Pool.max.
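The proposed release policy (keep healthy connections open, close only broken ones or those that would exceed `Pool.max`) might be sketched like this; `Pool`, `release`, and the connection objects are simplified stand-ins, not node-rfc's actual implementation:

```javascript
// Sketch of the proposed release policy: a released connection is closed
// only when it is broken or when keeping it would exceed pool.max;
// otherwise it goes back to the ready list for reuse.
class Pool {
  constructor({ max = 10 } = {}) {
    this.max = max;
    this.ready = []; // idle, still-open connections
  }
  release(conn) {
    if (conn.broken || this.ready.length >= this.max) {
      conn.close(); // only now is the connection actually closed
    } else {
      this.ready.push(conn); // keep it open for the next acquire
    }
  }
}
```

Keeping connections open across release cycles also lowers the open/close churn that makes handle-address reuse likely in the first place.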
The traces of the current Pool implementation show something like: before thread A "uses" handle x, thread B comes along and also opens a connection. By chance, the C runtime takes address x again, so the handle is "valid" again. Now thread A continues, simply using the connection that thread B just opened for itself... (It took a while until this was debugged and understood...)

The lower the frequency with which connections are opened and closed again, the greater the statistical probability that this will happen.
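The address-reuse race described above can be made concrete with a toy allocator that, like the C runtime, hands out a just-freed "address" again (all names here are invented for the demonstration):

```javascript
// Toy demonstration of the handle-reuse race: the allocator re-uses the
// most recently freed "address", so a stale handle can look valid again.
const freed = [];
let nextAddr = 0x1000;
const alloc = () => (freed.length ? freed.pop() : nextAddr++);
const free = (addr) => freed.push(addr);
const live = new Set(); // connections the "SDK" considers open

// Thread A opens connection x ...
const x = alloc();
live.add(x);
// ... A's call fails, the SDK deletes the connection:
live.delete(x);
free(x);
// Thread B opens a new connection and, by chance, gets address x again:
const y = alloc();
live.add(y);

console.log(y === x);     // true: same "address" handed out again
console.log(live.has(x)); // true: A's stale handle now looks valid,
                          // but it actually refers to B's connection
```

So a validity check on A's old handle passes, and A silently operates on B's connection; this is exactly why the handle value alone cannot serve as an identity.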
I would therefore move the Pool to C, synchronised with the existing Client mutex, in the next couple of days, and re-test :)
I'm not sure I understand you correctly, in particular how moving the connection pool from js to c would help with the multiple threads. There should be only a single thread evaluating js callbacks; the other threads should be from a libuv thread pool, accessed via napi async workers. According to the docs, only the `Execute` method actually runs outside the main thread (the main thread being the event loop running js code), on one of the libuv worker threads.
The client semaphore only works for the connection as long as the latter is open - once the connection has been closed, be it by the client or by the backend or by the rfc library, the client must not use it again, semaphore or not; in particular, the close in the client destructor (while necessary) must be guarded, not by whether the connection is currently valid, but by whether it has already been closed since this client opened it.
Even if a client successfully closes its connection, it must not try to do so again in its destructor, because another client (using a different semaphore) may have opened the same connection in the meantime.
It took more time, but here is the re-designed version: the v2-pre branch.

Please feel free to test, review, and share feedback. The most important changes are in the changelog and in the documentation: direct-and-managed-clients

Version 2.0 is published and should fix the issue.
Just checking the status here. Any updates in the meantime, any issues left?

Let's close this one and open a new issue if needed.
Can someone provide me a mirror link for nwrfc750P ?
There are no mirrors; the SAP Support Portal is the only distribution channel for the SAP NW RFC SDK:
https://support.sap.com/en/product/connectors/nwrfcsdk.html
Information on where to download the 7.50 version of the SAP NetWeaver RFC SDK and information about currently supported platforms can be found in SAP Note 2573790.
As much as we like the library, we're having a persistent problem in using it, namely:

We cannot reliably reproduce the error and, as of yet, no pattern is apparent in what triggers the bug, but it happens more than once per day in light usage.

Lacking a reproducible example, we're looking for anything useful we could log to help narrow down the problem when it occurs (suggestions are welcome - currently, we log `client.id`, `client.isAlive` and `client.connectionInfo`) and we've found a first pattern. It does not cover all cases, but sufficiently many to be interesting: it appears that sometimes the `connectionInfo` goes missing in acquired connections (logged between `const client = await connectionPool.acquire()` and `client.invoke(..)`).

We have a global connection pool, constructed once, via `new Pool(sapConnectionParameters, { min: 1, max: -42 });` (the minimum of 1 avoids obscuring the issue with multiple connections), and neither the pool nor the `sapConnectionParameters` are changed after construction (a shallow copy is made of the latter to ensure this). For every SAP call, we acquire a connection/client, check that it and its `id` are defined, then call `client.invoke(..)` and -in its callback- we end with `connectionPool.release(client)` (in both success and failure cases).

Sometimes, between `const client = await connectionPool.acquire()` and `client.invoke(..)`, the `client.connectionInfo` logs as `{}`, followed by an `RFC_INVALID_HANDLE` error from the `.invoke`.

We're stumped as to how the `connectionInfo` could be empty at that point - does this give you any hint about what could be going wrong (perhaps a rogue destruction)?

You mentioned in https://github.com/SAP/node-rfc/issues/99#issuecomment-523369638 that

We don't use any threads ourselves and, since `nodejs` is single-threaded for the JavaScript part, the only threads in play should be the worker threads (`Napi::AsyncWorker`) used in the `node-rfc` implementation. Of course, multiple async calls can be interleaved.

OS: Windows 10
node: node-v12.14.1-win-x64
SAP NW RFC SDK: nwrfc750P_6-70002755
node-rfc: node-rfc@1.0.3