Closed GabrieleManduchi closed 5 years ago
I was mistaken about the TreeLOCK_FAILURE. The error is happening in SendArg, as the message indicates. TreeLOCK_FAILURE would only be reported if the lock operation was successfully sent to the remote mdsip process, failed during file locking, and that status was successfully reported back to the client. The fact that it is reporting an MDSplusERROR points toward a problem sending the mdsip message to the server. It seems we might want to either add some error messages to SendBytes in SendMdsMsg.c or come up with more descriptive status codes based on the errno values returned by the socket calls. This could be caused by a network issue, but without seeing the errno value it would be difficult to diagnose. Possibly a simple perror in SendBytes prior to returning MDSplusERROR would help.
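To make the perror suggestion concrete, here is a minimal sketch of a send loop that logs the errno value before collapsing it into a generic error. It is not the actual SendBytes code: the names send_all_logged, SKETCH_OK and SKETCH_ERROR are hypothetical, and MSG_NOSIGNAL assumes Linux or another POSIX.1-2008 system.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical status values for this sketch; not the MDSplus codes. */
#define SKETCH_OK 0
#define SKETCH_ERROR (-1)

/* Send exactly len bytes on a blocking socket. On failure, log the cause
 * before returning the generic error, so the errno value is not lost. */
int send_all_logged(int sock, const char *buf, size_t len)
{
  assert(buf != NULL);
  while (len > 0)
  {
    /* MSG_NOSIGNAL: get EPIPE back instead of a SIGPIPE signal. */
    ssize_t n = send(sock, buf, len, MSG_NOSIGNAL);
    if (n < 0)
    {
      int saved = errno; /* fprintf may overwrite errno */
      if (saved == EINTR)
        continue; /* interrupted by a signal: retry */
      fprintf(stderr, "send_all_logged: %zu bytes left: %s (errno=%d)\n",
              len, strerror(saved), saved);
      errno = saved; /* let the caller inspect it as well */
      return SKETCH_ERROR;
    }
    if (n == 0)
    {
      fprintf(stderr, "send_all_logged: send returned 0 with %zu bytes left\n", len);
      return SKETCH_ERROR;
    }
    buf += n;
    len -= (size_t)n;
  }
  return SKETCH_OK;
}
```

With something like this in place, the client log would show the underlying cause (for example EPIPE or ECONNRESET) next to the generic status.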
Looking more closely at the mdstcpip/SendMdsMsg.c code, it checks whether bytes_sent is <= 0 and reports an error if so. In some cases a system can become overloaded, its TCP stack buffers fill, and the send() function can return 0 bytes sent. Perhaps the SendBytes routine should not treat 0 as an error but instead put in a small wait and continue trying.
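The wait-and-retry idea could look roughly like the sketch below. It is an illustration only, not the SendMdsMsg.c implementation: wait_writable and send_with_retry are hypothetical names, the wait uses poll() rather than a fixed sleep, and MSG_DONTWAIT and MSG_NOSIGNAL assume a Linux-like system.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Wait up to timeout_ms for sock to accept more data.
 * Returns 1 if writable, 0 on timeout, -1 on poll failure or socket error.
 * Note: an EINTR retry restarts the full timeout. */
int wait_writable(int sock, int timeout_ms)
{
  struct pollfd pfd = {.fd = sock, .events = POLLOUT, .revents = 0};
  int rc;
  do
    rc = poll(&pfd, 1, timeout_ms);
  while (rc < 0 && errno == EINTR);
  if (rc <= 0)
    return rc;
  if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
    return -1;
  return (pfd.revents & POLLOUT) ? 1 : -1;
}

/* Send len bytes. When no progress is possible (0 bytes sent or a full send
 * buffer), wait briefly and retry, giving up after max_waits idle waits. */
int send_with_retry(int sock, const char *buf, size_t len, int wait_ms, int max_waits)
{
  assert(buf != NULL);
  int waits = 0;
  while (len > 0)
  {
    ssize_t n = send(sock, buf, len, MSG_NOSIGNAL | MSG_DONTWAIT);
    if (n > 0)
    {
      buf += n;
      len -= (size_t)n;
      waits = 0; /* progress made: reset the retry budget */
      continue;
    }
    if (n < 0 && errno == EINTR)
      continue; /* interrupted by a signal: retry immediately */
    if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
      return -1; /* hard error such as EPIPE or ECONNRESET */
    if (++waits > max_waits || wait_writable(sock, wait_ms) < 0)
      return -1; /* still stuck after the allowed waits, or socket error */
  }
  return 0;
}
```

Bounding the number of waits keeps a dead connection from stalling the writer forever; the values for wait_ms and max_waits would need tuning for a real deployment.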
I am adding print messages for errors occurring during SendMdsMsg which will hopefully provide more information on the cause of these failures.
In fact, if send is blocking it will never return 0 unless the socket is closed. It can only return 0 if it ran into a timeout, which is not expected since io->send is blocking without a timeout, in contrast to io->send_to. If a send returns 0, this is an indicator of a disconnection.
Is this still an issue? Do we have a test case for it?
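The distinction drawn in this comment can be written down as a small decision function. This is only a sketch of one possible policy: the enum and classify_send are hypothetical names, and treating a 0 return for a non-empty request as a lost connection follows the reading given above rather than any documented MDSplus behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

typedef enum
{
  SEND_PROGRESS, /* some bytes were accepted: advance and continue */
  SEND_RETRY,    /* interrupted by a signal: call send again */
  SEND_WAIT,     /* timeout or full buffer: wait, then try again */
  SEND_FATAL     /* connection lost or other hard error */
} send_outcome;

/* Classify the result of one send() call.
 * n is the return value, requested the byte count passed to send(),
 * err the errno value captured right after the call. */
send_outcome classify_send(ssize_t n, size_t requested, int err)
{
  assert(n >= -1); /* send() returns -1 or a byte count */
  if (n > 0)
    return SEND_PROGRESS;
  if (n == 0) /* only harmless when nothing was requested */
    return requested == 0 ? SEND_PROGRESS : SEND_FATAL;
  if (err == EINTR)
    return SEND_RETRY;
  if (err == EAGAIN || err == EWOULDBLOCK)
    return SEND_WAIT; /* e.g. a send timeout or a non-blocking socket */
  return SEND_FATAL;
}
```

Keeping the policy in one pure function makes it easy to unit test without a live socket.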
This is what I have fixed with PR 1715.
That PR is, however, still pending, waiting for the tests to pass (they fail randomly, so I suspect that there is something wrong with the test procedure itself).
As soon as the PR passes the tests and is merged, I will close this issue.
Sometimes (rarely) I get the following message in a distributed client configuration:
[mdsplus@scpsl ~]$ Error in SendArg: mode = 1, status = 65554
Error in SendArg: mode = 6, status = 65554
Error in SendArg: mode = 6, status = 65554
Error writing device data: DatabaseException: %TREE-E-FAILURE, Operation NOT successful null
This happens when writing values in a distributed client configuration. Both the client and the mdsip server run the alpha version.
Unfortunately I am not able to reproduce it with a simple example, and it occurs only rarely.
As Tom observed, the status is strange. It looks like an error that occurs when performing a remote lock, but the error code is a generic MDSPLUS-ERROR(65554) and not TreeLOCK_FAILURE