MDSplus / mdsplus

The MDSplus data management system
https://mdsplus.org/

Occasional errors in put operation in distributed client configuration #1675

Closed GabrieleManduchi closed 5 years ago

GabrieleManduchi commented 5 years ago

Sometimes (rarely) I get the following messages in the distributed client configuration:

[mdsplus@scpsl ~]$ Error in SendArg: mode = 1, status = 65554
Error in SendArg: mode = 6, status = 65554
Error in SendArg: mode = 6, status = 65554
Error writing device data: DatabaseException: %TREE-E-FAILURE, Operation NOT successful null

This happens when writing values in the distributed client configuration. Both the client and the mdsip server run the alpha version.
Unfortunately, I am not able to reproduce it with a simple example, and it occurs only rarely.

As Tom observed, the status is strange. It looks like an error that occurs when performing a remote lock, but the error code is a generic MDSPLUS-ERROR(65554) rather than TreeLOCK_FAILURE.

tfredian commented 5 years ago

I was mistaken about the TreeLOCK_FAILURE. The error is happening in SendArg, as the message indicates. A TreeLOCK_FAILURE would only occur if the lock operation was successfully sent to the remote mdsip process, failed there during file locking, and that status was successfully reported back to the client. The fact that it is reporting an MDSplusERROR points toward a problem sending the mdsip message to the server. It seems we might want to either add some error messages to SendBytes in SendMdsMsg.c or come up with more descriptive status codes based on the errno values returned by the socket communications. This could be caused by a network issue, but without seeing the errno value it would be difficult to diagnose. Possibly a simple perror in SendBytes prior to returning MDSplusERROR would help.
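To illustrate the perror idea, here is a minimal sketch of a send loop that reports errno before collapsing the cause into a generic error status. It is not the actual SendBytes code: the function name `send_all_logged` and the `SKETCH_OK` / `SKETCH_ERROR` values are hypothetical and are not MDSplus status codes.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

#ifndef MSG_NOSIGNAL
#define MSG_NOSIGNAL 0 /* not available on every platform */
#endif

/* Hypothetical status values for this sketch; not MDSplus status codes. */
enum { SKETCH_ERROR = 0, SKETCH_OK = 1 };

/* Send exactly len bytes on sock. On failure, print the errno text
 * before returning the generic error, so the cause is not lost. */
int send_all_logged(int sock, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = send(sock, buf, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal: retry */
            perror("send_all_logged: send");
            return SKETCH_ERROR;
        }
        if (n == 0) {
            /* errno is not meaningful here, so say so explicitly */
            fprintf(stderr, "send_all_logged: send returned 0 bytes\n");
            return SKETCH_ERROR;
        }
        buf += n;
        len -= (size_t)n;
    }
    return SKETCH_OK;
}
```

With something like this in place, a failing put would leave a line such as "send: Connection reset by peer" or "send: Broken pipe" on stderr next to the generic status, which is the information missing from the report above.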

tfredian commented 5 years ago

Looking more closely at the mdstcpip/SendMdsMsg.c code, it checks whether bytes_sent is <= 0 and reports an error if so. In some cases a system can become overloaded, its TCP stack buffers fill, and the send() function can return 0 bytes sent. Perhaps the SendBytes routine should not treat 0 as an error but instead wait briefly and keep trying.
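The wait-and-retry idea could look roughly like the sketch below. It is only an illustration of the proposal, not the SendBytes implementation, and whether send() can really return 0 in this situation is questioned later in the thread. All names (`send_with_retry`, `send_fn`, `MAX_ZERO_RETRIES`, `stub_send`) and the 10 ms pause are assumptions made for the example. The sender is passed in as a function pointer so the loop can be exercised without a real socket.

```c
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* Same shape as POSIX send(), so send itself can be passed in. */
typedef ssize_t (*send_fn)(int sock, const void *buf, size_t len, int flags);

enum { MAX_ZERO_RETRIES = 10 }; /* arbitrary bound for this sketch */

/* Returns 1 when all bytes were sent, 0 on error or when the sender
 * keeps returning 0 for more than MAX_ZERO_RETRIES attempts in a row. */
int send_with_retry(send_fn do_send, int sock, const char *buf, size_t len)
{
    int zero_retries = 0;
    while (len > 0) {
        ssize_t n = do_send(sock, buf, len, 0);
        if (n < 0)
            return 0; /* hard error: give up immediately */
        if (n == 0) {
            if (++zero_retries > MAX_ZERO_RETRIES)
                return 0; /* still nothing sent: treat as failure */
            struct timespec pause = {0, 10 * 1000 * 1000}; /* 10 ms */
            nanosleep(&pause, NULL);
            continue;
        }
        zero_retries = 0; /* progress was made: reset the counter */
        buf += n;
        len -= (size_t)n;
    }
    return 1;
}

/* Stub sender for trying out the loop: returns 0 while zeros_left is
 * positive, then pretends to accept everything it is given. */
int zeros_left = 0;

ssize_t stub_send(int sock, const void *buf, size_t len, int flags)
{
    (void)sock; (void)buf; (void)flags;
    if (zeros_left > 0) {
        zeros_left--;
        return 0;
    }
    return (ssize_t)len;
}
```

Bounding the retries keeps a genuinely dead connection from turning into an endless loop, at the cost of a short delay before the error is reported.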

tfredian commented 5 years ago

I am adding print messages for errors occurring during SendMdsMsg, which will hopefully provide more information on the cause of these failures.

zack-vii commented 5 years ago

In fact, if send is blocking, it will never return 0 unless the socket is closed. It can only return 0 if it ran into a timeout, which is not expected here since io->send is blocking without a timeout, in contrast to io->send_to. If a send returns 0, this is an indicator of a disconnection.
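For comparison, the sketch below shows how the outcomes of a plain POSIX send() call can be told apart. It describes the raw system call only; the io->send and io->send_to wrappers discussed above may follow different conventions. The enum and the function name `classify_send` are hypothetical.

```c
#include <errno.h>
#include <sys/types.h>

typedef enum {
    SEND_PROGRESS,     /* n > 0: some bytes were accepted */
    SEND_ZERO,         /* n == 0: nothing sent, no errno set */
    SEND_WOULD_BLOCK,  /* non-blocking socket full, or send timeout hit */
    SEND_INTERRUPTED,  /* interrupted by a signal before sending */
    SEND_DISCONNECTED, /* peer closed or reset the connection */
    SEND_OTHER_ERROR
} send_outcome;

/* n is the return value of send(); err is errno read right after it. */
send_outcome classify_send(ssize_t n, int err)
{
    if (n > 0)
        return SEND_PROGRESS;
    if (n == 0)
        return SEND_ZERO;
    if (err == EAGAIN || err == EWOULDBLOCK)
        return SEND_WOULD_BLOCK;
    if (err == EINTR)
        return SEND_INTERRUPTED;
    if (err == EPIPE || err == ECONNRESET)
        return SEND_DISCONNECTED;
    return SEND_OTHER_ERROR;
}

/* Usage: call send() first, then read errno in a separate statement,
 * because C does not fix the evaluation order of function arguments:
 *
 *     ssize_t n = send(sock, buf, len, 0);
 *     send_outcome what = classify_send(n, errno);
 */
```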

zack-vii commented 5 years ago

Is this still an issue? Do we have a test case for it?

GabrieleManduchi commented 5 years ago

This is what I have fixed with PR 1715.

That PR is, however, still pending, waiting for the tests to pass (they fail randomly, so I suspect that there is something wrong with the procedure itself).

As soon as the PR passes the tests and is merged, I will close this issue.