JasonAlt / GridFTP-DSI-for-HPSS

GridFTP module that allows the Globus server to work with HPSS

Detect and document MVR_CLIENT_TIMEOUT #45

Closed JasonAlt closed 5 years ago

JasonAlt commented 5 years ago

Large file transfers to or from HPSS tend to span multiple sets of HPSS mover processes. Each set is responsible for one large contiguous chunk of the file: the first set transfers offsets 0 through N, the second set transfers (N+1) through M, and so on. All of these mover sets are initialized at the beginning of the transfer.

Any mover will time out after MVR_CLIENT_TIMEOUT seconds (the default is 15 minutes). If a mover set does not start its part of the transfer within this timeout, the entire transfer aborts. This is an HPSS issue, not a DSI issue.
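To make the failure mode concrete, here is a rough Python sketch that estimates which mover sets would still be idle when the timeout expires. It is not DSI or HPSS code. The chunk size, the transfer rate, and the function names `mover_set_ranges` and `sets_at_risk` are all hypothetical, and it assumes the sets are served strictly one after another at a steady rate, which is a simplification.

```python
# Default from the description above: 15 minutes, expressed in seconds.
MVR_CLIENT_TIMEOUT_SECONDS = 15 * 60


def mover_set_ranges(file_size, chunk_size):
    """Split a file into contiguous inclusive (start, end) offset ranges,
    one per mover set. chunk_size is a hypothetical per-set size."""
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, file_size) - 1)
        for start in range(0, file_size, chunk_size)
    ]


def sets_at_risk(file_size, chunk_size, bytes_per_second,
                 timeout=MVR_CLIENT_TIMEOUT_SECONDS):
    """Return the indexes of mover sets whose first byte would be reached
    only after `timeout` seconds, assuming a steady sequential transfer."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be > 0")
    at_risk = []
    for index, (start, _end) in enumerate(mover_set_ranges(file_size, chunk_size)):
        # All sets are initialized at time 0, so a set waits for as long
        # as it takes to move every byte that precedes its chunk.
        wait_seconds = start / bytes_per_second
        if wait_seconds > timeout:
            at_risk.append(index)
    return at_risk
```

For example, with these made-up numbers, `sets_at_risk(4 * 2**40, 2**40, 500 * 2**20)` flags every set after the first, because moving the first 1 TiB at 500 MiB/s takes well over 15 minutes.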

The symptoms look like this:

2019-06-12 14:49:47 ENDPOINT_ERROR Error (transfer) Endpoint: XXXX HPSS Archive (e38ee901-6d04-11e5-ba46-22000b92c6ec) Server: XXXX:2811 File: /~/scratch_backups/XXXX Command: STOR ~/scratch_backups/XXX Message: Fatal FTP response --- Details: 451-GlobusError: v=1 c=INTERNAL_ERROR\r\n451-GridFTP-Errno: 5011\r\n451-GridFTP-Reason: System error in hpss_PIOExecute\r\n451-GridFTP-Error-String: \r\n451 End.\r\n
2019-06-12 14:59:39 TIMEOUT Error (transfer) Endpoint: XXXX HPSS Archive (e38ee901-6d04-11e5-ba46-22000b92c6ec) Server: XXXX:2811 Command: STOR ~/scratch_backups/XXXX Message: The operation timed out --- Details: Timeout waiting for response
2019-06-12 15:19:44 ENDPOINT_ERROR Error (transfer) Endpoint: XXXX HPSS Archive (e38ee901-6d04-11e5-ba46-22000b92c6ec) Server: XXXX:2811 File: /~/scratch_backups/XXXX Command: STOR ~/scratch_backups/XXXX Message: Fatal FTP response --- Details: 451-GlobusError: v=1 c=INTERNAL_ERROR\r\n451-GridFTP-Errno: 5011\r\n451-GridFTP-Reason: System error in hpss_PIOExecute\r\n451-GridFTP-Error-String: \r\n451 End.\r\n

Endpoint error messages (resulting from the mover timeout) appear after 20 minutes, followed by a TIMEOUT about 10 minutes later (caused by open() on the file hanging while resources from the mover timeout are released).
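For anyone who wants to scan transfer event logs for this pattern, a minimal Python sketch follows. It only matches the strings visible in the log lines above (errno 5011, `hpss_PIOExecute`, and "Timeout waiting for response"). The function names and the event labels are hypothetical, and since the same errno may well show up for other reasons, treat a match as a hint rather than a diagnosis.

```python
import re
from datetime import datetime

# Leading "YYYY-MM-DD HH:MM:SS CODE " prefix, as seen in the log lines above.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) ")


def classify_line(line):
    """Return (timestamp, label) for lines that match the symptom, else None.
    The labels are hypothetical names chosen for this sketch."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    code = match.group(2)
    if (code == "ENDPOINT_ERROR"
            and "GridFTP-Errno: 5011" in line
            and "hpss_PIOExecute" in line):
        return when, "pio_execute_error"
    if code == "TIMEOUT" and "Timeout waiting for response" in line:
        return when, "control_timeout"
    return None


def looks_like_mover_timeout(lines):
    """Heuristic: a hpss_PIOExecute error with a TIMEOUT at or after it."""
    events = [event for event in map(classify_line, lines) if event is not None]
    first_error = next(
        (when for when, label in events if label == "pio_execute_error"), None
    )
    if first_error is None:
        return False
    return any(
        label == "control_timeout" and when >= first_error
        for when, label in events
    )
```

Passing the three log lines above to `looks_like_mover_timeout` returns True under this heuristic. A lone TIMEOUT or a lone errno 5011 line does not.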

JasonAlt commented 5 years ago

Updates pushed in latest PR.