Large file transfers to/from HPSS tend to span multiple sets of HPSS mover processes. Each set is responsible for a large contiguous chunk of the file transfer. First set transfers offsets 0-N, second set transfers (N+1)-M, and so on. These mover sets are all initialized at the beginning of the transfer.
Any mover will timeout after MVR_CLIENT_TIMEOUT seconds (defaults to 15 minutes). If a mover set does not start the transfer within this timeout, the entire transfer aborts. This is an HPSS issue, not a DSI issue.
Symptoms looks like this:
2019-06-12 14:49:47
ENDPOINT_ERROR
Error (transfer) Endpoint: XXXX HPSS Archive (e38ee901-6d04-11e5-ba46-22000b92c6ec) Server: XXXX:2811 File: /~/scratch_backups/XXXX Command: STOR ~/scratch_backups/XXX Message: Fatal FTP response --- Details: 451-GlobusError: v=1 c=INTERNAL_ERROR\r\n451-GridFTP-Errno: 5011\r\n451-GridFTP-Reason: System error in hpss_PIOExecute\r\n451-GridFTP-Error-String: \r\n451 End.\r\n
2019-06-12 14:59:39
TIMEOUT
Error (transfer) Endpoint: XXXX HPSS Archive (e38ee901-6d04-11e5-ba46-22000b92c6ec) Server: XXXX:2811 Command: STOR ~/scratch_backups/XXXX Message: The operation timed out --- Details: Timeout waiting for response
2019-06-12 15:19:44
ENDPOINT_ERROR
Error (transfer) Endpoint: XXXX HPSS Archive (e38ee901-6d04-11e5-ba46-22000b92c6ec) Server: XXXX:2811 File: /~/scratch_backups/XXXX Command: STOR ~/scratch_backups/XXXX Message: Fatal FTP response --- Details: 451-GlobusError: v=1 c=INTERNAL_ERROR\r\n451-GridFTP-Errno: 5011\r\n451-GridFTP-Reason: System error in hpss_PIOExecute\r\n451-GridFTP-Error-String: \r\n451 End.\r\n
Endpoint error messages (resulting from the mover timeout) appear after 20 minutes. Followed by a TIMEOUT in 10 minutes (due to open() on the file hanging while resources are released from the mover timeout).
Large file transfers to/from HPSS tend to span multiple sets of HPSS mover processes. Each set is responsible for a large contiguous chunk of the file transfer. First set transfers offsets 0-N, second set transfers (N+1)-M, and so on. These mover sets are all initialized at the beginning of the transfer.
Any mover will timeout after MVR_CLIENT_TIMEOUT seconds (defaults to 15 minutes). If a mover set does not start the transfer within this timeout, the entire transfer aborts. This is an HPSS issue, not a DSI issue.
Symptoms looks like this:
Endpoint error messages (resulting from the mover timeout) appear after 20 minutes. Followed by a TIMEOUT in 10 minutes (due to open() on the file hanging while resources are released from the mover timeout).