HPSS: Support resumption of interrupted transfers

JasonAlt commented 5 years ago

For standard DSIs, when files are transferred (STOR) in extended block mode, the receiving end will send restart markers at points where (1) the data received safely resides on persistent media and (2) the underlying storage technology is capable of resuming writes at that offset. In the POSIX world, the receiving end could (hypothetically) send restart markers after every buffer. In the case of an error, the client could resume the transfer at the latest received restart marker.
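To make the resume point concrete, here is a minimal sketch of how a client might turn the markers it has received into one safe resume offset. It assumes markers arrive as byte ranges, as they do in extended block mode. `byte_range_t` and `contiguous_prefix()` are hypothetical names used only for illustration; they are not part of GridFTP or of this DSI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type: a half-open byte range [start, end) that the
 * receiver has reported as safely stored. */
typedef struct {
    uint64_t start;
    uint64_t end;
} byte_range_t;

/* Order ranges by their starting offset for qsort(). */
static int cmp_range(const void *a, const void *b)
{
    const byte_range_t *ra = a;
    const byte_range_t *rb = b;
    if (ra->start < rb->start) return -1;
    if (ra->start > rb->start) return 1;
    return 0;
}

/* Return the number of bytes, counted from offset 0, that are covered
 * without a gap. Sorts the ranges in place. Anything past the returned
 * offset must be treated as not yet stored. */
uint64_t contiguous_prefix(byte_range_t *ranges, size_t count)
{
    assert(ranges != NULL || count == 0);
    if (count == 0)
        return 0;

    qsort(ranges, count, sizeof(*ranges), cmp_range);

    uint64_t covered = 0;
    for (size_t i = 0; i < count; i++) {
        if (ranges[i].start > covered)
            break;                      /* gap: stop at the first hole */
        if (ranges[i].end > covered)
            covered = ranges[i].end;    /* extend the covered prefix */
    }
    return covered;
}
```

A client that can resend individual missing ranges does not need to collapse the markers this way. The contiguous prefix is simply the most conservative single offset to resume from.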

In HPSS, simply writing a buffer to HPSS with PIO is not sufficient to satisfy the two points above. The restart point within a transfer is determined by the complex inner workings of HPSS. The only reliable restart point is the one returned by hpss_PIOExecute(). However, HPSS releases before 7.5 have a bug (BZ4719, seen in 7.4.3) in which PIOExecute() returns the wrong value for bytesmoved on error:

We need a small fix to hpss_PIOExecute() in order to support transfer restarts on error while writing to HPSS. When hpss_PIOExecute() exits with an error, we need BytesMoved to be set. Currently, the value of bytes_moved (a local variable) is computed for both error and success, but it is not returned to the caller on error. If we had this value, GridFTP could send a restart marker and transfers would resume where they left off.

Currently, sites must disable REST in the GridFTP configuration files in order to avoid using a restart marker that could be erroneous. An erroneous marker could leave the received file with a 'gap'. If post-transfer checksums are not enabled, this could go undetected. With checksums enabled, the entire file would be transferred, the checksum computed, the corruption detected, and the whole transfer restarted. For large files, this is quite painful.

This work is part of globusonline/product-management#388

JasonAlt commented 5 years ago

Tracking requirements necessary for providing restart functionality:

Need to know the latest status of BZ4719. How can sites communicate the need for this fix to their HPSS support? How can sites determine whether the fix has been applied locally and is functioning correctly?
[OPTION] Force HPSS to give restart markers on large files by writing in smaller chunks.
Provide a sane fix for sites without the BZ4719 fix applied.
Add detection and logging of obviously erroneous bytesmoved from PIOExecute().
Add debug logging for validating PIOExecute() restart values.
Update documentation to support sites in the transition to enabling restarts.
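
As a rough illustration of the detection and logging items above, here is one possible sanity check. It is a sketch under assumptions, not the code from this DSI: `safe_restart_offset()` is a hypothetical helper, and the only condition treated as "obviously erroneous" is a reported count larger than the amount requested.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: decide which offset is safe to advertise as a
 * restart point after one write of requested_length bytes that began
 * at start_offset. If the reported bytes_moved is impossible (more
 * than was requested), log it and claim no progress at all.
 * log may be NULL to disable logging. */
uint64_t safe_restart_offset(uint64_t start_offset,
                             uint64_t requested_length,
                             uint64_t bytes_moved,
                             FILE *log)
{
    /* Precondition: the requested span must not overflow the offset. */
    assert(start_offset <= UINT64_MAX - requested_length);

    if (bytes_moved > requested_length) {
        if (log != NULL)
            fprintf(log,
                    "suspicious bytes moved: %" PRIu64
                    " > requested %" PRIu64 " at offset %" PRIu64 "\n",
                    bytes_moved, requested_length, start_offset);
        return start_offset;  /* fall back: do not advance the marker */
    }

    if (log != NULL)
        fprintf(log,
                "restart candidate: offset %" PRIu64
                " + moved %" PRIu64 "\n",
                start_offset, bytes_moved);
    return start_offset + bytes_moved;
}
```

Falling back to the starting offset costs some retransmission, but it never advertises data that may not be on persistent media, which is the failure that produces a 'gap'.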

JasonAlt commented 4 years ago

Fixed in PR #68

JasonAlt / GridFTP-DSI-for-HPSS

HPSS: Support resumption of interrupted transfers #44