eslavich opened this issue 4 years ago
One of the use cases for streaming writes is something that is generating data (time series, for example) whose length is not known ahead of time, where it is not possible to go back and insert that information at the beginning of the file (the output may be going over a network pipe), and where the end of the data may be determined by outside events (the user hits the stop button, the battery dies, etc.)
The concern is that such a file may be confused with an accidentally terminated one, though I suppose if we expect a terminating index and it isn't there, that could be used to indicate an error condition (the data up to that point is presumed good, but incomplete).
In supporting archival uses, perhaps we should require an update to the file by some software on the archive end to make it more robust?
Just a thought: what about adopting @eslavich's changes, but also writing the block length at the end of the block for the special case of streaming blocks (sketched below)? A special value in the block header (like -1 in the length field) could flag that the block stores its length at the end.
The benefit of this is that one could write a streaming block of unknown size over, e.g. the network, while still having the reader-friendly capability of detecting truncations in the future (which would probably show up as a garbage size).
One drawback is the complexity of introducing a potential new location for metadata. Another is that raw concatenation of new binary data onto the end of the ASDF file would no longer be supported, but at least this would allow some form of streaming writes.
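To make the idea concrete, here is a minimal sketch in Python. The 4-byte block magic matches the ASDF standard, but the rest of the header layout (a single 8-byte big-endian length field), the sentinel value, and both function names are simplifications invented for illustration, not the actual ASDF block format:

```python
import struct

BLOCK_MAGIC = b"\xd3BLK"               # ASDF block magic
TRAILING_LENGTH = 0xFFFFFFFFFFFFFFFF   # hypothetical sentinel: -1 as unsigned

def write_streaming_block(fd, chunks):
    """Write a block whose size is unknown up front, appending the
    length after the data so future readers can detect truncation."""
    fd.write(BLOCK_MAGIC)
    fd.write(struct.pack(">Q", TRAILING_LENGTH))  # flag value in the header
    total = 0
    for chunk in chunks:                # data arrives incrementally
        fd.write(chunk)
        total += len(chunk)
    fd.write(struct.pack(">Q", total))  # trailing length, written last

def read_streaming_block(fd, file_size):
    """Read a trailing-length block (assumed to be last in the file)."""
    header_start = fd.tell()
    assert fd.read(4) == BLOCK_MAGIC
    (length,) = struct.unpack(">Q", fd.read(8))
    assert length == TRAILING_LENGTH    # header says: length is at the end
    data = fd.read(file_size - header_start - 4 - 8 - 8)
    (trailing,) = struct.unpack(">Q", fd.read(8))
    if trailing != len(data):           # garbage size implies truncation
        raise ValueError("block appears truncated")
    return data
```

Because the trailing length has a fixed width, a reader could also locate it by seeking 8 bytes back from the end of the file, with no backwards scanning required.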
The ASDF standard seems optimized for the convenience of file writers in that:

1. The block index is located at the end of the file.
2. A block may be flagged as streamed, in which case it extends to the end of the file and its length is not recorded anywhere.
The first feature is intended to facilitate streaming writes, since the size of a compressed block may not be known up front, and writing the block index at the end of the file allows it to be streamed out once all the block sizes are known. This is convenient for writers that would otherwise have to go back and overwrite an earlier part of the file if the block index were located elsewhere. The downside is that readers need to consume the file backwards if they want to read the block index early, or skip along the blocks, header to header, to get to the one they want.
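For illustration, that "skipping along" might look like the following sketch, which assumes a toy header of just the 4-byte magic plus an 8-byte allocated-size field (the real ASDF header carries more fields than this) and a hypothetical function name:

```python
import struct

def block_offsets(fd, first_block_offset, block_count):
    """Walk the blocks header to header, collecting each block's offset."""
    offsets = []
    offset = first_block_offset
    for _ in range(block_count):
        offsets.append(offset)
        fd.seek(offset + 4)                 # skip the block magic
        (allocated,) = struct.unpack(">Q", fd.read(8))
        offset += 4 + 8 + allocated         # jump past header + data
    return offsets
```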
The second feature allows writers to stream output when the length of the binary block is not known ahead of time. This seems downright dangerous for readers, who won't be able to detect accidental file truncation.
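The danger is easy to see in a sketch: the only way to read a streamed block is to read to end-of-file, so any truncated prefix of the data is returned without complaint:

```python
def read_streamed_block(fd, data_start):
    """Read a streamed block: everything from data_start to EOF.

    If the file was cut off mid-write, this silently returns the
    truncated data -- there is no recorded length to check against.
    """
    fd.seek(data_start)
    return fd.read()
```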
Is this the right balance of compromises? For an archival format, it may not be desirable to prioritize the convenience of writers over the convenience and safety (data-integrity-wise) of readers. I think we should consider the following changes:

1. Move the block index to immediately after the YAML document, before the first block.
2. Require every block, including the last, to record its length in its header (i.e., remove support for streamed blocks of unknown length).
The negative consequence of these changes is that ASDF file writers that do not know the lengths of their blocks ahead of time (due to compression or other reasons) would have to rewind and overwrite the block index and the block header length field after the block data was written. This would necessitate writing to a storage medium that supports seeking backwards, and would, for example, prevent streaming the file to a cloud storage service without first writing it temporarily to memory or disk.
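A sketch of the rewind-and-patch pattern this would force on such writers, again using a simplified header (4-byte magic plus an 8-byte length) and a hypothetical function name:

```python
import struct

def write_block_with_patched_length(fd, chunks):
    """Write a block of initially unknown size, then seek back and
    patch the real length into the header. Requires a seekable file."""
    fd.write(b"\xd3BLK")                # block magic
    length_field = fd.tell()
    fd.write(struct.pack(">Q", 0))      # placeholder length
    data_start = fd.tell()
    for chunk in chunks:
        fd.write(chunk)
    data_end = fd.tell()
    fd.seek(length_field)
    fd.write(struct.pack(">Q", data_end - data_start))  # patch it in
    fd.seek(data_end)                   # restore position for what follows
```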
The benefit is that file readers would be able to consume the YAML document and block index and then know exactly what byte offset to seek forward to in order to begin reading a given binary block. There would be no question as to whether the final block had been truncated, because every block would include its own length in its header.
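With an up-front index, reading block n collapses to a single forward seek, as in this sketch (same toy header as above; index is assumed to be a list of byte offsets):

```python
import struct

def read_block(fd, index, n):
    """Seek straight to block n via an up-front index of byte offsets."""
    fd.seek(index[n] + 4)               # skip the block magic
    (length,) = struct.unpack(">Q", fd.read(8))
    return fd.read(length)              # a short read here means truncation
```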
@perrygreenfield