OutpostUniverse / OP2Utility

C++ library for working with Outpost 2 related files and tasks.
MIT License
4 stars 0 forks source link

Stream Slicing #72

Open DanRStevens opened 6 years ago

DanRStevens commented 6 years ago

I wanted to start a discussion on stream slicing. In particular, I wanted to go over why stream slicing could be helpful, as that has a major impact on how it is designed, implemented, and used.

The major benefit of slicing, is isolation. It's an API enforced boundary to data access.

I think the clearest example of why stream slicing would be useful is to consider a compressed ZIP/RAR/7Z file with various types of files. Each file type may have its own stream parser to handle the data, and those stream parsers may be written later than the ZIP parser, or be unknown to it.

Some of the file parsers may be a bit buggy or ill behaved. In particular, some of them may have been written to only consider the case of reading a single file from disk, and processing the entire contents. There may be no header data to indicate the data length, or the header data is ignored. This might actually be typical behaviour for some movie file parsers, particularly ones that play past the stated end of the file. These parsers might rely on hitting the end of stream to know it is done processing. Others might try to read a file from disk, such as a graphics file format, but the parser is old and can't handle a newer extension to the file format, and so aborts with an error before reading all of the data. Any data unexpectedly left in the stream might have unintended consequences for future stream processing.

There might also be security concerns, where some internal files contain sensitive info, perhaps passwords, and you want to protect that data from unknown parsers that shouldn't have access to it. Perhaps one buggy or ill behaved parser reads past the end of its data, and fails on the contents at the start of the next file, containing passwords, and those passwords end up getting dumped to a publicly readable log file detailing the parse error.

This is why you want isolation. It puts a limit on the damage that bad parsers can do. It prevents problems processing one internal file, from spreading to the processing of other internal files.

Slicing is able to enforce this isolation without doing wasteful intermediate steps, such as extracting to a temporary file and then running the parser on that. It sidesteps the issue of cleaning up temporary files, particularly after bugs, crashes, or power loss.

Side note: Slicing maps nicely over the concept of solid archives. In a solid archive, it's as if all files are concatenated into one big stream before being compressed, with only the index containing information to split that data apart into individual files. In such an archive, the compressed form of a file might not even end on a byte boundary. The compression scheme might use a variable number of bits for various codes, and end 3 bits into a byte, with the remaining 5 bits being part of the compressed stream for the next file. In this case, one solid stream is fed into a decompressor, and the output stream is sliced into files.

Brett208 commented 6 years ago

I don't have a lot of experience here. Maybe one way to support this better is to create a function on MemoryStreamReader that creates a new MemoryStreamReader from an existing one with more restrictive boundaries? We could also create a function of FileStreamReader that returns a FileSliceReader in a similar fashion?

We are using the Lzh compression in VolFile. Maybe that is a place to start experimenting with how we want the stream library to interact with some sort of compression algorithm?

void VolFile::ExtractFileLzh(std::size_t fileIndex, const std::string& pathOut)

Brett208 commented 5 years ago

Do we want to close this issues since we have the ability to slice both a memory and a file stream? We can even re-slice an existing FileSliceReader.

DanRStevens commented 5 years ago

We do indeed have a working usable slice implementation now.

I kind of feel there are still some design considerations that were never discussed. Though perhaps some of those considerations are more generically related to the stream design, rather than just the slicing aspect of it. In terms of slicing though, I feel there is more that can be done with the design.


One thing I'd like to support, is having a generic way to slice any stream, even if it doesn't provide it's own slice implementation. In particular, the FileSliceReader class has very little of its implementation that depends on the specifics of FileStreamReader. The actual dependencies are the private member variable type, and the constructor. If this class was a template for the internal stream type, or if it simply took a reference to an existing stream in the constructor, it could potentially wrap any stream type into a slice.

A possible template solution:

template <class InternalStream>
class SlicedStream {
  SlicedStream(uint64_t startingOffset, uint64_t sliceLength, /* Forwarded args */) :
    startingOffset(startingOffset),
    sliceLength(sliceLength),
    internalStream(/* Forwarded args */)
  {}

private:
  InternalStream internalStream;
};

A possibe interface reference solution:

class SlicedStream {
  SlicedStream(uint64_t startingOffset, uint64_t sliceLength, Stream& internalStream) :
    startingOffset(startingOffset),
    sliceLength(sliceLength),
    internalStream(internalStream)
  {}

private:
  Stream& internalStream;
};

The first has clear semantics concerning ownership of the internal stream. Since it uses containment, when the slice is destroyed, the constructed internal stream is also destroyed. A downside is it requires constructing a new internal stream to take a slice, which might not always be appropriate. How do you construct an object to represent the 3rd file in a solid .7z archive? This might be easier to accomplish if you had access to the internal stream to advance the state, or otherwise manipulate it into the correct starting state.

The second by-reference wrapping solution gives greater access and control of the internal stream, allow arbitrary setup before the slice object is created. However, it has less clear semantics concerning ownership. It's probably unexpected to call delete on the reference, and so the internal stream wouldn't be automatically cleaned up when the slice is destroyed. In practice though, it's likely a slice object would just be a local variable wrapping another local variable, so everything would work just fine. However, if you tried to new an object, or return one from a function by pointer or reference, the ownership of the internal stream could become an issue.


The above ownership issue also affects the MemoryStreamReader class. It does not own the internal buffer, but rather just borrows access to a pre-existing buffer by taking a pointer to it in the constructor. There have been a few cases in development where it might have been easier and quicker to get to a working solution (though sometimes sub-optimal) if we had a memory buffer stream that owned its own buffer, rather than being a simple view into it. In particular, returning a newly allocated MemoryStreamReader object with a newly allocated buffer can be a problem, as the buffer will be leaked memory.

StreamReader* GetStream() {
  char* buffer = new char[1024]; // Who will clean this up?
  Initialize(buffer);
  auto stream = new MemoryStreamReader(buffer, 1024);  // Not me, I'm just a simple view. My destructor is empty!
  //delete[] buffer; // Clearly this would be a bad idea. Last chance though
  return stream; 
}

The borrowed view also has benefits, in particular, it makes it dead simple to create a slice of the stream, since each slice is still using the same borrowed buffer, with no copying of buffer data required. More generally, it allows for easy re-use of the same buffer of data.

char buffer[1024]; // Local variable, automatically cleaned up on function exit
Initialize(buffer);

// Process data
MemoryStreamReader stream1(buffer, 1024); // Borrow buffer, creating view into it. Object is local variable
Header header = ParseHeader(stream1);
auto someData = ProcessChunk1(stream1.slice(header.chunk1Size)); // Easy slicing, without buffer copy

// Re-process data
MemoryStreamReader stream2(buffer, 1024); // Crate another view, again no buffer copy. Object is local variable
auto checksum = ChecksumStream(stream2);

The view offers higher performance solutions, while a class which owns a buffer might provide more convenience in some situations. In particular, views are often a better choice when passing data in to functions, while buffer ownership is often more convenient when returning data from functions.

For this reason, I've been tempted to rename the MemoryStreamReader to something like MemoryViewStreamReader, and adding an alternate MemoryBufferStreamReader. I'm open to better suggestions on the names. Also open to discussion on if this is useful enough to implement.

Note: A slice of a MemoryBufferStreamReader could be a MemoryViewStreamReader to avoid copying. We could also support a .Copy() operatoin, which for either memory stream class could return a MemoryBufferStreamReader.

DanRStevens commented 5 years ago

I wanted to formalize some additional thoughts on this.

First, there is the concept of Random Access Memory (RAM) objects. I think this goes beyond simply being able to seek and read or write. I think a proper RAM object combines seeking and reading into a single atomic operation. There is no seeking which is independent of a read or write.

Memory and disk both allow for random access. At least, that's true of disk access at lower layers within the OS. The OS might present both a sequential streaming read API (commonly used), as well as a concurrent random access model (higher performance). For example, in Windows, there is both a ReadFile and a ReadFileEx. The later makes use of an OVERLAPPED structure, which contains an offset. By using the later function, with an offset, there is no distinction between seeking and reading, its all one operation. This is similar to memcpy with an array of memory, which takes both a position and a length.


If you want to build a seekable stream, a fairly natural way is to combine a RAM object, with a stream position.

The MemoryStreamReader class is an example of combining a RAM object (the array buffer), with a stream position.

Along with this natural construction for a seekable stream, you also get really easy slicing. If you want a sliced stream you combine the RAM object of the parent stream with a new stream position.

In a previous reply above, I discussed ownership semantics. That is relevant here. In the case of the proposed MemoryViewStreamReader, the RAM object is shared between the parent and the slice. The lifetime of the slice is limited by the lifetime of the parent, since otherwise the referenced RAM object which they share would become a dangling pointer. For the proposed MemoryBufferStreamReader, the RAM object is copied from the parent, producing a new RAM object in the slice. The slice is then fully independent of the parent, and can have its own lifetime.


A RAM model plays well with concurrency. A stream model does not.

With a RAM model, there is no shared mutable state between separate read/write operations. In particular, there is no shared position. Each seek+read/write is atomic, and multiple concurrent calls will not interfere with each other. This is true even if multiple calls to the same object are made on separate threads.

For a stream, the stream position is a shared mutable state. The seek and the read/write are separate non-atomic operations, which both update shared state. Further, considering each read/write updates the stream position at the end, even a chain of read/write operations can be considered to have implicit embedded seeks in it. This makes the operations order dependent. If you were to interleave calls to the same object, you could be separating a read/write from a previous seek. For single threaded code, this could be done carefully, such that a seek is always done after an interleaved operation. For multithreaded code, there is no hope other than to throw synchonization primitives around any seek + chain of read/write operations.


A less natural way of producing a seekable stream is to use a data source which is already stream-like, and provides a seek operation which is independent from the read/write. This is what happened with FileStreamReader. The underlying ifstream object is stream-like, with independent seek and read. Because such sources have their own position, you likely delegate seeks to the underlying data source.

When slicing a stream object with a stream-like data source, you'll probably need to copy the underlying data source. This is why FileSliceReader had to duplicate/reopen the underlying ifstream object. If you instead try to slice using a shared data source, you run into challenges with concurrency. This comes from the parent and the slice sharing access to the same mutable state, the stream position. Interleaving calls on the parent and the slice, supposedly independent streams, could break apart paired seek + read/write operations (including chains of read/write operations) on the underlying stream-like data source.

An alternative design is to duplicate the stream position in the stream class, and seek before each read/write operation. Such a design is much less efficient, likely producing many more seek calls than are necessary, though it does offer some protection when slicing with a shared underlying data source. With such a design, you would be protected from concurrency problems when both the parent and the slice access the underlying data source within a single thread. For multithreaded code, you'd still need sychronization primitives to ensure the seek + read/write operation on the underlying stream are atomic. This could be done within the read/write operation of the stream class, so there are some conveniences gained there. However, synchronization primitives add overhead, and would slow down even the single threaded case.

In short, you probably still want to stick to copying the underlying data source if it is at all stream-like.


So that's the theory. I'm still working on what it should mean in practice.

For the MemoryStreamReader class, the buffer portion could be extracted to a RAM type Memory class.

I believe it would be possible to use templates to build seekable streams from a RAM object, including slice support. Hence the MemoryStreamReader and MemoryStreamWriter classes could become one-line instantiations of such a template. The same could be done for files if a RAM type file object existed.

It seems possible to write a RAM type object for File access. However, I'm not aware of any C/C++ standard library methods that allow for concurrent file access. I know such functions exist on Windows (overlapped I/O), and I've just found some stuff on Linux (aio = asynchronous I/O), though I really don't care to program at that level. It's too much effort to go that low, and it limits the platforms the code will compile on. Another option is to fake it with a wrapper class that presents a RAM like interface. That falls prey to some of the concurrency issues mentioned though.

Templates could also be used for non-RAM type data sources with copy constructors to implement streams with slicing. For non-RAM type data sources without copy constructors, perhaps it would still be possible using a bit of class specific customization. We should probably stay away from implementing non-RAM type data source classes that try to slice without copying.

Brett208 commented 5 years ago

It seems like the strongest appeal for the RAM type reader/writer would be allowing access to the same reader/writer by multiple threads simultaneously? This seems interesting, but beyond my experience. I have occasionally coded with multiple threads, but I would always limit reading/writing data to a single thread. For example, the primary UI thread would update a status bar while a background thread would load a file.

Are there other use cases beyond multi-threading?

I don't feel like I can contribute much to this part of the codebase due to lack of experience and knowledge on the subject. Would still be happy to read over any developed code though to look for errors and ease of readability.

DanRStevens commented 5 years ago

Yes, there are benefits beyond multi-threading. This relates to the coupling of stream position between parent and slice, and allows us to easily keep them independent. In particular, if you slice a stream, pass the slice to a function for processing, and then resume processing on the parent stream afterwards in the parent function, there should be no odd interactions between the stream positions of the slice and the parent, even though operations were interleaved between the two.