OutpostUniverse / OP2Utility

C++ library for working with Outpost 2 related files and tasks.
MIT License

Allow reading a null terminated string from stream when length is unknown #311

Closed by Brett208 4 years ago

Brett208 commented 4 years ago

Sometimes strings are stored without their length being recorded. Such strings may either be null terminated or, when they fill the maximum allowed length, stored without a null terminator.

An example would be the SectionTable::Name field from the PE format.

An 8-byte, null-padded UTF-8 encoded string. If the string is exactly 8 characters long, there is no terminating null.

I propose adding something similar to the code below to stream::reader:

// Reads a string until a null terminator is encountered or maxCount is reached
std::string ReadNullTerminatedString(std::size_t maxCount = SIZE_MAX)  // SIZE_MAX from <cstdint>
{
    std::string str;

    char c;
    for (std::size_t i = 0; i < maxCount; ++i)
    {
        stream->Read(c);  // read a single character from the underlying stream
        if (c == '\0') {
            break;  // stop at the terminator; it is not stored in the result
        }

        str.push_back(c);
    }

    return str;
}
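
For illustration, a hypothetical call site could look like the following (the reader variable and the 256-character cap are placeholders, not part of any format):

// Hypothetical usage: read an exported symbol name of unknown length,
// giving up after 256 characters as a safety limit
std::string exportName = reader.ReadNullTerminatedString(256);
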
DanRStevens commented 4 years ago

This may have merit in its own right, however...

I would argue that such a method would not be needed for the example of the section table name in PE files. The documentation suggests the field is a fixed 8-byte field. If the string is shorter than 8 bytes, it is padded with nulls up to 8 bytes. As such, it can simply be read as an 8-byte field. When converting to an std::string or similar, the result can be truncated at the first null byte.
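
A rough sketch of that approach (ReadSectionName is a hypothetical helper, and the Read overload that fills a fixed-size buffer is assumed):

#include <array>
#include <string>

// Sketch: read SectionTable::Name as a fixed 8-byte field and truncate at the
// first null byte. Reading all 8 bytes also leaves the stream positioned just
// after the padded field, ready for the next member.
template<typename Reader>
std::string ReadSectionName(Reader& reader)
{
    std::array<char, 8> rawName;
    reader.Read(rawName);  // assumed: fills the whole fixed-size buffer

    std::string name(rawName.data(), rawName.size());
    const auto nullPos = name.find('\0');
    if (nullPos != std::string::npos) {
        name.resize(nullPos);  // drop the null padding
    }
    return name;
}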

In terms of efficiency, for such a small maximum size, it would likely be best to read all the data into a buffer first and then scan for any nulls. Making a virtual call to Read for each character, which may then need to delegate to the OS, would not be optimal.

When reading fixed-length fields which are padded up to their maximum length, you'll probably want a method that advances the file pointer by a fixed amount, equal to the padded size. That would make it easier and safer to continue reading any following fields. Without such built-in behaviour, the caller would need to seek over the padding, which varies in size based on the string length. Failing to do so and then trying to read subsequent fields would result in corrupt data.

Now, that's not to say there is no merit in adding support for reading null terminated strings of unknown length. There obviously is. Just that in the case of the example given here, the length (of the read) is actually known.

Brett208 commented 4 years ago

Correct, for SectionTable::Name I just loaded the 8 chars into memory. This code is lifted from loading the export string table from the PE file into memory, which may be a more agreeable application to you.

It sounds like you support the idea, so I may take the time to push it into OP2Utility and pull it out of MissionScanner.

-Brett

DanRStevens commented 4 years ago

Ahh, I see now. The two uses of your new method were for other purposes. One for the name table, and one for the values of data exports. For the name table, the table as a whole is of a known size, so could have been read into memory as one chunk, and then parsed into strings. The data exports however are pretty free format, with no way to predict their length.
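
A rough sketch of that chunked parse, assuming a raw-buffer Read overload on the reader (names here are illustrative):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sketch: read a name table of known total size in one chunk, then split the
// buffer into individual strings at the null terminators.
template<typename Reader>
std::vector<std::string> ReadNameTable(Reader& reader, std::size_t tableSize)
{
    std::vector<char> buffer(tableSize);
    reader.Read(buffer.data(), buffer.size());  // assumed raw-buffer overload

    std::vector<std::string> names;
    auto it = buffer.begin();
    while (it != buffer.end()) {
        const auto terminator = std::find(it, buffer.end(), '\0');
        if (it != terminator) {
            names.emplace_back(it, terminator);  // exclude the terminator itself
        }
        it = (terminator == buffer.end()) ? terminator : terminator + 1;
    }
    return names;
}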

The one note I have here is that reading character by character is not particularly efficient. If the stream is seekable, you can potentially fill a small buffer in stages until a null byte is found, and then seek to adjust the stream position to be just after the null byte. If the stream is not seekable, you can't really avoid the character-by-character inefficiency, unless there is a way to peek at the unread data. That is sometimes possible for buffered streams.
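
A rough sketch of that seekable variant; ReadPartial (returning how many bytes were actually read) and SeekBackward are assumed names for the underlying operations:

#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Sketch: read in chunks, stop at the first null byte, and rewind the stream
// so its position sits just after the terminator.
template<typename SeekableReader>
std::string ReadNullTerminatedStringChunked(SeekableReader& reader, std::size_t chunkSize = 64)
{
    std::string result;
    std::vector<char> chunk(chunkSize);

    while (true) {
        const std::size_t bytesRead = reader.ReadPartial(chunk.data(), chunk.size());
        if (bytesRead == 0) {
            return result;  // end of stream before a terminator was found
        }

        // Only search the bytes that were actually read
        const auto* terminator = static_cast<const char*>(std::memchr(chunk.data(), '\0', bytesRead));
        if (terminator == nullptr) {
            result.append(chunk.data(), bytesRead);
            continue;  // no terminator in this chunk, keep reading
        }

        const std::size_t stringBytes = static_cast<std::size_t>(terminator - chunk.data());
        result.append(chunk.data(), stringBytes);

        // Rewind the extra bytes so the next read starts after the terminator
        reader.SeekBackward(bytesRead - stringBytes - 1);
        return result;
    }
}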

I'm wondering if this can be implemented in a way that allows for greater efficiency than character-by-character reading, while also being reasonably general. Maybe different implementations depending on the stream type? Or restricted to more capable streams? I suppose we can start with a general implementation that works anywhere, and then provide overrides later on for more efficient use of more capable streams.

Brett208 commented 4 years ago

Maybe the seekable reader could override the base reader's version to allow optimization by reading multiple chars at once and then trimming the result back to the null terminator (seeking the stream back accordingly). In that case, it may also be wise to add a parameter that represents the typical upper bound of the expected string size for the application.

For example, you could probably predict the average size of an exported variable name.

I'm not really interested in doing this optimization right now, partly because we don't have a proper stopwatch in the library. The one I proposed in the other branch works well for long cycles like executing a console program but I'm not sure it would be useful on this scale.

Maybe this issue can be to write the non-seekable version and a new issue can be opened if you think the seekable version is worth pursuing in the near future.

DanRStevens commented 4 years ago

I think a generic base class method would do well for now.

If we want to optimize for the subclass, it can override the base class method. If we want to optimize for buffer size, we can perhaps create a template method override, which takes the buffer size as a template parameter. For ease of use, we should probably set a default buffer size.
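
A minimal sketch of that shape, with illustrative class names rather than the existing OP2Utility types:

#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative names only. The base reader keeps the generic
// character-by-character implementation; the seekable reader adds a
// template overload whose internal buffer size is a compile-time parameter.
class Reader {
public:
    virtual ~Reader() = default;

    virtual std::string ReadNullTerminatedString(std::size_t maxCount = SIZE_MAX)
    {
        std::string str;
        char c;
        for (std::size_t i = 0; i < maxCount; ++i) {
            Read(c);
            if (c == '\0') {
                break;
            }
            str.push_back(c);
        }
        return str;
    }

protected:
    virtual void Read(char& c) = 0;
};

class SeekableReader : public Reader {
public:
    // A template member cannot be virtual, so the virtual method simply
    // forwards to the template overload with a default buffer size
    std::string ReadNullTerminatedString(std::size_t maxCount = SIZE_MAX) override
    {
        return ReadNullTerminatedString<64>(maxCount);
    }

    template<std::size_t BufferSize>
    std::string ReadNullTerminatedString(std::size_t maxCount = SIZE_MAX)
    {
        // A buffered implementation (as sketched earlier) would go here;
        // this outline just falls back to the generic version
        return Reader::ReadNullTerminatedString(maxCount);
    }
};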

We can split optimization work into a new issue. (Assuming we plan to eventually do the work).