basicpp17 / co-cpp19

C++20 Library with the fastest runtime and compile times
MIT License

serialize19::ReadArchive: how to catch memory read access violation #6

Open fred777 opened 1 year ago

fred777 commented 1 year ago

I'm using serialize19 to create and load binary files of structured data.

But if the file format changes, e.g. because data members have been added / removed, or if the file gets corrupted in some way, the archive reader might cause a read access exception which can't be caught without messing with signal handlers.

=> is there a clean and reliable way to detect and handle this condition without application crash?

#include "serialize19/ReadArchive.h"
#include "serialize19/dynamicWrite.h"
#include "serialize19/serialize.std_string.h"

struct OldFileFormat
{
    int a{};
    int b{};
    std::string s{};
};

struct NewFileFormat
{
    int a{};
    int b{};
    int c{};
    std::string s{};
};

template<class A>
void serialize(A& a, OldFileFormat& x)
{
    serialize(a, x.a);
    serialize(a, x.b);
    serialize(a, x.s);
}

template<class A>
void serialize(A& a, NewFileFormat& x)
{
    serialize(a, x.a);
    serialize(a, x.b);
    serialize(a, x.c);
    serialize(a, x.s);
}

int main()
{
    OldFileFormat old_data{1, 2, "test"};

    const auto buffer_old = serialize19::dynamicWrite(old_data); 

    // the buffer contents usually get saved into a binary file:
    // const auto slice = buffer_old.slice();
    // std::ofstream fout{filename, std::ios::binary};
    // fout.write(reinterpret_cast<const char*>(slice.begin()), slice.count());
    //
    // ... and later on we read the file back into a buffer slice via
    // std::ifstream fin{filename, std::ios::binary};
    // const auto file_buffer = std::vector<uint8_t>{std::istreambuf_iterator(fin), {}};
    // const auto file_buffer_slice = serialize19::BufferSlice{file_buffer.data(), file_buffer.size()};
    //
    // here we just skip this step and hand over the already existing buffer slice 

    auto reader = serialize19::ReadArchive{buffer_old.slice()};
    NewFileFormat new_data;

    try
    {
        serialize(reader, new_data);
    }
    catch (const std::exception& ex)
    {
        // We want to catch all read failures here in a clean way, but an
        // out-of-bounds read raises a hardware exception (access violation)
        // that plain C++ try/catch cannot handle without a dirty signal handler.
    }
}
arBmind commented 1 year ago

Hi @fred777, good to hear that you are using serialize19. Serialize19 was designed to be minimal and extendable. So we only pay for what we actually use. Only the data you serialize is stored in the binary.

We encode the version of the data schema in a header and use the correct serialization method for that version; we basically hash our data layout. But that approach is far too complicated for this library.

You may store a version indicator in your format and decide which parts to load. Example:

struct Format {
  int a{};
  int b{};
  int c{};
  std::string s{};
};
template<class A>
void serialize(A& a, Format& x)
{
    uint8_t version = 2;
    serialize(a, version);
    serialize(a, x.a);
    serialize(a, x.b);
    if (version == 2) serialize(a, x.c);
    serialize(a, x.s);
}

I would also recommend wrapping serialize19 to tailor it to your use case, for example using an operator instead of calling serialize manually.
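The operator idea can be sketched in isolation like this. Note that `BinaryWriter` and the `operator&` overloads are invented names for illustration, not part of serialize19's actual API:

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical wrapper: members chain through an operator instead of one
// serialize(a, ...) call per member.
struct BinaryWriter {
    std::vector<uint8_t> bytes;

    template<class T>
    BinaryWriter& operator&(const T& v) {
        static_assert(std::is_trivially_copyable_v<T>, "primitive expected");
        const auto* p = reinterpret_cast<const uint8_t*>(&v);
        bytes.insert(bytes.end(), p, p + sizeof(T)); // append raw bytes
        return *this;
    }
};

struct Point { int x{}; int y{}; };

// One overload per user type replaces a hand-written serialize() function.
inline BinaryWriter& operator&(BinaryWriter& w, const Point& p) {
    return w & p.x & p.y;
}
```

A real wrapper would forward to serialize19's archives instead of copying bytes itself; the point is only the fluent call style.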

arBmind commented 1 year ago

Regarding the exceptions: we try to avoid them. serialize assumes that the buffer has enough capacity to read or write the data. If you cannot ensure that, you can wrap the basic archive types and throw your own exception when the size is insufficient. ReadArchive has two interesting methods you might want to guard.

  1. withPrimitive is mandatory to guard: SliceReader would otherwise reinterpret_cast beyond the end of the buffer.
  2. withSlice will return a slice smaller than the requested n if not enough data is available.
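The guard for point 1 might look like this as a standalone sketch. `CheckedReader` and `ShortBuffer` are illustrative names, not serialize19 types; the idea is simply to bounds-check before every primitive read:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Before every primitive read, compare the remaining byte count against
// sizeof(T) and throw a catchable C++ exception instead of letting a
// reinterpret_cast run past the end of the buffer.
struct ShortBuffer : std::runtime_error {
    ShortBuffer() : std::runtime_error{"buffer too small"} {}
};

class CheckedReader {
public:
    CheckedReader(const uint8_t* data, std::size_t count)
        : cur_{data}, end_{data + count} {}

    template<class T>
    T readPrimitive() {
        if (static_cast<std::size_t>(end_ - cur_) < sizeof(T)) throw ShortBuffer{};
        T v;
        std::memcpy(&v, cur_, sizeof(T)); // memcpy also sidesteps alignment issues
        cur_ += sizeof(T);
        return v;
    }

private:
    const uint8_t* cur_;
    const uint8_t* end_;
};
```

A corrupted or truncated file then surfaces as an ordinary `ShortBuffer` exception rather than an access violation.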
arBmind commented 1 year ago

While reading the code again, I guess we should add more documenting comments to this sublibrary.

fred777 commented 1 year ago

Hi @arBmind, thank you for the instant reply!

Yes, versioning is one mitigation I was thinking about. But this won't protect me from crashes caused by file corruption of course.

Another solution that just came to my mind could be similar to the SizeArchive: how about an archive reader that just probes the buffer to check whether there are enough bytes in it - let's call it SizeVerificationArchive.

Then it's up to the user to use that SizeVerificationArchive prior to ReadArchive or not.
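A minimal standalone sketch of that idea (all names are hypothetical; a real version would mirror serialize19's archive interface so the existing serialize() overloads work unchanged):

```cpp
#include <cstddef>
#include <type_traits>

// The probe never touches the buffer contents: it only adds up how many
// bytes the schema would consume and compares that to what is available.
struct SizeVerificationArchive {
    std::size_t available{};  // bytes actually present in the buffer
    std::size_t needed{};     // bytes the schema would read

    bool ok() const { return needed <= available; }
};

// Only fixed-size primitives are handled here; variable-length fields such
// as std::string would need their length prefix probed and validated first.
template<class T>
void serialize(SizeVerificationArchive& a, T&) {
    static_assert(std::is_trivially_copyable_v<T>, "primitive expected");
    a.needed += sizeof(T);
}
```

The user would run this probe over the target struct first and hand the slice to the real ReadArchive only when ok() returns true.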

arBmind commented 1 year ago

Hi @fred777, your SizeVerificationArchive is not a bad idea. My gut feeling says it is not enough on its own and therefore somewhat dangerous. Reading files or network traffic is like reading user input: it can be broken or even come from an attacker. A serialization library cannot protect against this; we simply don't have enough domain knowledge.

I have an extension to serialize19 in one project that looks like this:

enum class ReadResult {
  Success,
  VersionMismatch, // e.g. unknown version
  LengthMismatch, // e.g. serialized size != expected size
  SizeMismatch, // e.g. buffer has not enough or too much data
  ConstraintViolation, // e.g. enum value out of range
};

struct ReadArchive : serialize19::ReadArchive<endianBehaviour()>{
  ReadResult result{};
};

I hope that inspires you to go further.

I would appreciate and accept a pull request for a dedicated SizeVerificationArchive.

fred777 commented 1 year ago

Pull request - maybe later - I just found a solution that works for my purposes: enable SEH exceptions in VC :) Now I'm catching either a std::bad_alloc or something generic via catch(...), depending on the type of buffer corruption.
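For anyone landing here later: in MSVC this means compiling with /EHa (asynchronous exception handling) so structured exceptions such as access violations can reach C++ catch blocks; `_set_se_translator` can additionally turn them into typed C++ exceptions. A minimal MSVC-only sketch (not portable, shown only as illustration):

```cpp
// MSVC-only sketch; requires compiling with /EHa. _set_se_translator turns a
// structured exception (e.g. an access violation) into a C++ exception that
// an ordinary catch block can handle.
#include <eh.h>        // _set_se_translator (MSVC)
#include <stdexcept>

void install_se_translator() {
    _set_se_translator([](unsigned int /*code*/, _EXCEPTION_POINTERS*) {
        throw std::runtime_error{"structured exception (e.g. access violation)"};
    });
}
```

This is a last line of defence; explicit bounds checking as discussed above remains the cleaner fix.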

But regarding user input / attacks, the proposed buffer size checking would at least provide a first line of defence to prevent application crashes.

At a second stage, validity of all data members needs to be verified of course.