Matroska-Org / libebml

a C++ library to parse EBML files
GNU Lesser General Public License v2.1
140 stars 47 forks source link

[RFC] keep track of minver and maxver per element #207

Closed robUx4 closed 9 months ago

robUx4 commented 10 months ago

They correspond to the attributes in the EBML specs.

For they are just informational. We may use this information when reading data to find which element are allowed given the read/version in the EBML header, that can help with error recovery, skipping elements that are not legit given the context.

robUx4 commented 10 months ago

Like I wrote elsewhere, I'm not a big fan of the libraries constraining the elements that can be read or written depending on the version. The application should deal with it. Or there should be a switch to turn such a feature off & let the application deal with it. My legitimate use case is writing the EBML header with rather low version numbers & updating it later on when elements requiring newer versions to be written.

Good because that's also my opinion. In VLC we will read invalid elements no matter what.

The API allow to check which elements are valid and maybe skip them if they are invalid. It will probably something like ShouldWrite but for... reading. The host app will be in total control of what to read (mkvalidator could report when a illegal element is read, for example) but the default will be the same as now.

mbunkus commented 10 months ago

As I'm confused with the logic in the other PRs wrt. deprecated elements, I thought it would be good to have a function bool IsDeprecated() const and/or bool IsDeprecatedInVersion(num_version version) const directly in EbmlDocVersion. In that function you could add a comment describing how deprecation is expressed with the existing values. Other places would not need to know about the internals & can just call the function.

Additionally functions such as const IsValidInVersion(num_version version) const, num_version GetMinVersion() const, num_version GetMaxVersion() would be nice to have.

The goal is to hide all the internals of EbmlDocVersion from users of the class. Leave how values are stored/used up to the class, don't bother others with it.

And a last nitpick: I'm not the biggest fan of using unsigned char here. Member variables will be aligned to sensible boundaries (32bit?) for faster access on modern CPUs anyway. Just make that unsigned int.

Maybe also rename num_version to version_type to make it optically much clearer that this is a type. Or even value_type; would also match how things are named in the standard library (e.g. std::vector<T>::value_type).

robUx4 commented 10 months ago

Are you sure the member variables are usually aligned ? When I compile with AppleClang 15 this test passes in debug and release build:

  static_assert(sizeof(EbmlDocVersion) == 2, "nope");

I think since there's no bigger type, it keeps the smallest alignment which is 1 octet here.

It's all about the balance between the size in memory and the speed of the code. For 200 elements, that would add 1200 bytes in memory with code a little bit faster. Also libmatroska will subclass this class with 2 boolean which are also 1 octet each. That means the size would go to 32 octets instead of 4 octets if we force int in EbmlDocVersion, which means an extra 5600 octets used in memory.

IMO unless the code is much faster it's not worth the extra memory.

robUx4 commented 10 months ago

One reason to use a bigger storage size would be to avoid the 256 doctype version limit. It's clearly not on the horizon for Matroska. But maybe other EBML formats might need it.

mbunkus commented 10 months ago

Are you sure the member variables are usually aligned ?

Absolutely. That's why you need things such as __attribute__((__packed__)) or the MSVC equivalent of #pragma pack(push, 1) if you want a struct whose memory layout exactly represents something, e.g. structures in a file format.

It's quite possible that sizeof(EbmlDocVersion) is 2, but when you use an EbmlDocVersion inside another class then the following member variables in that embedding class will be aligned.

Here's an example that shows both effects:

[0 mosu@sweet-chili ~/test] cat cpp1.cpp
#include <cstdint>
#include <iostream>

struct s1 {
  uint32_t a;
  unsigned char b;
  uint32_t c;
};

struct __attribute__((__packed__)) s2 {
  uint32_t a;
  unsigned char b;
  uint32_t c;
};

struct embedding {
  s1 a;
  s2 b;
  s1 c;
};

int
main() {
  std::cout << "s1: " << sizeof(s1) << " s2: " << sizeof(s2) << " embedding: " << sizeof(embedding) << "\n";
  return 0;
}
[0 mosu@sweet-chili ~/test] ./cpp1
s1: 12 s2: 9 embedding: 36

One interesting fact about why these values are aligned is that a lot of CPUs cannot even read 32bit values from non-aligned addresses (they might end the process with a SIGBUS), or only with huge performance penalties.

It's a thing, this alignment.

robUx4 commented 9 months ago

I'm familiar with the alignment, but that's not exactly what is happening here. As said above the current size of EbmlDocVersion is 2. I did the following class:

class VersionExt : public EbmlDocVersion
{
    public:
        unsigned int val;
};

If the padding of the parent class was overriden, we would have a sizeof of 12. But the sizeof is 8. So the original class does keep its size of 2, the 4 of the unsigned int are added, and there's a padding of 2 to round to a power of 2.

Also this gives a sizeof 12:

class VersionExt : public EbmlDocVersion
{
    public:
        unsigned int val;
        unsigned char a;
};

When the same fields in a different order gives 8:

class VersionExt : public EbmlDocVersion
{
    public:
        unsigned char a;
        unsigned int val;
};

As we're changing the ABI that's something we should keep in mind as we may reorder elements to save on memory at no cost.

I'll switch to unsigned int, not for memory alignment/performance but simply because it allows more values.

mbunkus commented 9 months ago

LOL, Github thought I had merged this one when I pushed the libEBML branch… anyway, both are merged & pushed.