Matroska-Org / libebml

a C++ library to parse EBML files
GNU Lesser General Public License v2.1
140 stars 47 forks source link

handle non-mandatory Empty Elements with default values correctly #72

Open mbunkus opened 3 years ago

mbunkus commented 3 years ago

In section "6.1. Data Size Format" the EBML RFC states:

If an Empty Element has a default value declared, then the EBML Reader MUST interpret the value of the Empty Element as the default value.

At the moment this simply isn't done. For example, the EbmlUInteger::ReadData function always uses 0 if it's an Empty Element. This works for elements whose default value is 0, of course, but not for others.

BTW: the EbmlUInteger::RenderData function doesn't take default values into account either. To be honest, I prefer it that way as it will be less confusing to not fully spec-compliant EBML readers (such as libebml in its current state).

mbunkus commented 3 years ago

EbmlUInteger was just an example. All the other data type classes don't handle that case either.

robUx4 commented 3 years ago

I can confirm this bug. Luckily it is fixable without changing the ABI. There is DefaultISset() to tell if an element has a default value. And when the size is zero we should use that default value DefaultValue instead of 0/"".

It seems libebml2 (used in mkclean and mkvalidator) has the same issue with a twist, it's possible to set elements to their default value when they are created. Except it's not used...

robUx4 commented 3 years ago

Matroska elements affected by this bug:

Elements with a default value of 0 are not affected.

robUx4 commented 3 years ago

BTW, are there files existing with a Length of 0 (to save space) ? Does mkvmerge use this feature for example ? It doesn't seem possible with libebml, even though the EbmlElement::SizeLength variable is initialized to 0. That variable is used exclusively with CodedSizeLength(), even in libmatroska.

In that function it's used as

  if (SizeLength > 0 && CodedSize < SizeLength) {
    // defined size
    CodedSize = SizeLength;
  }

Where CodedSize is always 1 and 5. So a EbmlElement::SizeLength value of 0 has no effect here. And the function never returns 0.

mbunkus commented 3 years ago

mkvmerge does not use that feature. I don't know if there's any implementation that does. I'm not even sure if the original specs even say what value to use in this case. My guess is that all existing implementations simply return 0 for numeric types & an empty string for string types. Just like libebml.

I haven't seen a single file using this shortcut for elements-with-default, at least not consciously. I have very vague memory of seeing files which wanted to store 0 / the empty string, though I really don't have any more details on that.

robUx4 commented 3 years ago

Calling out @dericed who has a huge corpus of MKV files he can check. @mkver do you know if libavformat uses the "zero length" feature when muxing ?

Our website never explained how it's supposed to be used, nor the dedicated libebml/EBML website.

mkver commented 3 years ago
  1. The original specs disallow writing integers and floats on zero bytes:

The known basic types are:

This is good as it means that this new feature can't break old valid files (at least for integers and floats).

  1. libavformat doesn't write floats or integers on zero bytes. But it might write a zero-length element for a string/utf-8 string with strlen == 0 or for a binary element of size zero. Our demuxer btw also treats zero-length integers as having the value zero. I intend to work on this in the coming days.
  2. I might be the reason for mbunkus' vague memory: I wanted to (and I still want to) create attachments with an empty (i.e. strlen == 0) filename if the source does not provide a filename because otherwise remuxing attached pictures from containers that don't have the concept of a filename for attached pictures fails. The relevant patchset is 95% finished, but for about a year I am too lazy/busy to finish it.
robUx4 commented 3 years ago

So that should leave integers out of this issue. We can assume they were never written with a 0 length. It may not be the case for all muxers but they're probably not read correctly with well known demuxers.

That leaves \Segment\Tracks\TrackEntry\Language as the only element that can potentially have issues. I know in VLC we assign the default value to the variable we read it from. If the size is 0 that value wouldn't change. In libavformat it seems to be similar. The default value is first assigned and then the actual data read. This element being such an important element I can hardly imagine a demuxer would fail to use its default value properly when reading, especially since in many cases the value is not written. And it's highly unlikely a length of 0 would be used.