MPEGGroup / FileFormat

MPEG file format discussions

Clarify the validity of `utf8string` with multiple NUL characters #35

Closed by baumanj 1 year ago

baumanj commented 3 years ago

Based on this text from ISOBMFF (ISO/IEC 14496-12:2020) § 4.2.1:

In these definitions, null-terminated means that the last character of a string is Unicode NUL, and hence an empty string is represented by a single Unicode NUL.

It's not entirely clear what to do with a value containing multiple NUL characters.

Consider the following valid hdlr box:

$ hexdump -C -s 44 -n 40 ~/Pictures/green.avifenc.avif 
0000002c  00 00 00 28 68 64 6c 72  00 00 00 00 00 00 00 00  |...(hdlr........|
0000003c  70 69 63 74 00 00 00 00  00 00 00 00 00 00 00 00  |pict............|
0000004c  6c 69 62 61 76 69 66 00                           |libavif.|

If the "a" in "libavif" were replaced with NUL, would it still be valid? And if so, would the value of the name field of the hdlr box correctly be "lib" or "lib\0vif"?
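To make the two readings concrete, here is a minimal Python sketch that parses the 40 bytes from the hexdump above and then applies both interpretations to the modified name. The function names (`split_hdlr`, `name_first_nul`, `name_last_nul`) are hypothetical and only for illustration; the field offsets assume a version-0 HandlerBox with no 64-bit size.

```python
import struct

# Bytes copied from the hexdump above (40 bytes, starting at file offset 0x2c).
HDLR = bytes.fromhex(
    "00000028 68646c72 00000000 00000000"
    "70696374 00000000 00000000 00000000"
    "6c696261 76696600"
)


def split_hdlr(box: bytes):
    """Return (handler_type, raw_name_bytes) for a version-0 hdlr box."""
    size, box_type = struct.unpack_from(">I4s", box, 0)
    if box_type != b"hdlr" or size != len(box):
        raise ValueError("not a complete hdlr box")
    # Layout: size (4), type (4), version/flags (4), pre_defined (4),
    # handler_type (4), reserved (12), then the name field at offset 32.
    return box[16:20], box[32:size]


def name_first_nul(raw: bytes) -> str:
    """Interpretation A: the first NUL terminates the string."""
    return raw.split(b"\x00", 1)[0].decode("utf-8")


def name_last_nul(raw: bytes) -> str:
    """Interpretation B: only the final byte is the terminator."""
    if not raw.endswith(b"\x00"):
        raise ValueError("name is not NUL-terminated")
    return raw[:-1].decode("utf-8")


handler, raw = split_hdlr(HDLR)
modified = raw.replace(b"a", b"\x00", 1)  # "libavif\0" becomes "lib\0vif\0"

print(handler)                         # b'pict'
print(repr(name_first_nul(raw)))       # 'libavif'
print(repr(name_first_nul(modified)))  # 'lib'
print(repr(name_last_nul(modified)))   # 'lib\x00vif'
```

For the unmodified box both readings agree; they only diverge once a NUL appears before the final byte.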

cconcolato commented 3 years ago

I think the first use of NUL ends the field; otherwise it would be ambiguous. Consider the mett box:

class TextMetaDataSampleEntry() extends MetaDataSampleEntry ('mett') {
    utf8string content_encoding; // optional
    utf8string mime_format;
    TextConfigBox (); // optional
}

If what is encoded is a\0b\0c\0, how do you know which bytes belong to which field?
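As a rough illustration of that ambiguity, the following Python sketch enumerates every way to cut a\0b\0c\0 into two fields that each merely end in NUL, which is all the current wording requires. The helper `splits_last_nul` is hypothetical, and for simplicity it assumes both content_encoding and mime_format are present; whatever is left over is where a TextConfigBox would have to start.

```python
def splits_last_nul(payload: bytes):
    """List (content_encoding, mime_format, remainder) candidates.

    Each of the two string fields only has to end in NUL, so any NUL
    in the payload is a possible field boundary.
    """
    # Offsets just past each NUL byte: the places where a field may end.
    nul_ends = [i + 1 for i, b in enumerate(payload) if b == 0]
    results = []
    for end1 in nul_ends:
        for end2 in nul_ends:
            if end2 > end1:
                results.append(
                    (payload[:end1], payload[end1:end2], payload[end2:])
                )
    return results


candidates = splits_last_nul(b"a\x00b\x00c\x00")
for content_encoding, mime_format, remainder in candidates:
    print(content_encoding, mime_format, remainder)
# Three candidate parsings are printed:
# b'a\x00' b'b\x00' b'c\x00'
# b'a\x00' b'b\x00c\x00' b''
# b'a\x00b\x00' b'c\x00' b''
```

Only the first candidate survives if the first NUL is taken to end each field.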

I think we should replace:

null-terminated means that the last character of a string is Unicode NUL

with

null-terminated means that the first Unicode NUL character terminates the string

dwsinger commented 2 years ago

Change the text to say that the string is terminated by the first NUL character and that there shall be at least one such character.
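A reader following that wording could look roughly like the Python sketch below. `read_utf8string` is a hypothetical helper, not taken from any existing parser: it stops at the first NUL and rejects input that has none, and it returns the offset of the next field so consecutive strings can be read unambiguously.

```python
def read_utf8string(buf: bytes, offset: int = 0):
    """Return (text, next_offset) for the utf8string starting at offset.

    The string ends at the first NUL at or after offset; a missing NUL
    is treated as an error, matching "there shall be at least one".
    """
    end = buf.find(b"\x00", offset)
    if end == -1:
        raise ValueError("utf8string has no NUL terminator")
    return buf[offset:end].decode("utf-8"), end + 1


payload = b"a\x00b\x00c\x00"
content_encoding, pos = read_utf8string(payload)
mime_format, pos = read_utf8string(payload, pos)
print(content_encoding, mime_format, pos)  # a b 4
print(read_utf8string(b"\x00"))            # ('', 1)
```

With this rule an empty string is still a single NUL, and the bytes after offset 4 are left for whatever follows the two strings.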

cconcolato commented 1 year ago

This is clarified in the 8th edition. https://dms.mpeg.expert/doc_end_user/documents/140_Mainz/wg11/MDS21996_WG03_N00717.zip