FourCCs, UTF-8, and ABIF files

robla commented 3 years ago

The BLUF: I would like to impose the following requirements on ABIF files.

All ABIF files are valid UTF-8 byte streams
It should be possible to identify the line type of each line of an ABIF file by inspecting the first four bytes of the line (the "One2FourBC", as I'd like to call it)
Each ABIF line type will have a structure that MUST be defined by a self-contained BNF
All ABIF line types can be identified by the first byte (or maybe two, or MAYBE even four, but absolutely no more than four). We should strive to keep it that way.
Identifier tokens (e.g. "bare candidate tokens" in the current nominclature) should be first-class citizens inside an ABIF file.

Sorry for all of the acronyms and jargon, but I'll try to explain the rationale for all five of these below.

The details:

It seems to me that ABIF is taking on a structure that has general applicability to text-based file formats. It seems unwise to try inventing a new generalized text-based data structure format, since there already so many (RFC 822, XML, JSON, YAML, TOML, etc). However, it also seems to me that fear of "reinventing the wheel" (or rather, for the metaphor I'm making: "reinventing the hammer" as a tool) has led to misapplication of existing tools when a new tool may be more appropriate.

It may be that a text-based data format similar to (or the same as) what I'm suggesting here has been more adequately defined elsewhere. I am not hoping that what I'm defining here is new and unique, but as of this writing (in June 2021), I am not aware a generalized text-based data structure format similar to this.

ABIF is encoded using UTF-8, so let me interject with few essential bits of Unicode important to this conversation. "UTF-8" is a character-encoding format that encodes each character of a text document in a sequence of up to four bytes, but typically (for English) only one byte per character. This has become less true for English as support for UTF-8 has slowly replaced ASCII as the baseline character encoding format. The transition has been slow because UTF-8 is a stricty-compatible superset of lower ASCII. By "lower ASCII", I'm referring to the UTF-8 characters "U+0000" through "U+007F" which are codepoints in the "Basic Latin Unicode Block". UTF-8's Basic Latin encoding is compatible with ASCII's encoding in that range of characters. The transition from ASCII-only to full UTF-8 has accelerated in recent years in no small part because "plain text" editors have gained support for characters beyond Basic Latin (like “fancy” quotation marks and emojis like “🐕” and “🧔”)

Prior to the broad acceptance of UTF-8 as a method for encoding text, and prior to the broad acceptance of XML and JSON (and other text-based formats for data structures), it was common to use the "FourCC" byte sequence as a technique for defining the structure of the data following the sequence. "FourCC" stands for "four character code", but it is really a "four byte code" rather than four characters, and were typically restricted to ASCII bytes. I believe that FourCCs are in still in common use today in binary formats (e.g. .mp4 files and .webm files), but it's been been a while since I've looked at binary file formats very closely. Regardless, four-bytes is the same as 32 bits, which capable of expressing 4,294,967,296 values. I do not anticipate needing more than a dozen line types with ABIF, but if anyone else wishes to create text format using these ideas, it's something to keep in mind.

I've never been much of a C programmer or assembly-language programmer, but I believe I'd be able to write an efficient byte-level tokenizer for ABIF files where each line conformed to the following quasi-BNF (where "BNF" is "Backus-Naur Form")

expression	meaning
`<line>`	the full line, terminated by a line feed character ("`<LF>`") (or optionally "`<CR><LF>`"). I believe the BNF production looks something like this: "`<One2FourBC> <LSD> (<CR>)? <LF>`"
`<One2FourBC>`	Short for "one to four byte code" (similar to "FourCC"s in other formats). This code comprises between one to four bytes which identifies the line type. We may decide to abbreviate this "`<124BC>`", but let's not make that change yet. The "`<One2FourBC>`" code may contain line-specific data to prepend to the following "`<LSD>`"
`<LSD>`	Whoa, man! That's trippy! FAR OUT!!!!1!!1! 🤪 Okay, just kidding. This refers to "line-specific data". The structure of the line-specific data depends on the contents of the `<One2FourBC>`.
`<CR>`	"`U+000D`" -- The "carriage return" character in the Basic Latin Unicode Block.
`<LF>`	"`U+000A`" - The "line feed" character from the Basic Latin Unicode Block. The minimum number of bytes for newline in a modern text file.

For each "One2FourBC" we define, we are going to need to create a BNF specification for that line. Creating a BNF is not that hard, and in fact, we should be able to test our BNFs using BNF parsers like the Python-based SimpleParse. But we also shouldn't relish the idea of creating a lot of line formats, because we need to keep ABIF simple enough to be readable by non-developers (as well as developers who don't want to implement overly-complicated text-formats).

The way that I see ABIF evolving is that we will have different tiers of data that people will want to pull out of the file:

Structural data - this is structural information that all ABIF implementations will need to deal with (e.g. identifiers, delimiters, newlines). It should be possible to know exactly how many ballots are in an ABIF file using an implementation that only has implemented support for the structural data. Identifiers should count as "structural data" (more on this in a bit).
Broadly-applicable application data - this is data that almost all ABIF implementations will be interested in. For example, it should be possible for applications to generate a list of UTF-8-formatted candidate names and the quantity of ballots that the candidate appears on by having implemented support for reading (and understanding) broadly-applicable application data.
Domain-specific essential application data - this is data that a significant number of implementors require (e.g. half of all ABIF implementors), and may also be of little interest ot implementors outside the domain of use for the ABIF file. For example, a "STAR voting" implementation may only be interested in the ratings given to each candidate, and may not care about the order that the candidates are presented in each ballot bundle. An "Instant-Runoff Voting (IRV)" implementation may only care about the ordering of the candidates in the ballot bundles, but cannot include a rating (because the voter may not have provided one). For interoperability and readability purposes, we may want to strongly encourage implementors to use the ">" and "=" delimiters between candidates in ballot bundles (rather than ",") and strongly encourage implementors to list candidates in order of most preferred to least preferred within each ballot bundle. Regardless, for ABIF to be successful, we will need to determine which domains the most important ones to serve for ABIFv1.0.
Niche-implementation application data - this is application data that is essential to a small number of implementations, but is not important to the vast majority of applications that support the format. For example, a voting machine may support images associated with the candidates, and may wish to use ABIF as the format to display the candidate list for voters. We could add a "CandidateIMG" field and associate it with the candidate identifier.

Anyway, that's a lot to consider, but I still have one other thing to discuss. One thing that I've come to realize about many popular text-based serialization formats: the identifiers didn't start out as first class citizens. Within XML and JSON, it seems that identifiers were bolted on at the end of the specification process. The mechanism that I proposed for candidate identifiers in ABIF issue #8 seem like a general purpose mechanism for all ABIF-like formats. Here's an example of the markup:

=DGM:[Doña García Márquez] # see marquez2024.com for candidate website
=SBJ:[Steven B. Jensen]    # dropped out of race three days prior to election
=SY:[Sue Ye (蘇業)]         #  see sueye.org/2024 for more
=AM:[Adam Muñoz]           #  see munozftw.org for more

It seems to me that the format should treat a line of this format to be an "identifier" for all sorts of purposes. I don't know of anything other than polticians that would need identifiers in ABIF, but it seems to me that the BNF production for identifiers should be similar to (or perhaps the same as) that of XML identifiers (like the "Name" production out of the original XML specification from 1998)

[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*

I think all bare, unquoted identifiers in ABIF should start with an ASCII letter (or maybe an underscore, but probably not a colon). What happens after the first character can be more flexible, but probably not as flexible as XML.

Anyway, that's a lot of words to get to my "BLUF" above. Restating the bullet points I led with:

I believe that ensuring that all ABIF files are valid UTF-8 byte streams is more difficult than it appears, but we need to do it so that we can play cards with ABIF files (🂡🂵🃘🃙🃟...😝). More importantly, so that Doña García Márquez can run for office using her real name.
The FourCC at the beginning of each line (a.k.a. the "<One2FourBC>") should be enough to identify which BNF production is being processed. Speaking of BNFs...
Each ABIF line type will have a structure that MUST be defined by a self-contained BNF. This will make it so that each line can be parsed independently (e.g. so that an NDJSON line can be passed to a JSON parser)
As identified in the second bullet, all ABIF line types need to be identified by the first byte (or maybe two, or MAYBE even four, but absolutely no more than four). This makes it easier for implementations that are handling ABIF streams byte-by-byte, such as implementations written in C for performance reasons.
Identifier tokens should be first-class citizens inside an ABIF file or any text-based data format. They are part of the "structural data" of the file.

Are these five requirements good requirements for ABIF? Please let me know!

brainbuz commented 3 years ago

Over on #6 there is some related discussion, where possibly a data line will begin with a digit, a comment with a #, a header line with an alpha character, and for special cases to begin with a another character possibly !

It would seem to be a consensus that the files should be utf8, with most input restricted to a limited character set except in delimited string blocks. I would further like to qualify that as visible characters and the standard space character.

robla commented 3 years ago

It would seem to be a consensus that the files should be utf8, with most input restricted to a limited character set except in delimited string blocks. I would further like to qualify that as visible characters and the standard space character.

I agree (assuming you're also okay with the full range of visible UTF-8 characters in comments). Inside of the string blocks and comments, I don't know for sure which characters to restrict or how to restrict them. Outside of delimited blocks, it looks like what you're calling for is pretty common, given the definiton of "VCHAR" in the "Core Rules" of RFC 5234

        HTAB           =  %x09
                                ; horizontal tab
         LF             =  %x0A
                                ; linefeed
[...]
         SP             =  %x20
         VCHAR          =  %x21-7E
                                ; visible (printing) characters

It appears as though HTAB, LF, SP, and VCHAR ought to be sufficient for structural syntax outside of square brackets and comments. Where can we find a definition of "visible UTF-8 character" that's useful to a developer looking at a bytestream?

brainbuz commented 3 years ago

Yes, of course comments should have the same flexibility as quoted strings. As an example, Perl has a regular expression class \p{XPosixGraph} for any character that is visible. But that's an example not a definition. A character with a defined visible representation, ie not a spacing character, control or unassigned character? We should probably also stipulate to ignore the CR character.

electorama / abif

FourCCs, UTF-8, and ABIF files #9