Open robla opened 3 years ago
Over on #6 there is some related discussion, where possibly a data line will begin with a digit, a comment with a #, a header line with an alpha character, and special cases with another character, possibly !
It would seem to be a consensus that the files should be utf8, with most input restricted to a limited character set except in delimited string blocks. I would further like to qualify that as visible characters and the standard space character.
I agree (assuming you're also okay with the full range of visible UTF-8 characters in comments). Inside of the string blocks and comments, I don't know for sure which characters to restrict or how to restrict them. Outside of delimited blocks, it looks like what you're calling for is pretty common, given the definition of "VCHAR" in the "Core Rules" of RFC 5234:
HTAB = %x09 ; horizontal tab
LF = %x0A ; linefeed
[...]
SP = %x20
VCHAR = %x21-7E ; visible (printing) characters
It appears as though HTAB, LF, SP, and VCHAR ought to be sufficient for structural syntax outside of square brackets and comments. Where can we find a definition of "visible UTF-8 character" that's useful to a developer looking at a bytestream?
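To make the byte-level reading of those four rules concrete, here is a minimal sketch in Python. The function names are hypothetical, and it assumes the check is applied only to bytes outside of square brackets and comments; it is not part of any ABIF specification.

```python
# Byte values taken from the RFC 5234 core rules quoted above.
HTAB = 0x09
LF = 0x0A
SP = 0x20
VCHAR = range(0x21, 0x7F)  # %x21-7E, inclusive of 0x7E


def is_structural_byte(b: int) -> bool:
    """Return True if byte value b is HTAB, LF, SP, or a VCHAR."""
    return b in (HTAB, LF, SP) or b in VCHAR


def first_bad_offset(data: bytes) -> int:
    """Return the offset of the first byte outside the allowed set, or -1."""
    for offset, b in enumerate(data):
        if not is_structural_byte(b):
            return offset
    return -1
```

Note that a carriage return (0x0D) is rejected by this sketch, so a decision about ignoring CR would need to be applied before a check like this one.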
Yes, of course comments should have the same flexibility as quoted strings. As an example, Perl has a regular expression class \p{XPosixGraph} for any character that is visible. But that's an example, not a definition. Perhaps: a character with a defined visible representation, i.e. not a spacing, control, or unassigned character? We should probably also stipulate that the CR character is ignored.
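One possible reading of "not a spacing, control, or unassigned character" can be sketched with Unicode general categories. This is an assumption rather than an agreed definition: the sketch treats every separator category (Z*) and every "other" category (C*, which covers control, format, surrogate, private-use, and unassigned) as not visible, and it may not match Perl's \p{XPosixGraph} exactly. It works on decoded text, so a bytestream would have to be decoded as UTF-8 first. The function names are hypothetical.

```python
import unicodedata


def is_visible(ch: str) -> bool:
    """Assumed definition: visible unless the general category is Z* or C*."""
    category = unicodedata.category(ch)
    return category[0] not in ("Z", "C")


def strip_cr(text: str) -> str:
    """Drop every carriage return, as suggested for CR handling."""
    return text.replace("\r", "")
```

The results depend on the Unicode database shipped with the Python version in use, so newly assigned characters may be classified as unassigned on older interpreters.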
The BLUF: I would like to impose the following requirements on ABIF files.
Sorry for all of the acronyms and jargon, but I'll try to explain the rationale for all five of these below.
The details:
It seems to me that ABIF is taking on a structure that has general applicability to text-based file formats. It seems unwise to try inventing a new generalized text-based data structure format, since there are already so many (RFC 822, XML, JSON, YAML, TOML, etc). However, it also seems to me that fear of "reinventing the wheel" (or rather, for the metaphor I'm making: "reinventing the hammer" as a tool) has led to misapplication of existing tools when a new tool may be more appropriate.
It may be that a text-based data format similar to (or the same as) what I'm suggesting here has been more adequately defined elsewhere. I am not hoping that what I'm defining here is new and unique, but as of this writing (in June 2021), I am not aware of a generalized text-based data structure format similar to this.
ABIF is encoded using UTF-8, so let me interject with a few essential bits of Unicode important to this conversation. "UTF-8" is a character-encoding format that encodes each character of a text document in a sequence of up to four bytes, but typically (for English) only one byte per character. This has become less true for English as support for UTF-8 has slowly replaced ASCII as the baseline character encoding format. The transition has been slow because UTF-8 is a strictly compatible superset of lower ASCII. By "lower ASCII", I'm referring to the UTF-8 characters "U+0000" through "U+007F", which are codepoints in the "Basic Latin Unicode Block". UTF-8's Basic Latin encoding is compatible with ASCII's encoding in that range of characters. The transition from ASCII-only to full UTF-8 has accelerated in recent years in no small part because "plain text" editors have gained support for characters beyond Basic Latin (like “fancy” quotation marks and emojis like “🐕” and “🧔”).
Prior to the broad acceptance of UTF-8 as a method for encoding text, and prior to the broad acceptance of XML and JSON (and other text-based formats for data structures), it was common to use the "FourCC" byte sequence as a technique for defining the structure of the data following the sequence. "FourCC" stands for "four character code", but it is really a "four byte code" rather than four characters, and FourCCs were typically restricted to ASCII bytes. I believe that FourCCs are still in common use today in binary formats (e.g. .mp4 files and .webm files), but it's been a while since I've looked at binary file formats very closely. Regardless, four bytes is the same as 32 bits, which is capable of expressing 4,294,967,296 values. I do not anticipate needing more than a dozen line types with ABIF, but if anyone else wishes to create a text format using these ideas, it's something to keep in mind.
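The byte counts and the 32-bit figure above can be checked with a few lines of Python. This is only an illustration; the helper name and the sample characters are my own choices, not anything defined by ABIF.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


# One sample character for each possible encoded length (1 to 4 bytes).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F415"]  # A, e-acute, euro sign, dog emoji
LENGTHS = [utf8_len(ch) for ch in SAMPLES]

# Bytes 0x00-0x7F decode to the same text as ASCII and as UTF-8.
BASIC_LATIN = bytes(range(0x80))
SAME_TEXT = BASIC_LATIN.decode("ascii") == BASIC_LATIN.decode("utf-8")

# Four bytes are 32 bits, so a four-byte code has 2**32 possible values.
FOUR_BYTE_VALUES = 2 ** (4 * 8)
```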
I've never been much of a C programmer or assembly-language programmer, but I believe I'd be able to write an efficient byte-level tokenizer for ABIF files where each line conformed to the following quasi-BNF (where "BNF" is "Backus-Naur Form"):
<line> [...] "<LF>") (or optionally "<CR><LF>"). I believe the BNF production looks something like this: "<One2FourBC> <LSD> (<CR>)? <LF>"
<One2FourBC> [...] "<124BC>", but let's not make that change yet. The "<One2FourBC>" code may contain line-specific data to prepend to the following "<LSD>"
<LSD> [...] <One2FourBC>.
<CR>: "U+000D" -- The "carriage return" character in the Basic Latin Unicode Block.
<LF>: "U+000A" -- The "line feed" character from the Basic Latin Unicode Block. The minimum number of bytes for newline in a modern text file.
For each "One2FourBC" we define, we are going to need to create a BNF specification for that line. Creating a BNF is not that hard, and in fact, we should be able to test our BNFs using BNF parsers like the Python-based SimpleParse. But we also shouldn't relish the idea of creating a lot of line formats, because we need to keep ABIF simple enough to be readable by non-developers (as well as developers who don't want to implement overly-complicated text-formats).
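As a rough sketch of what a byte-level first pass could look like, the Python below splits a bytestream on the line terminator (LF, with an optional CR before it) and then guesses the line type from its first byte. The classification borrows the tentative idea from #6 (digit for data, # for comment, letter for header, anything else special); that mapping and the function names are assumptions for illustration, not a settled part of the format.

```python
from typing import List


def split_lines(data: bytes) -> List[bytes]:
    """Split on <LF>, dropping one optional <CR> before each <LF>."""
    chunks = data.split(b"\n")
    if chunks and chunks[-1] == b"":
        chunks.pop()  # nothing follows the final <LF>
    # A final chunk without a terminating <LF> is still returned as a line.
    return [c[:-1] if c.endswith(b"\r") else c for c in chunks]


def classify(line: bytes) -> str:
    """Guess the line type from its first byte (assumed mapping from #6)."""
    if not line:
        return "empty"
    first = line[:1]
    if first.isdigit():
        return "data"
    if first == b"#":
        return "comment"
    if first.isalpha():
        return "header"
    return "special"
```

A real tokenizer would go on to read up to four leading bytes and dispatch to the BNF production for that line type; this sketch stops at the first byte because the longer codes are not defined yet.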
The way that I see ABIF evolving is that we will have different tiers of data that people will want to pull out of the file:
[...] ">" and "=" delimiters between candidates in ballot bundles (rather than ",") and strongly encourage implementors to list candidates in order of most preferred to least preferred within each ballot bundle. Regardless, for ABIF to be successful, we will need to determine which domains are the most important ones to serve for ABIFv1.0.
Anyway, that's a lot to consider, but I still have one other thing to discuss. One thing that I've come to realize about many popular text-based serialization formats: the identifiers didn't start out as first-class citizens. Within XML and JSON, it seems that identifiers were bolted on at the end of the specification process. The mechanism that I proposed for candidate identifiers in ABIF issue #8 seems like a general-purpose mechanism for all ABIF-like formats. Here's an example of the markup:
It seems to me that the format should treat a line of this format as an "identifier" for all sorts of purposes. I don't know of anything other than politicians that would need identifiers in ABIF, but it seems to me that the BNF production for identifiers should be similar to (or perhaps the same as) that of XML identifiers (like the "Name" production out of the original XML specification from 1998).
I think all bare, unquoted identifiers in ABIF should start with an ASCII letter (or maybe an underscore, but probably not a colon). What happens after the first character can be more flexible, but probably not as flexible as XML.
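A minimal sketch of that rule as a regular expression, in Python. The first-character class (ASCII letter or underscore, no colon) follows the suggestion above; the set of characters allowed after the first one (ASCII letters, digits, "_", "-", and ".") is purely my assumption, since that part is still open.

```python
import re

# First character: ASCII letter or underscore (no colon, no digit).
# Remaining characters: an assumed, deliberately conservative set.
BARE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_bare_identifier(text: str) -> bool:
    """Return True if the whole string matches the assumed identifier rule."""
    return BARE_IDENTIFIER.fullmatch(text) is not None
```

Widening the set of trailing characters later would only require changing the second character class.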
Anyway, that's a lot of words to get to my "BLUF" above. Restating the bullet points I led with:
[...] "<One2FourBC>") should be enough to identify which BNF production is being processed. Speaking of BNFs...
Are these five requirements good requirements for ABIF? Please let me know!