electorama / abif

The _Aggregated Ballot Information Format_ provides a concise, aggregated, text-based document to describe the ballots cast in range-based or ranked elections, as well as approval-based and choose-one balloting systems.

Define a core data model for ABIF #15

Open robla opened 3 years ago

robla commented 3 years ago

Many text formats aspire to simplicity, with the belief that data models are an "implementation detail". My inclination is to err in that direction, because I fear that trying to start discussion by agreeing on a serialized data model leads to this series of unfortunate reasoning:

  1. Let's agree on a data model before we agree on syntax
  2. Great, we have a data model, how do we serialize it?
  3. Why invent another data serialization format; why don't we use something like JSON or XML?
  4. Result: a large, complicated data hierarchy that is difficult/impossible to author with a text editor, and difficult to spot errors with human inspection.

Having seen the development of many "Document Object Models (DOMs)" over the years (including working closely with the folks defining a document object model for MediaWiki markup), I've been hesitant to tackle such a complicated issue so early in the development of a new format that seems so clear in my mind. However, I've come to realize that my ideas about the data that is "important" (or "interesting" to me) and the data that is "unimportant" (or "uninteresting" to me) may be very important to others, and I want to build consensus around my idea of what ABIF can be. After mulling over the discussions in several issues here (particularly issues #6 and #14 regarding the metadata format), it occurs to me that a core data model may be helpful.

Here's my take on a core data structure that ABIF files should resolve to, expressed as a partial JSON file (NOTE: this comment is subject to revision):

{
    "metadata":
    [
        {
            <key-1>: <value-1>,
            <key-2>: <value-2>,
            <key-3>: <value-3>,
            ...
            <key-n>: <value-n>
        }
    ],
    "candidates":
    [
        {
            <candidate-id-1>: <candidate-information-1>,
            <candidate-id-2>: <candidate-information-2>,
            <candidate-id-3>: <candidate-information-3>,
            ...
            <candidate-id-n>: <candidate-information-n>
        }
    ],
    "ballot_bundles":
    [
        {
            <ballot-bundle-id-1>: <ballot-bundle-1>,
            <ballot-bundle-id-2>: <ballot-bundle-2>,
            <ballot-bundle-id-3>: <ballot-bundle-3>,
            ...
            <ballot-bundle-id-n>: <ballot-bundle-n>
        }
    ]
}

Expressing this as JSON is tricky, because JSON dictionaries are unordered key-value pairs, and there's not a great way to stipulate "order matters!". Moreover, I would like to make sure it's possible to build the data structure above using a single-pass parser. That's going to have all sorts of really tricky implications. I think we can pull it off if we keep a shared data model in mind, but we're going to have to do things that make people who love beautiful context-free grammars (CFGs) cringe.
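To make the single-pass idea concrete, here's a rough Python sketch. The input syntax is a simplified stand-in (not the actual ABIF grammar), and it leans on the fact that Python dicts preserve insertion order, which sidesteps the "order matters!" problem that plain JSON objects have:

```python
# Single-pass construction of a data model shaped like the one above.
# The line syntax here is hypothetical: "key: value" metadata lines,
# "=ID:Name" candidate declarations, and "N:A>B>C" ballot bundles.

def parse_lines(lines):
    model = {"metadata": {}, "candidates": {}, "ballot_bundles": []}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("="):            # candidate declaration
            cand_id, _, name = line[1:].partition(":")
            model["candidates"][cand_id] = name
        elif line[0].isdigit():             # ballot bundle: count, then ranking
            count, _, ranking = line.partition(":")
            model["ballot_bundles"].append(
                {"count": int(count), "ranking": ranking.split(">")}
            )
        else:                               # metadata key-value pair
            key, _, value = line.partition(":")
            model["metadata"][key] = value.strip()
    return model

example = [
    "title: Favorite fruit",
    "=A:Apple",
    "=B:Banana",
    "12:A>B",
    "7:B>A",
]
print(parse_lines(example))
```

Each line is handled exactly once, in order, with no lookahead, which is the property a single-pass parser needs.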

brainbuz commented 3 years ago

Are all the layers needed? The key-value pairs for metadata and candidates don't need to be wrapped in an array, and the ballots can be left as an array without generated ids.

Assuming that there is an optional header, optionally one or more lists, then a data section, I think the data structure will remain fairly simple. Stick with the simplest data structure that can represent the data: in JSON, an object for each section, containing another object or array as appropriate to that section.

The most complex part is converting the ballot lines to a data structure; the abif2json utility does not need to be designed in the spec.
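For comparison, the flatter shape described above might look like this in Python (all field names and contents are illustrative, not from any spec):

```python
import json

# Flatter model: metadata and candidates as plain key-value objects
# (no wrapping arrays), ballots as a bare list with no generated ids.
flat_model = {
    "metadata": {"title": "Favorite fruit"},
    "candidates": {"A": "Apple", "B": "Banana"},
    "ballots": [
        {"count": 12, "ranking": ["A", "B"]},
        {"count": 7, "ranking": ["B", "A"]},
    ],
}

print(json.dumps(flat_model, indent=2))
```

Dropping the wrapping arrays removes one level of nesting per section without losing any information, since each section already has a single object or array inside it.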

robla commented 3 years ago

I think a data model is going to be needed for a proper test suite. I'm in the process of writing a Lark implementation of an ABIF validator now, which is going slower than I hoped because I'm learning how to use Lark at the same time I'm writing the validator. The validator is going to be based on my mental model of ABIF, which is best described by this issue.

My plan is to make my validator optionally output the data model, so that I can use the validator to build up a test suite for ABIF. In many ways, this is the serialization that @brainbuz was asking for in issue #4. I've come to realize that the specification efforts I was involved in many years ago centered around the data model, and many of the ABIF issues that have been filed have centered around differing data models. I still believe that the data structure used by the software implementing ABIF is not important for interoperability. However, the "model" starts implying aspects of the API used, and over time, the model will start becoming more important than the exact syntax.

Like I said, my Lark implementation is going more slowly than I hoped. I'm hoping to have something published very soon, though.
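Pending the Lark implementation, here's a rough sketch of what a line-oriented validator could check. The patterns below are guesses at a simplified ABIF-like syntax, not the actual grammar being defined:

```python
import re

# Hypothetical line patterns for a simplified ABIF-like syntax; the
# real grammar is what the Lark validator will eventually define.
PATTERNS = [
    re.compile(r"^#"),                 # comment line
    re.compile(r"^=\w+:.+$"),          # candidate declaration
    re.compile(r"^\d+:\w+(>\w+)*$"),   # ranked ballot bundle
    re.compile(r"^\w+:\s*.+$"),        # metadata key: value
]

def validate(text):
    """Return (line_number, line) pairs that match no known pattern."""
    errors = []
    for num, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue                   # blank lines are always fine
        if not any(p.match(line) for p in PATTERNS):
            errors.append((num, line))
    return errors

sample = "title: Favorite fruit\n=A:Apple\n12:A>B\n>>>garbage"
print(validate(sample))  # only the malformed last line is flagged
```

A regex-per-line check like this can't enforce cross-line constraints (e.g. that ballots only reference declared candidates), which is one reason a shared data model matters for a real test suite.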