bible-technology / scripture-burrito

Scripture Burrito Schema & Docs 🌯
http://docs.burrito.bible/
MIT License

Should We Support Bi-directional XML<->JSON Conversion (custom or off the shelf) #56

Closed jag3773 closed 4 years ago

jag3773 commented 5 years ago

After #27, we need to discuss whether we want to support bi-directional XML<->JSON conversion (custom or off the shelf).

mvahowe commented 5 years ago

I think I can do this for valid JSON. For invalid JSON it's impossible unless the JSON retains all XML properties, e.g. whether every field was an attribute or enclosed text, and no one will want to use that kind of JSON.


rdb commented 5 years ago

I think if the question is "should it be possible to convert the JSON back to XML without knowing that it's SB?", then I'd say no.

However, I think it's conceivable that whatever JSON format we produce, someone with knowledge of what SB XML looks like will be able to take such a JSON file and produce an XML that is more-or-less functionally or semantically equivalent to the JSON.

So I think maybe we need to define this a little more precisely.

mvahowe commented 5 years ago

I have round-tripping XML <=> JSON in several places in DBL, and I noticed that I have stub code for this in the metadata class. It looks to me like two days' work, and I can't imagine why it wouldn't work on valid data. The fun begins if, e.g., the JSON has a list where there should be an object. Whatever I do with that will be "wrong"; the most likely result will be a missing branch of the DOM tree (via a caught exception).

This is probably OK for editing JSON within a validating framework, but hand-hacking of JSON is going to end badly. I think that's a data-structures reality rather than an implementation issue.

rdb commented 5 years ago

I hadn't thought we needed to care at all about the possibility of invalid JSON. JSON is by design a machine-readable/writable format rather than a human-writable one, so it seems unlikely that people will be hand-hacking these files. I had assumed so far that we would be providing a JSON schema for machine validation.

jonathanrobie commented 5 years ago

I would really prefer to focus on the Minimum Required to Declare Victory and on the serialized format contained in our files. I would also prefer to specify each format once.

Is there an interoperability need behind this? If we say that a format for a given file is, say, JSON or XML, will specifying a conversion between JSON and XML improve interoperability in an important scenario? If not, I would say we do not need to specify this. Interoperability is the driver for standards.

For cost-benefit analysis, interoperability is the benefit. The cost is the time and effort needed to agree on a way to do conversion, add it to the specification, get feedback on it, and maintain it over time. Because formats can be such a bikeshed, this cost could be significant enough to make it harder to do something more important.

mvahowe commented 5 years ago

DBL needs this, so I'm going to finish it, and I'm willing to share. If the committee wants to ignore it for procedural reasons, that's OK, but sooner or later most implementations will need a JSON-like representation internally, and it's relatively hard to do that without good knowledge of the schema.

rdb commented 5 years ago

Yes, to be clear, the Registry will have a JSON representation one way or the other (and will quite hopefully be completely oblivious to the XML representation). It would just be better if it used the same JSON representation as everyone else.

jonathanrobie commented 5 years ago

If the registry is using JSON regardless, do we need XML too? If so, why?


rdb commented 5 years ago

I'm not entirely sure what you're asking, but I'll answer what I think you're asking: the Registry doesn't need to understand XML because the Registry will never be looking at an actual sburrito. It is only concerned with the metadata that will eventually go into a burrito. (Even if you asked me to have the Registry provide the metadata API in XML format--which I truly hope you do not--it would still be working with it in a JSON-like object structure internally, and storing it in BSON format in the database.)

If I understand correctly, it's the DBL uploader that will pull the metadata from the Registry and fill the values into an XML sburrito metadata file. I expect that just putting the values from the JSON API into an XML format should be fairly straightforward, as long as the JSON and XML don't disagree on the actual values to put in the fields. This is what I mean when I say that to some extent, there's no question in my mind that it has to be possible to take a JSON format and produce a matching XML, given knowledge of the burrito specs.

jonathanrobie commented 5 years ago

Suppose the XML sburrito metadata file were a JSON burrito metadata file instead. For DBL, at least, it would mean we don't need the mapping.

I assume JSON files are easy for most apps to read. Are there apps that need this file to be in XML? If so, why?

If there is a real need for this to be in XML inside the burrito, and DBL needs it to be JSON, then DBL will need to map. Are there other applications that would use this same mapping to interoperate with DBL? If so, it's worth standardizing. Or at least putting in a non-normative appendix.


mvahowe commented 5 years ago

We need XML because JSON validation sucks. We need JSON because it's the easiest way to get clean serialization via nested dictionaries or hash tables. That's what we agreed in Orlando. That's what I'm going to deliver next week.

You can't round-trip JSON => SB XML => JSON without loss. If you disagree, take a Twitter JSON feed and turn it into valid SB XML without losing any data, and I'll happily admit I'm wrong.

mvahowe commented 5 years ago

I now have XML => JSON working, with bespoke JSON for the generic parts and (fairly) non-lossy JSON for flavorDetails. (We need this for the x-flavors, since we can't make any assumptions about the semantics of that element, so it provides a good illustration of the bespoke/non-lossy approach.) The non-lossy form is:

"flavorDetails": {
      "contentType": {
        "text": "pdf"
      },
      "pod": {
        "text": "true"
      },
      "width": {
        "text": "140mm"
      },
      "height": {
        "text": "210mm"
      },
      "scale": {
        "text": "100%"
      },
      "orientation": {
        "text": "portrait"
      },
      "color": {
        "text": "CMYK"
      },
      "pageCount": {
        "text": "193"
      },
      "edgeSpace": {
        "children": {
          "top": {
            "text": "5mm"
          },
          "bottom": {
            "text": "9mm"
          },
          "inside": {
            "text": "13mm"
          },
          "outside": {
            "text": "9mm"
          }
        },
        "text": "\n            "
      },
      "thumbnail": {
        "children": {
          "width": {
            "text": "240"
          },
          "height": {
            "text": "360"
          },
          "color": {
            "text": "RGB"
          }
        },
        "text": "\n            "
      },
      "fonts": {
        "children": {
          "font": {
            "attributes": {
              "type": "OpenType"
            },
            "text": "Segoe UI Symbol Regular"
          }
        },
        "text": "\n            "
      }
    }

(This approach won't work for mixed content, like HTML, and I think that's a feature, not a bug. It also won't preserve the order of elements, and it won't preserve multiple elements at the same level with the same tag name. All of this is possible, but it means even more boilerplate, like, well, all the generic XML-to-JSON converters.)
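A generic converter of roughly this shape fits in a few lines of standard-library Python. This is a sketch of the idea, not the actual tooling (and unlike the example above it discards whitespace-only text); it also makes the losses concrete: duplicate tag names overwrite each other, and sibling order lives only in dict insertion order.

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Generic XML -> dict in the 'non-lossy' shape shown above."""
    node = {}
    if elem.attrib:
        node["attributes"] = dict(elem.attrib)
    if elem.text and elem.text.strip():
        node["text"] = elem.text.strip()
    children = {}
    for child in elem:
        # Duplicate tag names overwrite each other: the last sibling wins.
        children[child.tag] = element_to_dict(child)
    if children:
        node["children"] = children
    return node

print(element_to_dict(ET.fromstring(
    '<fonts><font type="OpenType">Segoe UI Symbol Regular</font></fonts>')))
# {'children': {'font': {'attributes': {'type': 'OpenType'},
#                        'text': 'Segoe UI Symbol Regular'}}}
```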

Imagine we want the top edgeSpace in some application code, and that this approach was applied to the entire document. The JSON "path" for this would be something like

document["children"]["type"]["children"]["flavorDetails"]["children"]["edgeSpace"]["children"]["top"]["text"]

This is simply awful. It's much worse than anything DBL (or, I think, PTReg) inflicts on developers. It's bad enough that everyone will be tempted to roll their own JS, which is what we have now, and is a different kind of awful.

The bespoke version would be something like

document["type"]["flavorDetails"]["edgeSpace"]["top"]

I rest my case.

mvahowe commented 5 years ago

The bespoke version of the above JSON looks like this. (As well as being shorter and more readable, it lets the fields be of the right type.)

"flavorDetails": {
      "contentType": "pdf",
      "pod": true,
      "width": "140mm",
      "height": "210mm",
      "scale": "100%",
      "orientation": "portrait",
      "color": "CMYK",
      "pageCount": 193,
      "edgeSpace": {
        "top": "5mm",
        "bottom": "9mm",
        "inside": "13mm",
        "outside": "9mm"
      },
      "thumbnail": {
        "color": "RGB",
        "width": 240,
        "height": 360
      },
      "fonts": {
        "Times New Roman Regular": "OpenType",
        "Times New Roman Bold": "OpenType",
        "Times New Roman Italic": "OpenType",
        "Segoe UI Symbol Regular": "OpenType"
      }
    }
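The collapse from the non-lossy shape to the bespoke one can be sketched as follows. This is my assumption about the approach, not the actual converter: nodes carrying only enclosed text become plain values and `"children"` wrappers flatten away, while a real mapping would also use schema knowledge to coerce types (`pageCount` to int, `pod` to bool) and to handle attributes, which this sketch ignores.

```python
def to_bespoke(node):
    """Collapse the generic {"text"/"children"} shape into bespoke JSON."""
    # Drop whitespace-only text left over from XML pretty-printing.
    meaningful = {k: v for k, v in node.items()
                  if k != "text" or v.strip()}
    if set(meaningful) <= {"text"}:
        return meaningful.get("text")  # leaf: just the enclosed text
    return {name: to_bespoke(child)
            for name, child in meaningful.get("children", {}).items()}

non_lossy = {"children": {"edgeSpace": {
    "children": {"top": {"text": "5mm"}, "bottom": {"text": "9mm"}},
    "text": "\n            "}}}

print(to_bespoke(non_lossy))
# {'edgeSpace': {'top': '5mm', 'bottom': '9mm'}}
```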
jonathanrobie commented 5 years ago

@mvahowe Could you please post the XML file that maps into these examples?

mvahowe commented 5 years ago

@jonathanrobie https://docs.burrito.bible/en/latest/appendix_example_scripture_print_document_pdf.html

mvahowe commented 5 years ago

@jonathanrobie Full, bespoke JSON for all current metadata documents at https://docs.burrito.bible/en/develop/appendix_json_example_documents.html

mvahowe commented 5 years ago

For reference, the original Scripture Burrito announcement that we drafted in Orlando, and which is currently in the repo's readme, says

The proposed format is based on the forthcoming DBL Metadata 2.3, which already offers many of the desired features. SB Metadata 0.1 has an XML and a JSON expression. Content servers may store SB metadata in either format and should allow metadata input and output for any burrito in either format.

I still think that's viable, with the proviso that metadata in either format needs to be valid. If we have to pick one, I think that, for burrito transport, it has to be XML, because servers and clients should validate input and output. I.e., if we force everyone to use JSON, we encourage everyone either not to validate content at all, or to turn the JSON into XML on both the client and the server.

rdb commented 5 years ago

The Registry will validate the JSON either way (as it currently already validates all of its own JSON expressions of the metadata using its own schemas). Will we make a JSON schema in some ubiquitous format available, or will everyone who wants to validate JSON be expected to roll their own?

I'm not really sure why you think using JSON encourages not validating the content. If people weren't going to validate the JSON, they weren't going to validate XML either--in fact, I'm still not validating XML in many places in the Registry because I never managed to get XML validation to actually work in Node.js.

mvahowe commented 5 years ago

@rdb I was under the impression that there was no "ubiquitous format" for JSON validation, but I'm willing to be told otherwise. Regardless, maintaining the same schema independently for JSON and XML seems like a recipe for Edge Case Hell.

rdb commented 5 years ago

For the record, JSON Schema is the ubiquitous format you're likely to encounter any time you search for a JSON validation library or an online JSON schema validator. There are libraries for a wide range of languages, and it's even used in a Khronos specification.

Here is a stub for a JSON schema I quickly slapped together that validates some sections of the JSON example in the docs. Start reading at metadata.schema.json at the bottom: https://gist.github.com/rdb/7664e9a51b75320621999b04c6e766a2

To validate, put it into any on-line JSON schema validator or try this:

python -m pip install jsonschema
python -m jsonschema -i example.json metadata.schema.json
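The same check can be done programmatically with the `jsonschema` package installed above. The schema below is a toy stand-in I made up for illustration, not the gist's actual metadata.schema.json:

```python
from jsonschema import validate, ValidationError

# Hypothetical mini-schema, loosely modelled on the edgeSpace example.
schema = {
    "type": "object",
    "properties": {
        "edgeSpace": {
            "type": "object",
            "properties": {"top": {"type": "string"}},
            "required": ["top"],
        }
    },
    "required": ["edgeSpace"],
}

try:
    validate({"edgeSpace": {"top": "5mm"}}, schema)
    print("valid")
except ValidationError as e:
    print("invalid:", e.message)
```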

As far as maintaining the same schemas, as long as there are people working with the XML representation and there are people working with the JSON representation, someone out there will be maintaining a JSON schema. I'd personally rather it be the case that multiple people aren't doing the same work independently, but I can maintain my own JSON schemas myself if it comes to that.

jonathanrobie commented 5 years ago

Regardless, maintaining the same schema independently for JSON and XML seems like a recipe for Edge Case Hell.

This is my concern.

mvahowe commented 4 years ago

@jag3773 At this point I think this conversation is moot, so we can close this issue?