KenKundert / nestedtext

Human readable and writable data interchange format
https://nestedtext.org
MIT License

Thoughts on a Sentinel Value to Denote File Begin / End? #7

Closed jamesdbowman closed 3 years ago

jamesdbowman commented 3 years ago

First of all, thanks for taking the time to publish this repo.

A common complaint about YAML-esque markup languages is that incomplete subsections of the file can be validly parsed, leading to data loss in some situations.

Would you consider extending nestedtext markup to include a sentinel value denoting the beginning and end of a file? I imagine the slight reduction in the initial usability of the markup language would be worth the gains in file integrity (for those users who choose not to implement their own validation / checksums).

KenKundert commented 3 years ago

Thanks for the suggestion. It seems like a good idea. Let me think about how I might do that.

kalekundert commented 3 years ago

Hmm, that's an interesting idea. Can you give some examples of how a file might get truncated? I haven't experienced this first-hand, but I think it's important to know how data can be lost in order to respond appropriately. In particular, would it be necessary to have sentinels at the beginning and end of a file, or would it be sufficient to just have a sentinel at the end? Similarly, if data can be lost from the middle of the file (and not just the ends), then having sentinels wouldn't really accomplish much: you'd still need a checksum to know if you got the whole file.

Another thing I'm thinking about is whether it would make sense to make the beginning/end sentinels optional by default, but required if the load() caller explicitly requests them with an argument, e.g. load(require_ends=True). The advantages of this are that it wouldn't break existing files (not that there are very many yet) and it wouldn't require extra markup for applications where file integrity isn't a concern. The disadvantages are that most people would probably not start using the sentinels until they'd been bitten by a truncation bug, and even then they'd have to go back and add sentinels to all of their existing files, which could be really painful.

A final comment is that we view nestedtext as having a single responsibility, which is to represent the structure of data. Just as we defer to other tools to deal with the validation of that data, it would be natural for us to defer to other tools to maintain file integrity, especially if there's not a one-size-fits-all solution to this.

jamesdbowman commented 3 years ago

Here's one example I could find of YAML truncation leading to a valid parse but invalid results.

That's a good point, though, about mid-file truncation or corruption being dangerous. Giving users false confidence in their use of sentinel values could lead to other types of file integrity errors slipping through the cracks.

As a fun thought exercise, the sentinel could also embed a file checksum, for optional use by the hyper-vigilant...

:BEGIN:
file:
 - contents
:END:{md5=595f44fec1e92a71d3e9e77456ba80d1}

But this is definitely too far away from the "defer to other tools" ideology. Feel free to close this bug if the recommendation is that users should defer to other tools to maintain file integrity if they so desire. Thanks for taking a look.
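
For illustration, here is a minimal sketch of how such an envelope could be written and checked outside the parser. The `:BEGIN:` / `:END:{md5=...}` syntax is only the hypothetical one from the example above, not something NestedText supports, and `wrap` and `check_envelope` are made-up names. It assumes the checksum covers exactly the lines between the two sentinels.

```python
import hashlib
import re

# Hypothetical trailer syntax from the example above; not part of NestedText.
END_RE = re.compile(r"^:END:\{md5=([0-9a-f]{32})\}$")


def wrap(body):
    """Surround body with the sentinels and embed an MD5 of the body."""
    if body and not body.endswith("\n"):
        body += "\n"
    digest = hashlib.md5(body.encode("utf-8")).hexdigest()
    return ":BEGIN:\n" + body + ":END:{md5=" + digest + "}\n"


def check_envelope(text):
    """Return the body if both sentinels and the checksum are intact.

    Raises ValueError if a sentinel is missing or the checksum differs.
    """
    lines = text.splitlines()
    if len(lines) < 2 or lines[0] != ":BEGIN:":
        raise ValueError("missing :BEGIN: sentinel")
    match = END_RE.match(lines[-1])
    if match is None:
        raise ValueError("missing or malformed :END: sentinel")
    # Rebuild the body exactly as wrap() hashed it: one newline per line.
    body = "".join(line + "\n" for line in lines[1:-1])
    if hashlib.md5(body.encode("utf-8")).hexdigest() != match.group(1):
        raise ValueError("checksum mismatch")
    return body
```

A file cut off at either end fails the sentinel check, and a file damaged in the middle fails the checksum, which covers the mid-file case raised earlier. MD5 is adequate here only for catching accidental corruption, not deliberate tampering.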

KenKundert commented 3 years ago

We are still noodling on this. One thought is to support an optional begin/end as a pair, where the begin and end can include a small amount of metadata, like the signature you suggested. Other metadata might be the version number for the NestedText file format and the name or source of the intended schema (this is the subject of another issue being discussed). The signature is interesting because one can use it to determine if the contents were hand modified, which could be used to trigger a recompile or rebuild. Of course one could always use md5sum to implement something similar outside NestedText, but the md5 signature computed by NestedText could easily exclude comments and blank lines.
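
As a rough sketch of what a signature that excludes comments and blank lines might look like (this is not an existing NestedText feature, and `content_digest` is a hypothetical name), one could hash only the lines that carry content. It assumes a comment is any line whose first non-whitespace character is `#`.

```python
import hashlib


def content_digest(text):
    """MD5 over the lines of text, skipping blank lines and '#' comment lines."""
    h = hashlib.md5()
    for line in text.splitlines():
        stripped = line.strip()
        # Skip lines that carry no data: blank lines and comments.
        if not stripped or stripped.startswith("#"):
            continue
        # Hash the line unchanged so indentation still counts.
        h.update(line.encode("utf-8") + b"\n")
    return h.hexdigest()
```

With this, adding or removing a comment leaves the digest unchanged, while editing a key, a value, or the indentation changes it.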

KenKundert commented 3 years ago

At this point we are trying to maintain the simplicity of the language and so have decided to hold off on this idea and see if the case for it strengthens. I will leave you with one thought though. You can implement this yourself in your application by simply inserting a start key and an end key. Your application can then check that they are both there. For example:

BEGIN:
file:
   - contents
END:

In this example, BEGIN and END have empty values, but you can use them for version numbers and integrity codes.
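
The application-side check could be sketched roughly as follows. `check_sentinels` is a hypothetical helper, and the sketch assumes `data` is the plain dict your application gets back from loading the file, with keys in the same order as in the file.

```python
def check_sentinels(data, begin="BEGIN", end="END"):
    """Return data without the sentinel keys, or raise ValueError.

    data is assumed to be an order-preserving dict loaded from the file.
    """
    keys = list(data)
    if not keys or keys[0] != begin:
        raise ValueError("first key is not " + repr(begin))
    if len(keys) < 2 or keys[-1] != end:
        raise ValueError("last key is not " + repr(end))
    # Hand the application only the real content.
    return {k: v for k, v in data.items() if k not in (begin, end)}


# The dict that the example above would be expected to load as.
loaded = {"BEGIN": "", "file": ["contents"], "END": ""}
content = check_sentinels(loaded)
```

Checking that the sentinels are the first and last keys, rather than merely present, is a choice made here so that a file truncated after `END:` was moved or duplicated is also rejected. The values of `BEGIN` and `END` remain available in `loaded` if you use them for version numbers or integrity codes.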