Canonical form - Githubissues

yoonsikp commented 6 months ago

Any idea if we can have a defined canonical form, similar to "canonical JSON" and protobufs? It might run contrary to NestedText's design, which allows a variable amount of whitespace per line and indentation level, but it would be good for hashing and diffs.

KenKundert commented 6 months ago

Can you explain why this is needed?

yoonsikp commented 6 months ago

Let's say you have a long, unsorted list of emails and you want to see if it is identical to an earlier list. You could load both into memory, sort them, and then compare. Or you could convert both to a canonical form and then do a simple diff on the command line.
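The in-memory approach described above can be sketched as follows. This is only an illustration: it assumes each list is plain text with one email address per line, and the function name `same_emails` is hypothetical rather than part of NestedText.

```python
def same_emails(old_text, new_text):
    """Return True if both texts hold the same email addresses, ignoring order."""
    def normalize(text):
        # Strip surrounding whitespace, drop blank lines, then sort so that
        # ordering differences disappear before the comparison.
        return sorted(line.strip() for line in text.splitlines() if line.strip())

    return normalize(old_text) == normalize(new_text)
```

Duplicates are kept by the sort, so a list that repeats an address is treated as different from one that lists it once.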

KenKundert commented 6 months ago

In your example, you assume that you have a previous copy of the file, correct? If so, you can always do a simple compare to see if the file has changed in any way. You don't even need the original file, you can just keep the hash of the file.
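A minimal sketch of keeping only a hash of the file, using Python's standard `hashlib`. The helper name `file_digest` and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```

Storing the digest from an earlier run and comparing it with a fresh one detects any byte-level change, including edits to comments, blank lines, or indentation.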

If instead you wanted to check whether the actual data has changed, it seems like the most reliable way of doing so would be to extract the data and compare it against the earlier version. Comparing canonical representations would likely generate false positives, for example, when someone changes a comment or a blank line.

Comparing the data would not detect changes in the comments, but I think the way to address that is to allow access to the comments from the application, which could then be compared. That is something I expect to allow in the future when I get some time.
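Comparing the extracted data rather than the file text might look like the sketch below. It assumes both files have already been loaded into Python structures built from dicts, lists, and strings, and it uses a sorted-key JSON serialization from the standard library as a stand-in for a stable form. The helper name `data_digest` is hypothetical, and none of this is a NestedText feature.

```python
import hashlib
import json


def data_digest(data):
    """Hash loaded data so that key order and source formatting do not matter."""
    # sort_keys and fixed separators give a stable serialization of the data;
    # list order is preserved because it is part of the data itself.
    canonical = json.dumps(
        data, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same data, with the dictionary keys given in a different order.
old = {"People": [{"name": "Flinderson Dorf", "notes": ["first note", "second note"]}]}
new = {"People": [{"notes": ["first note", "second note"], "name": "Flinderson Dorf"}]}
print(data_digest(old) == data_digest(new))  # prints True
```

Because comments never reach the loaded data, this comparison ignores them, which is the limitation noted above.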

Supporting a standard canonical representation only seems important when the files are:

automatically generated by multiple applications, all of which are compliant with the standard,
not expected to be hand edited, and
very large, so that a full extraction is undesirable.

That is not really the use model for NestedText, which tries to encourage direct user interaction with the data and is largely intended for smaller datasets.

AndydeCleyre commented 4 months ago

For what it's worth, you can load NestedText and then output the data as new NestedText to get a consistent format. I don't know if you want the data re-ordered or sorted, though, which this on its own wouldn't do. It also wouldn't preserve comments.

But for example, using the NestedTextTo CLI tools I made around this library:

messy1.nt:

People:
  -
    name: Flinderson Dorf
    notes:
      - first note
      - second note
  -
    name: Juminy Biscuit
    notes:
        - a
        - b
        - c

messy2.nt:

People:
  -
    name: Flinderson Dorf
    notes:
      -
        > first note
      - second note
  -
    name: Juminy Biscuit
    # This is a comment!
    notes:
      [a, b, c]

$ diff -u messy1.nt messy2.nt

--- messy1.nt   2024-04-18 14:06:22.606868898 -0400
+++ messy2.nt   2024-04-18 14:10:39.430900922 -0400
@@ -2,11 +2,11 @@
   -
     name: Flinderson Dorf
     notes:
-      - first note
+      -
+        > first note
       - second note
   -
     name: Juminy Biscuit
+    # This is a comment!
     notes:
-        - a
-        - b
-        - c
+      [a, b, c]

Using Zsh:

$ diff -u =(nt2json messy1.nt | json2nt) =(nt2json messy2.nt | json2nt)  # no output, results are identical
$ nt2json messy1.nt | json2nt

People:
  -
    name: Flinderson Dorf
    notes:
      - first note
      - second note
  -
    name: Juminy Biscuit
    notes:
      - a
      - b
      - c

$ nt2json messy2.nt | json2nt

People:
  -
    name: Flinderson Dorf
    notes:
      - first note
      - second note
  -
    name: Juminy Biscuit
    notes:
      - a
      - b
      - c

KenKundert / nestedtext

Canonical form #44