Closed yoonsikp closed 4 months ago
Can you explain why this is needed?
Let's say you have a long, unsorted list of emails and you want to see whether it is identical to an earlier list. You could load it into memory, sort it, then compare. Or you could convert it to a canonical form and then do a simple diff on the command line.
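A minimal sketch of the "canonical form, then compare" idea for the email list. The `canonical_lines` helper and the sample addresses are hypothetical, and the canonical form chosen here (trimmed, sorted, one entry per line) is just one reasonable assumption, not anything NestedText defines:

```python
def canonical_lines(emails):
    """Return a canonical text form: trimmed, blank entries dropped, sorted, one per line."""
    cleaned = [e.strip() for e in emails if e.strip()]
    return "\n".join(sorted(cleaned)) + "\n"


# Hypothetical sample data: same addresses, different order and stray whitespace.
old = ["carol@example.com", "alice@example.com", "bob@example.com"]
new = ["bob@example.com ", "alice@example.com", "carol@example.com"]

# Once both lists are in canonical form, a plain string comparison
# (or a command-line diff of the two texts) is enough.
same = canonical_lines(old) == canonical_lines(new)
print(same)
```

Writing each canonical text to a file would let an ordinary `diff` do the comparison.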
In your example, you assume that you have a previous copy of the file, correct? If so, you can always do a simple compare to see if the file has changed in any way. You don't even need the original file; you can just keep the hash of the file.
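A rough sketch of the keep-only-the-hash approach, using Python's standard `hashlib`. The `file_digest` helper and the sample contents are hypothetical. It works on bytes held in memory; for a real file you would pass the bytes read from disk:

```python
import hashlib


def file_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical file contents: the second differs only by an added comment line.
original = b"name: Flinderson Dorf\n"
with_comment = b"# This is a comment!\nname: Flinderson Dorf\n"

stored = file_digest(original)  # all you need to keep from the earlier version

print(file_digest(original) == stored)      # unchanged bytes give the same digest
print(file_digest(with_comment) == stored)  # any byte-level change gives a different digest
```

Because the digest covers every byte, it reports a change for comments and whitespace as well as for data.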
If instead you wanted to check whether the actual data has changed, it seems like the most reliable way of doing so would be to extract the data and compare it against the earlier version. Comparing canonical representations would likely generate false positives, for example, when someone changes a comment or a blank line.
Comparing the data would not detect changes in the comments, but I think the way to address that is to allow access to the comments from the application, which could then be compared. That is something I expect to allow in the future when I get some time.
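A hedged sketch of comparing the extracted data rather than the file text. It assumes the NestedText has already been loaded into plain dicts, lists, and strings, and it fingerprints that data through a sorted-key JSON serialization. The `data_digest` helper is hypothetical, and the JSON step is only a stand-in for a canonical form, not something NestedText specifies:

```python
import hashlib
import json


def data_digest(data) -> str:
    """Hash a deterministic JSON serialization of already-loaded data."""
    text = json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical loaded data; the two dicts hold the same values with keys in a different order.
earlier = {
    "People": [
        {"name": "Flinderson Dorf", "notes": ["first note", "second note"]},
        {"name": "Juminy Biscuit", "notes": ["a", "b", "c"]},
    ]
}
current = {
    "People": [
        {"notes": ["first note", "second note"], "name": "Flinderson Dorf"},
        {"notes": ["a", "b", "c"], "name": "Juminy Biscuit"},
    ]
}

print(earlier == current)                            # direct comparison of the data
print(data_digest(earlier) == data_digest(current))  # same result using stored digests only
```

Comments and layout never reach the loaded data, so they cannot affect either comparison. List order is still significant here.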
Supporting a standard canonical representation only seems important when the file is only :
For what it's worth, you can load NestedText and then output the data as new NestedText to get a consistent format. I don't know if you want the data re-ordered/sorted, though, which this on its own wouldn't do. It also wouldn't preserve comments.
But for example, using the NestedTextTo CLI tools I made around this library:
messy1.nt:
People:
-
name: Flinderson Dorf
notes:
- first note
- second note
-
name: Juminy Biscuit
notes:
- a
- b
- c
messy2.nt:
People:
-
name: Flinderson Dorf
notes:
-
> first note
- second note
-
name: Juminy Biscuit
# This is a comment!
notes:
[a, b, c]
$ diff -u messy1.nt messy2.nt
--- messy1.nt 2024-04-18 14:06:22.606868898 -0400
+++ messy2.nt 2024-04-18 14:10:39.430900922 -0400
@@ -2,11 +2,11 @@
-
name: Flinderson Dorf
notes:
- - first note
+ -
+ > first note
- second note
-
name: Juminy Biscuit
+ # This is a comment!
notes:
- - a
- - b
- - c
+ [a, b, c]
Using Zsh:
$ diff -u =(nt2json messy1.nt | json2nt) =(nt2json messy2.nt | json2nt) # no output, results are identical
$ nt2json messy1.nt | json2nt
People:
-
name: Flinderson Dorf
notes:
- first note
- second note
-
name: Juminy Biscuit
notes:
- a
- b
- c
$ nt2json messy2.nt | json2nt
People:
-
name: Flinderson Dorf
notes:
- first note
- second note
-
name: Juminy Biscuit
notes:
- a
- b
- c
Any idea if we can have a defined canonical form, similar to "canonical JSON" and protobufs? It might be contrary to NestedText's design feature of allowing a variable amount of white space per line/indentation, but it would be good for hashing and diffs.