dedis / matchertext

Work-in-progress paper and experimental code on matchertext embeddable syntax discipline

Paper feedback: Can string literals in various languages be extended? #1

Open andychu opened 1 year ago

andychu commented 1 year ago

Very interesting idea and great paper. I've been working on similar "data languages" as complements to https://www.oilshell.org/

I wrote a shell script that I think demonstrates a practical issue with Section 4.2: C-like Host Languages. That is, almost no languages give syntax errors for the proposed \[] or \m[] (the blog post says \m[]).

So adding matchertext in the proposed way would technically be a breaking change. Some languages might have an evolution process for minor changes, but I highly doubt a language like JavaScript or C could do this.
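As a rough illustration of the kind of check described here (this is not the linked shell script, only a sketch), the Python snippet below asks two parsers whether they reject a string literal containing \[] or \m[]. The helper names `json_rejects` and `python_rejects` are made up for this example. It assumes a Python version in which an unknown escape in a string literal produces a warning rather than an error.

```python
import json
import warnings


def json_rejects(literal: str) -> bool:
    """Return True if the JSON parser reports an error for this literal."""
    try:
        json.loads(literal)
    except json.JSONDecodeError:
        return True
    return False


def python_rejects(literal: str) -> bool:
    """Return True if the Python compiler raises SyntaxError for this literal."""
    with warnings.catch_warnings():
        # Assumption: unknown escapes only trigger a warning; silence it so
        # that only a hard syntax error counts as a rejection.
        warnings.simplefilter("ignore")
        try:
            compile(literal, "<literal>", "eval")
        except SyntaxError:
            return True
    return False


# Raw strings so the backslash reaches the parser under test unchanged.
for lit in (r'"\[]"', r'"\m[]"'):
    print(lit, "JSON rejects:", json_rejects(lit),
          "Python rejects:", python_rejects(lit))
```

A parser that already rejects the escape leaves room to assign it a new meaning later; one that silently accepts it may already have programs depending on the current behavior.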

Summary of results:

https://github.com/oilshell/oil/blob/master/demo/matchertext.sh

I'll paste the output of the script in the next comment.

andychu commented 1 year ago

So you can see the syntax errors from JSON and Ninja, but not from any others. Awk gives a warning.

Also, the languages that accept the escape don't output the same strings -- sometimes it's [], and sometimes it's \[].
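The two behaviors can be modeled with a small sketch. The function `unescape_unknown` and its `keep_backslash` flag are hypothetical names for this example, and the function deliberately treats every backslash escape as unknown, which no real language does.

```python
def unescape_unknown(body: str, keep_backslash: bool) -> str:
    """Decode a string body, treating every backslash escape as unknown.

    keep_backslash=True models the outputs that print \\[] (backslash kept).
    keep_backslash=False models the outputs that print [] (backslash dropped).
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            out.append("\\" + nxt if keep_backslash else nxt)
            i += 2  # skip the backslash and the character after it
        else:
            out.append(ch)
            i += 1
    return "".join(out)


for body in (r"\[]", r"\m[]"):
    print(unescape_unknown(body, keep_backslash=True),
          unescape_unknown(body, keep_backslash=False))
```

Because existing programs may rely on either result, giving \[ a new meaning would change the output of code that compiles today.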

Are the string literals in this language M-extensible?
We simply test them for syntax errors after a special char like \

This is also relevant to YSTR, where we add \xff and \u{012345} escapes

Traceback (most recent call last):
  File "_tmp/foo.c", line 2, in <module>
    json.loads('"\[]"')      
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 2 (char 1)

[JSON] YES

---

ninja: error: _tmp/z.ninja:6: bad $-escape (literal $ must be written as $$)
build _tmp/$[ : copy _tmp/ninja-in
           ^ near here

[Ninja] YES

---

_tmp/z.mk:5: warning: overriding recipe for target '_tmp/make-out'
_tmp/z.mk:2: warning: ignoring old recipe for target '_tmp/make-out'
cp  _tmp/make-in _tmp/make-out

[GNU Make] NO, expected syntax error

---

_tmp/foo.c: In function ‘main’:
_tmp/foo.c:4:10: warning: unknown escape sequence: '\m'
   printf("\m[]\n");
          ^~~~~~~~

[C] NO, expected syntax error

---

Running C
[]
m[]

---

\[]
\m[]

[Python] NO, expected syntax error

---

\[]
\m[]

[Shell] NO, expected syntax error

awk: cmd. line:3: warning: escape sequence `\[' treated as plain `['
awk: cmd. line:4: warning: escape sequence `\m' treated as plain `m'
[]
m[]

[Awk] NO, expected syntax error

---

[]
m[]

[JavaScript] NO, expected syntax error

---

andychu commented 1 year ago

Here is a related data language I've been working on:

https://www.oilshell.org/release/latest/doc/qsn.html

The problems it solves and its relation to shell are laid out in the doc. It's basically a cleaned-up version of C string literals (based on Rust) that is more byte-string and UTF-8 centric than JSON, which you need for Unix.


QSN is implemented in Oil now. But as I started using it more, what annoyed me is that it's not backward compatible with JSON.

JSON is a "narrow waist" with a lot of inertia.

So I'm working on a second iteration ("YSTR"), which is simple and small, but solves many problems. You could say the tagline is "one (cross-language) string literal syntax to rule them all"

Summary:


The justification for having matchertext in YSTR is basically as a "raw string", as you mention. It can prevent the "leaning toothpick" problem for:

Also, I'd say that, by analogy with s-expressions, it can represent recursive structure with concatenation. If you have to add levels of \\ then you're not just concatenating!
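To make the "levels of \\" point concrete, here is a small sketch that uses JSON string encoding as a stand-in for any backslash-escaping host syntax. The function name `backslash_counts` is made up for this example. It nests a string inside a JSON string literal several times and counts the backslashes after each level.

```python
import json
from typing import List


def backslash_counts(text: str, levels: int) -> List[int]:
    """Embed text in a JSON string literal repeatedly; count backslashes."""
    counts = []
    for _ in range(levels):
        # Each level escapes every quote and every existing backslash.
        text = json.dumps(text)
        counts.append(text.count("\\"))
    return counts


print(backslash_counts('say "hi"', 3))  # prints [2, 8, 22]
```

Each level of embedding rewrites the inner text rather than just wrapping it, which is what a matcher-delimited embedding is meant to avoid.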


In shell you would use 'single quoted' strings to avoid \, but they can't represent single quotes.

Shell has 7 or 8 types of string literal to get around that! I posted some comments on today's matchertext lobste.rs thread about what I was thinking (before I read the paper):

https://lobste.rs/s/9ttq0x/matchertext_escape_route_from_language
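The single-quote limitation mentioned above can be seen with Python's standard `shlex.quote`, which has to close the single-quoted string, add a double-quoted quote character, and reopen it. This is only an illustration using Python's helper, not anything taken from Oil or the paper.

```python
import shlex

word = "it's"

# A single-quoted shell string cannot contain a single quote, so the quote
# is spliced in as a separate double-quoted piece.
quoted = shlex.quote(word)
print(quoted)  # prints 'it'"'"'s'

# Splitting with POSIX shell rules recovers the original word.
print(shlex.split(quoted) == [word])  # prints True
```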

(I can also suggest some improvements in terminology / presentation if interested, since it appears many people misunderstood it -- I think it's a great idea, though as the paper mentions, there are problems to be ironed out)