Open YourFin opened 5 months ago
I believe this behavior is actually correct, and all Ion parsing implementations that I've tried with this input (ion-c, ion-rust, ion-java) behave the same way.
Note that this does not prevent you from encoding the text x'
using a long string. You simply need to escape the '
character that is intended to be part of the data, e.g. '''x\''''
.
A similar thing happens when using short strings, e.g. """
produces an error, but "\""
produces the text "
.
Ah, I forgot about quoted symbols.
If Ion did not allow for quoted symbols, I think there would be a stronger case for not tokenizing ‘’’
greedily. As it stands, though, that greatly complicates the “seven single quotes in a row” handling in the cases where a symbol is valid following a string, and the permissive interpretation would probably come with some nasty surprises.
Sorry for the spurious report! Glad to see I was wrong on this one.
Would you accept a contribution of seven single quotes in a row as bad test data? I couldn't find it there, and it would be nice to make this policy explicit. On the other hand, it would be a bit of an odd case since the "bad" test data would start with a valid string.
Yes, I think that would be a good test case to add to bad/
.
When using the
ion
command line tool (and by extension this library), my understanding is that a file containing:should be equivalent to
Instead, it reads ~as:
I'm submitting this from my work laptop (feel free to look me up internally), but I can supply the actual error message from my personal machine when I get home if that would be helpful. I also have not tested this with the non-steam parser; this came up as I was attempting to a validate an ion tree-sitter grammar.
The
x
isn't required to surface the bug -'''''''
also fails to read as"'"
, but thex
makes the example string easier to read.I suspect this is a result of greedily treating three single quotes in a long string as termination, despite that not necessarily being the case.
As far as I can tell, there isn't any reason that this process should be greedy. Should a bunch of single quotes show up in a row at the end of a long string, parsing them while allowing for this case is unambiguous (1).
In terms of streaming performance, I also cannot see a reason why allowing for such strings would cause a change in behavior. Long strings can be split across multiple sections, so a full string can't be emitted from a stream reader when one long-string section is terminated.
E.g.: if a process was reading a stream of data via an ion stream reader:
The long string starting in Packet 1 could not be processed fully processed until T3 anyways, due to the possibility of another long string section.
(1):
At which point we loop back to the first state should we see another
'