amazon-ion / ion-c

A C implementation of Amazon Ion.
https://amazon-ion.github.io/ion-docs/
Apache License 2.0
166 stars 43 forks source link

Stream parser fails to read long strings ending in a single quote #340

Open YourFin opened 5 months ago

YourFin commented 5 months ago

When using the ion command line tool (and by extension this library), my understanding is that a file containing:

'''x''''

should be equivalent to

"x'"

Instead, it reads ~as:

"x"
Syntax error: unexpected "'" before end of file

I'm submitting this from my work laptop (feel free to look me up internally), but I can supply the actual error message from my personal machine when I get home if that would be helpful. I also have not tested this with the non-steam parser; this came up as I was attempting to a validate an ion tree-sitter grammar.

The x isn't required to surface the bug - ''''''' also fails to read as "'", but the x makes the example string easier to read.


I suspect this is a result of greedily treating three single quotes in a long string as termination, despite that not necessarily being the case.

As far as I can tell, there isn't any reason that this process should be greedy. Should a bunch of single quotes show up in a row at the end of a long string, parsing them while allowing for this case is unambiguous (1).

In terms of streaming performance, I also cannot see a reason why allowing for such strings would cause a change in behavior. Long strings can be split across multiple sections, so a full string can't be emitted from a stream reader when one long-string section is terminated.

E.g.: if a process was reading a stream of data via an ion stream reader:

---- Packet 1, T1 ----
'''\
Some long string
'''
---- Packet 2, T2 = T1 + 10 seconds ----
'''\
Line 2 of a long string
'''
---- Packet 3, T3 = T2 + 10 seconds ----
123

The long string starting in Packet 1 could not be processed fully processed until T3 anyways, due to the possibility of another long string section.


(1):

'''
x'// Parser state: a superposition of:
// - "in the middle of a long string section, last two characters in the long string are 
//   single quotes"
//   String so far: "\nx'"
//   ^ Non ' will collapse to this branch
// - "In the middle of a long string section start token"
//   String so far: "\nx"
'''
x''// Parser state: a superposition of:
// - "in the middle of a long string section, last two characters in the long string are 
//   single quotes"
//   String so far: "\nx''"
//   ^ Non ' will collapse to this branch
// - "In the middle of a long string section start token"
//   String so far: "\nx"
'''
x'''// Parser state: Long string containing "\nx"
'''
x''''// Parser state: a superposition of:
// - "A long string section, the contents of which end with a single quote" 
//   String so far: "\nx'"
//   ^ Non ' will collapse to this branch
// - "One ' read in a ''' long string section start token"
//   String so far: "\nx"
'''
x'''''// Parser state: a superposition of:
// - "A long string section, the contents of which end with a single quote"
//   String so far: "\nx''"
//   ^ Non ' will collapse to this branch
// - "Two ' read in a ''' long string section start token"
//   String so far: "\nx"
'''
x''''''// Parser state: Just started a new long string section. String so far: "\nx"
'''
x'''''''// Parser state: a superposition of: 
// - Just started a new long string section. 
//   String so far: "\nx'"
//   ^ Non ' will collapse into this branch
// - One ' into a ''' long string end token. 
//   String so far: "\nx"
'''
x''''''''// Parser state: a superposition of: 
// - Just started a new long string section. 
//   String so far: "\nx''"
//   ^ Non ' will collapse into this branch
// - Two ' into a ''' long string end token. 
//   String so far: "\nx"
'''
x'''''''''// Parser state: Just started a new long string section. String so far: "\nx"

At which point we loop back to the first state should we see another '

tgregg commented 5 months ago

I believe this behavior is actually correct, and all Ion parsing implementations that I've tried with this input (ion-c, ion-rust, ion-java) behave the same way.

Note that this does not prevent you from encoding the text x' using a long string. You simply need to escape the ' character that is intended to be part of the data, e.g. '''x\''''.

A similar thing happens when using short strings, e.g. """ produces an error, but "\"" produces the text ".

YourFin commented 5 months ago

Ah, I forgot about quoted symbols.

If Ion did not allow for quoted symbols, I think there would be a stronger case for not tokenizing ‘’’ greedily. As it stands, though, that greatly complicates the “seven single quotes in a row” handling in the cases where a symbol is valid following a string, and the permissive interpretation would probably come with some nasty surprises.

Sorry for the spurious report! Glad to see I was wrong on this one.

Would you accept a contribution of seven single quotes in a row as bad test data? I couldn't find it there, and it would be nice to make this policy explicit. On the other hand, it would be a bit of an odd case since the "bad" test data would start with a valid string.

tgregg commented 5 months ago

Yes, I think that would be a good test case to add to bad/.