Kampfkarren / full-moon

A lossless Lua 5.1 parser
Mozilla Public License 2.0
387 stars 51 forks source link

Luau string interpolation #235

Closed Kampfkarren closed 1 year ago

Kampfkarren commented 2 years ago

https://github.com/Roblox/luau/pull/614

sno2 commented 2 years ago

Hello, should such a feature be hidden under a Cargo feature (e.g. "rfc-string-interpolation") or can this be added on main?

Kampfkarren commented 2 years ago

It should just be added under the roblox feature flag, which is the misnomer feature flag for Luau. String interpolation is merged into Luau proper now as of https://github.com/Roblox/luau/pull/614.

sno2 commented 2 years ago

Ok, thanks. I am going to try to implement this.

Kampfkarren commented 2 years ago

Good luck! You can feel free to reference the code I wrote for it in Luau (written in C++).

Make sure to include the test cases I gave to them as well, both for passing and failure.

sno2 commented 2 years ago

Good luck! You can feel free to reference the code I wrote for it in Luau (written in C++).

Thanks, I'll make sure to check it for the structures.

Make sure to include the test cases I gave to them as well, both for passing and failure.

👍

sno2 commented 2 years ago

Hello, I am back again. Anyways, I have some solemn news. Basically, I came up with an approach that works well, but it utilizes logos extras to attach abstract data to the tokenizer which imposes some limitations which require handling.

https://docs.rs/logos/latest/logos/trait.Logos.html#associatedtype.Extras

The problem is, we need to store a count of the open braces within template interpolations at a given moment. But, we also need to store that count for every template literal and the template literals nested within that template literal. Therefore, we need some sort of Vec<usize> to hold the counts for each level. However, Tokenizer implements Copy, and Vec<usize> does not. So, it's impossible to get some sort of dynamic allocation in the extras without unsafe code which would definitely be unsound.

You could also use a [usize; 8] and just assume the user would not nest more than 8 template literals, but that would also expose users using this library to an extremely easy security vulnerability. Not sure the route to take from here.

Kampfkarren commented 2 years ago

Will investigate further, CC @Rerumu if they have any ideas

sno2 commented 2 years ago

Will investigate further, CC @Rerumu if they have any ideas

Ok, thanks. I've created a draft PR with what I attempted in #249

Kampfkarren commented 2 years ago

Do you mean Atom isn't Copy anymore? Does it need to be?

sno2 commented 2 years ago

No, see the FIXME in the draft PR. We need to dynamically grow a vector in the extras struct somehow but we can't because it needs Copy.

Rerumu commented 2 years ago

It should be possible to implement interpolation without extras or messing with containers I think. Depending on how we want to lex it there are multiple approaches.

There is the lazy approach of lexing backticks as a normal string from end to end. The handler function can deal with the nesting by setting up a stack and whatnot without having to store it in the lexer itself. After this, a secondary lexer can be used at a later stage to extract the inner tokens from the string.

If we want an eager approach instead, we can make a single backtick pattern and have whatever code is handling it switch over to an appropriate secondary lexer upon encountering that "signal".

Also of note is that morphing between lexers in logos is practically free: https://docs.rs/logos/latest/logos/struct.Lexer.html#method.morph

sno2 commented 2 years ago

Thank you for your insight,

There is the lazy approach of lexing backticks as a normal string from end to end. The handler function can deal with the nesting by setting up a stack and whatnot without having to store it in the lexer itself. After this, a secondary lexer can be used at a later stage to extract the inner tokens from the string.

How would you handle nesting in that case? Wouldn't you still not be able to detect if there is a template literal nested within a template literal? Also note that the Lua syntax does not have the dollar sign of JS-style template literals (${).

If we want an eager approach instead, we can make a single backtick pattern and have whatever code is handling it switch over to an appropriate secondary lexer upon encountering that "signal".

Also of note is that morphing between lexers in logos is practically free: https://docs.rs/logos/latest/logos/struct.Lexer.html#method.morph

Alright, I was hoping to stray from modifying the codebase too much. I'll try this approach when I get some time.

Rerumu commented 2 years ago

For the first case you'd handle the balancing within the handler function. You could probably just run the second lexer within that handler to help skip along to the end of the string. In hindsight I guess this would end up being the same behavior as my other suggestion, it's just up to whether you want to lex the entire string at once or incrementally.