Python string RegEx does not match multiline strings

Lumpy-dev commented 3 months ago

Hello,

When I was porting the new changes made to unrpyc to my project, I had problems with the python string RegEx (because I had to port it from the re flavor to the ECMAScript one). While trying to port it, I remade some tests for this RegEx here and found that multiline strings were not matched correctly (see line 18 to 20 or image):

The output of the file is still valid (it is the same as in 01example.rpy). I don't really know how the line is split at the end of the string like this to make it work: ' """\n Called to reset the example code to the defaults.\n """ but I find it a bit odd the way we get to this.

NOTE 1: I am really not experienced with RegEx, I might be missing something here.

NOTE 2: If someone wants a close-to-complete ECMAScript version of this RegEx, it's just here.

madeddy commented 3 months ago

Hi. Two things come to mind:

maybe add to which code/func/module in unrpyc this refers (citing.. link)
afaik multiline strings have in py special rules regarding indent and newline preservation. Maybe this influences your results and i think there was a PEP for this.

Lumpy-dev commented 3 months ago

Here's the function I am refeering to: https://github.com/CensoredUsername/unrpyc/blob/79895a10b5b0d93fabc8757aa345981b6bfeb11b/decompiler/util.py#L390-L397 I want to add the fact that the comment talks about not parsing docstrings.

But this makes my think of more issues:

How can we differentiate docstrings and multiline strings? / How do we already do this? (if we do it at all)
Why is there a match against an empty string for a triple quote in the Python RegEx? If we just never try match this, it should be added in the comment above the RegEx.
Where is this PEP 257 Implementation done regarding multiline docstrings? Is it implicit somewhere? If so, it should also be added in comments

CensoredUsername commented 3 months ago

How can we differentiate docstrings and multiline strings? / How do we already do this? (if we do it at all)

You can't, because docstrings are literally just multiline strings. There's no reason to differentiate between them.

Why is there a match against an empty string for a triple quote in the Python RegEx? If we just never try match this, it should be added in the comment above the RegEx.

That's just the result of the regex not recognizing the mutliline string, and thus it falls back on matching the initial two quotes as an empty string.

The issue exists because you're not compiling the regex with the same options that unrpyc does. the python code ends up running that regex with re.DOTALL enabled, which means . also matches newlines. Here's it with that option enabled.

Where is this PEP 257 Implementation done regarding multiline docstrings? Is it implicit somewhere? If so, it should also be added in comments

Docstrings are literally just normal python strings.

Now there's also a general note here. This function isn't trying to be absolutely correct. It's trying to do what ren'py does here. For the purpose of checking that we aren't generating output that ren'py would fail on. If it fails to parse something that has become legal, we'll just put some parenthesis around it and be done with it (see simple_expression_guard).

(Now that function has had some recent changes, like accepting [urfURF] as allowed prefixes, but as I said before, false negatives are okay here, and we can fix that when we get there).

CensoredUsername / unrpyc

Python string RegEx does not match multiline strings #208