llChar optimization causing generic lexer error

Tonaie commented 1 year ago

My code uses llChar to convert numbers to characters outside the range the compiler can handle. When the optimizer folds the call into a string literal, this causes the error: invalid character '\002' in input stream.

Example: llChar(2) gets replaced with a string literal containing the raw character U+0002, which causes the error.

Sei-Lisa commented 1 year ago

I can't reproduce. It's working for me on Linux, and I can copy and paste the resulting script to SL, add llEscapeURL around the string, and it returns %02 as expected.
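As a rough illustration of the check described above, the following Python 3 sketch percent-encodes the character with code 2. It is only loosely analogous to llEscapeURL; the helper name `percent_escape` is hypothetical and is not part of the optimizer or of LSL.

```python
from urllib.parse import quote


def percent_escape(text):
    """Percent-encode every character outside the URL-unreserved set.

    Hypothetical helper; a loose Python stand-in for what llEscapeURL
    does in LSL, used here only to make a control character visible.
    """
    return quote(text, safe="")


# chr(2) is the same character that llChar(2) produces: U+0002.
control = chr(2)
print(repr(control))
print(percent_escape(control))
```

The point is only that the character is present in the string even though it is invisible when printed directly.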

If you're using Windows, you can try entering chcp 65001 before running the optimizer to see if that helps.

If that doesn't help, can you tell me your OS, Python version and command line options that you're passing to the optimizer?

Tonaie commented 1 year ago

I am using Windows 10, Python 2.7.16, and Notepad++ with encoding UTF-8. chcp 65001 had no effect.

Command run is python2 main.py -O +ShrinkNames _input.lsl > "_output.lsl"

Trying python2 main.py --bom _input.lsl | clip has the same issue: pasting into the LSL script editor does not work.

Including _output.lsl also has the same error.

I did a git pull before optimizing, so it is using the latest code.

Sei-Lisa commented 1 year ago

I might be able to give it a try on Windows 11 next Monday, not sure. Meanwhile, here are a couple suggestions:

Use the -o flag instead of redirection: python main.py -O shrinknames _input.lsl -o _output.lsl
The optimizer was recently updated to work with Python 3 too. Try changing Python to version 3, see if that helps.

Tonaie commented 1 year ago

Same problem with the -o flag. Python 3 gives the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 6825: character maps to <undefined>

Sei-Lisa commented 1 year ago

I see. Can you please turn Python exceptions on (flag -y) and post the traceback from the error?

Tonaie commented 1 year ago

Traceback (most recent call last):
  File "...\Pyoptimizer\main.py", line 782, in <module>
    ret = main(sys.argv)
  File "...\Pyoptimizer\main.py", line 603, in main
    script = f.read()
  File "...\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 328: character maps to <undefined>

Sei-Lisa commented 1 year ago

Okay, chcp 65001 might help with the Python 3 error. If that doesn't help, I will have to try on Windows myself.

Tonaie commented 1 year ago

> Okay, chcp 65001 might help with the Python 3 error. If that doesn't help, I will have to try on Windows myself.

Same problem

Sei-Lisa commented 1 year ago

After talking inworld, it turns out that the original error comes from the Firestorm preprocessor, not from the optimizer.

There's little that can be done about that. The LSL compiler is quite lenient with accepting extraneous characters, but Boost::Wave (the C preprocessor library used by Firestorm) complains. The only advice I can give is to disable the preprocessor when handling a source file containing such characters.
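To find out whether a source file contains characters of this kind before handing it to the preprocessor, a scan along the following lines might help. This is a minimal sketch, assuming that "such characters" means C0 control characters other than tab, LF and CR; the function name `find_control_chars` is hypothetical and nothing here comes from the optimizer or from Firestorm.

```python
def find_control_chars(text):
    """Return (line, column, code point) for each C0 control character.

    Tab, LF and CR are treated as ordinary whitespace and skipped.
    Line and column numbers are 1-based.
    """
    allowed = u"\t\n\r"
    hits = []
    for line_no, line in enumerate(text.split(u"\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) < 0x20 and ch not in allowed:
                hits.append((line_no, col_no, ord(ch)))
    return hits


# A script fragment with a raw U+0002 inside a string literal.
sample = u'string s = "\x02";\ninteger n = 1;\n'
print(find_control_chars(sample))
```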

As for the UnicodeDecodeError exception, that's a different issue, but I can't reproduce it. There isn't enough information to know how that 0x90 byte appeared in the source file or, more importantly, why Python is using code page 1252 instead of 65001 (UTF-8) even after running chcp 65001. So I'm closing this issue, because there's nothing that can be done with this information.

Feel free to reopen it if you can put your finger on what causes the problem and there's something that can be done to fix it.

Sei-Lisa commented 1 year ago

I tried in Windows 11 and the results are disheartening. The Python 2 build for Windows available for download from python.org is compiled with UTF-16 Unicode strings instead of UTF-32, and its support for UTF-16 is really poor. Python 3 has some severe issues of its own: it doesn't obey the console's encoding (i.e. the code page you set with chcp), and it uses the default Windows encoding when reading files, which wreaks havoc with applications that read UTF-8 files, like the optimizer (cf. https://discuss.python.org/t/pep-686-make-utf-8-mode-default/14435). That makes keeping the code polyglot (Python 2 and Python 3) more troublesome. Apparently some version of Python, maybe 3.12 or so, will default to UTF-8 for reading files, but we're not there yet, and requiring 3.12 would be a very strict requirement anyway, considering I'm still using Python 3.5 on Linux.
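The file-reading problem can be sketched as follows. The byte 0x90 has no mapping in Python's cp1252 codec, yet it is a perfectly valid continuation byte in UTF-8, so a UTF-8 file decoded with the Windows default can fail exactly as in the traceback above. Passing the encoding explicitly through io.open works on both Python 2 and Python 3. This is a sketch under assumptions, not the optimizer's actual code: the helper `read_script` is hypothetical, and the character used (U+0410) is just one example whose UTF-8 encoding contains 0x90, not necessarily what was in the reporter's file.

```python
import io
import os
import tempfile


def read_script(path):
    """Read a source file as UTF-8 regardless of the platform default.

    Hypothetical helper; io.open accepts `encoding` on Python 2 and 3.
    """
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()


# U+0410 encodes to the two bytes D0 90 in UTF-8.
raw = u"\u0410".encode("utf-8")

# Decoding those bytes as cp1252 fails on the 0x90 byte.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print(exc)

# Reading the same bytes from a file with an explicit encoding succeeds.
fd, path = tempfile.mkstemp(suffix=".lsl")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"// " + raw + b"\n")
    text = read_script(path)
finally:
    os.remove(path)

print(repr(text))
```

On Python 3.7 and later, running the interpreter with -X utf8 or setting the environment variable PYTHONUTF8=1 enables UTF-8 mode, which changes the default file encoding; whether that is enough for the optimizer on Windows has not been verified here.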

On top of that, there are line ending issues because I never tested with Windows' crazy two-character line endings. The unit tests in particular give errors when they are checked out with CRLF line endings.
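One way a test comparison could be made insensitive to how the files were checked out is to normalize line endings before comparing. The following is a minimal sketch under that assumption; `normalize_newlines` is a hypothetical helper and this is not how the project's unit tests are known to work.

```python
def normalize_newlines(data):
    """Convert CRLF line endings in a bytes object to LF."""
    return data.replace(b"\r\n", b"\n")


# Expected output as stored with LF endings, and the same content
# as it would look after a checkout with CRLF endings.
expected = b"default\n{\n}\n"
actual_crlf = b"default\r\n{\r\n}\r\n"

print(actual_crlf == expected)
print(normalize_newlines(actual_crlf) == expected)
```

A byte-for-byte comparison fails on the CRLF copy, while the normalized comparison succeeds.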

To sum up, it seems there's still quite some work to do to properly support Windows' encoding and line ending quirks. So far I've pushed a fix for lslloadlib which, I believe, solves the issues it had on Windows when reading builtins.txt and fndata.txt.

Sei-Lisa / LSL-PyOptimizer

llChar optimization causing generic lexer error #20