opeongo commented 5 years ago

This patch fixes a couple of tests that were failing in a Window environment. It replaces the hardcoded path and line separator characters with system properties.

codecov[bot] commented 5 years ago

Codecov Report

Merging #494 into merge-fork-beanshell2 will decrease coverage by 0.24%. The diff coverage is n/a.

@@                     Coverage Diff                     @@
##             merge-fork-beanshell2     #494      +/-   ##
===========================================================
- Coverage                    70.22%   69.97%   -0.25%     
- Complexity                    2624     2633       +9     
===========================================================
  Files                          103      105       +2     
  Lines                         8632     8917     +285     
  Branches                      1688     1757      +69     
===========================================================
+ Hits                          6062     6240     +178     
- Misses                        2192     2254      +62     
- Partials                       378      423      +45

Impacted Files	Coverage Δ	Complexity Δ
src/main/java/bsh/ReflectError.java	`33.33% <0%> (-33.34%)`	`1% <0%> (-1%)`
src/main/java/bsh/DelayedEvalBshMethod.java	`62.68% <0%> (-25.32%)`	`17% <0%> (-1%)`
src/main/java/bsh/Reflect.java	`70.27% <0%> (-9.44%)`	`144% <0%> (-43%)`
src/main/java/bsh/BSHPrimarySuffix.java	`77.02% <0%> (-4.8%)`	`56% <0%> (-1%)`
src/main/java/bsh/BlockNameSpace.java	`68.18% <0%> (-4.55%)`	`8% <0%> (-2%)`
src/main/java/bsh/BSHArrayInitializer.java	`94.05% <0%> (-3.97%)`	`45% <0%> (-2%)`
src/main/java/bsh/BshArray.java	`94.91% <0%> (-3.39%)`	`74% <0%> (-3%)`
src/main/java/bsh/commands/dir.java	`58.49% <0%> (-1.89%)`	`4% <0%> (ø)`
src/main/java/bsh/LHS.java	`65.4% <0%> (-1.47%)`	`35% <0%> (-1%)`
src/main/java/bsh/Types.java	`80.68% <0%> (-1.46%)`	`135% <0%> (+6%)`
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 812bd9b...d0328b2. Read the comment docs.

nickl- commented 5 years ago

Awesome! TYVM @opeongo

I need to get a go to windows environment configured. There are a few open windows issues that also need some TLC.

nickl- commented 5 years ago

I also saw some unicode issues when last I tried to build on windows, hoping this patch beanshell/beanshell@5a9a685b61081b6a125bf24e7598ea0388b24be5 resolves that too.

opeongo commented 5 years ago

I had a look at the windows issues. Most of them seem to be non-bugs, con't reproduce, or issues that have already bee fixed but not closed out. Would you like me to go through and categorize them?

nickl- commented 5 years ago

Would you like me to go through and categorize them? @opeongo Yes please!! Even if you just report your findings, would be awesome thanx.

nickl- commented 5 years ago

As for the line endings, we already carry System.getProperty("line.separator") as bsh.Interpreter.Console.systemLineSeparator

I'm wondering if we shouldn't fix this somewhere else, in other words not in the tests. For example if I had a script using "\n" as these tests do, and I run it under windows should we not let the script succeed with the correct line endings for windows?

What do you think?

opeongo commented 5 years ago

Well, this is a bit curious. I went to have a look at where the \\ line endings were coming from. It seems that the parser is just making the string from the characters it reads from the script file. I assume the original file just has \, so where are the extra coming from?

After poking around it appears that git performed automatic line ending conversion when I cloned your repository. In other words my test script automatically got changed to be subtly different from your test script, and this broke the test.

Knowing this, I don't think that my suggested patch is correct, because it depends on 'git' changing the test script. Git (and other tools like it) are the culprit here. What if the repository was checked out without using autocrlf? I suspect that this will be an issue that other people will stumble over.

I guess one option would be to change the Parser so that it converts line endings to system independent form when it reads the string in (e.g. for Windows change \\ to \, for classic MacOS change \ to \). This could be implemented with a

input.replace(System.getProperty("line.separator"),"\n");

inside of BSHLiteral#stringSetup. But I don't think this is correct either.

I don't think that there is a perfect solution to this problem, because this kind of automagic would thwart people from deliberately forming the line endings that they wanted, and even so there would likely be odd situations where scripts would break mysteriously.

My preference would be to not modify the input at all, but caveat emptor that the line endings in your script file will be the ones that are copied verbatim in to the long string. This is not a bug, and fixing it would cause more confusion.

As far as the test failing, the test could be changed to test alternate line endings and succeed if either one passed.

nickl- commented 5 years ago

Hmmm I don't think I quite follow with the git autocrlf and how that relates to this patch.

// Long strings
- out="\n";
+ out= System.getProperty("line.separator");
 for (x : 5)
-   out += x + "\n";
+   out += x + System.getProperty("line.separator");

Surely git had nothing to do with this... well I hope at least.

nickl- commented 5 years ago

Just stumbled on [System.lineSeparator()](https://docs.oracle.com/javase/10/docs/api/java/lang/System.html#lineSeparator()) since 1.7... now I know that too tyvm.

opeongo commented 5 years ago

Here I will explain the reasoning why git caused the test to fail on my system, and what could be done about it.

The problem

This is a long standing and well known issue for cross development between unix-like and windows systems.

I assume that you prepared the strings.bsh test script on a unix-like system. The line endings in the file suggest so. If on your unix-like system you open the file in a binary/hex editor (not a text editor) you will see that the byte codes for this section are

aout = """
0
1
2
3
4
5
""";

are composed of the byte sequence

61 6F 75 74 20 3D 20 22 22 22 0A 30 0A 31 0A 32 0A 33 0A 34 0A 35 0A 22 22 22 3B 0A

I am working on a Windows system. When I clone the repository I get the same packed deltas as you, but when I check out a branch to a working copy git by default will translate line endings in text files from 'canonical' format (\n) to Windows format (\r\n). Now on my Windows/git/autocrlf system that same section of code will have the bytes:

61 6F 75 74 20 3D 20 22 22 22 0D 0A 30 0D 0A 31 0D 0A 32 0D 0A 33 0D 0A 34 0D 0A 35 0D 0A 22 22 22 3B 0D 0A

Note that this sequence is 8 bytes longer, because there are 8 lines and an \r code (0d) has been added to each line ending.

The reason for doing this is because some (most?) text editors on Windows will insert \r\n line endings for new lines, and when save the file you will either get all lines endings with \r\n, or perhaps just the lines that you inserted with \r\n and the rest unchanged (the behaviour depends on the editor, there is no standard). This is the hellish legacy of Bill Gates. Git by default does the automatic conversion from canonical to system dependent on checkout and system dependent to canonical on checkin because usually it is the right thing to do. If it didn't then there might be a lot of thrashing between line ending styles each time the file is modified on a different system type, and this would make the change log unusable.

Getting back to the test failure, when checked out from git/autocrlf on a Windows system the test script will actually be different than the test script on a unix-like system: they will have different byte sequences and have different lengths.

When the parser tokenizes a long string it will read all bytes in between the opening and closing """. For the test in question a unix-like system would see 13 bytes, for a Windows/git/autocrlf system 20 bytes. When the test constructs the comparison string it is manually using a single \n for a line ending, and making a sequence that is 13 bytes long. This compares exactly to the sequence read from the unix-like system, but not to the Windows/git/autocrlf sequence.

So, what to do about this?

There are a two options I can see

Ignore this feature, it is not a bug

Is it a bug? I am not sure. The long string is capturing everything in between the delimiters. Perhaps the system dependent line endings are important for a particular use case: who can say for sure. What if the long string had explicit line endings:

aout="""
this\\r\\ncould\\r\\n\\happen\\r\\n
""";

If we wanted to allow people full control to make whatever long strings they wanted then we should just leave it as is. If this is the case then change the test so that the error goes away.

Sanitize the input

The other option is to sanitize the input, just like git-autocrlf: when the parser tokenizes a long string convert all of the \r\n sequences to \n. Sounds easy. This auto-magic would make the test failure go away.

But should this be done on Windows only where these mixups can occur, or should it be done on automatically on all platforms? And what about cases where different line ending sequences are required parts of the application data? Automatically sanitize the inputs may end up doing the wrong thing in some cases.

A third way

Actually I can think of a third option:

keep the current long string behaviour. The """ long string delimiter will copy all bytes unchanged.
add another long string type, using ''' as a delimiter. This type will convert line endings to canonical form.

Nah... that is getting too complicated for too little benefit.

After thinking about it my vote would go for the second option: automatically sanitize the inputs by changing all \r\n to \n. What is the common case for long strings anyway? Surely not to do exotic things with control characters. If there was a use case to embed different line endings this could be done using a more traditional means (StringBuilder) or with String.replace().

nickl- commented 5 years ago

I vote for option 1 "Ignore this feature, it is not a bug" the problem comes from changing the lite endings which is not something that should ever occur under normal circumstances. If anyone managed to change the line endings they should know how to change it back too.

The other issue which should be considered is what if you intended to have \r\n and we now go and remove it. Whomever caused the problem is the one responsible to fix it.

opeongo commented 5 years ago

I dug around a bit more and I found JEP 327 "Raw String Literals".

It looks like this JEP moving forward and will soon be in Java. This clarifies how the official Java will handle raw string literals. Here are a couple of examples of how this will work

aout = `
0
1
2
3
5
`;

a = `this is a "raw string
literal".  note that \ characters 
get no special treatment`;

b = ``here is how to embed a ` single backtick``;

c = ```here is how to embed a `` double backtick```;

tl;dr

no escape processing within raw literals: '\n' is interpreted to the two characters '\' and 'n'
automatic line ending sanitization: line endings with \\ are converted to \
the delimiter is a backtick (actually a matching variable number of backticks)
- If there are no backticks within the literal then a single backticks could be used as delimiters.
- If sets of single backticks are used within the literal then double backticks can be used as delimiters
- etc

Note that they have taken a stand to normalize line endings. This effectively solves the issue that we have been discussing here.

This JEP also touches on the issues that we have been discussing in #495.

I think a review of BeanShell handles string and character literals is required in light of this JEP.

nickl- commented 5 years ago

Agreed... investigate JEP 327

beanshell / beanshell

Add system independent properties for line and path separator #494

Codecov Report

The problem

So, what to do about this?

Ignore this feature, it is not a bug

Sanitize the input

A third way