[releng] Windows-only test failure : com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.

| | |
| --- | --- |
| Bugzilla Link | 538771 |
| Status | NEW |
| Importance | P3 normal |
| Reported | Sep 07, 2018 04:26 EDT |
| Modified | Sep 07, 2018 12:44 EDT |
| Reporter | Ed Willink |

Description

While running Tycho builds interactively on Windows to pin down Bug 538600, the OCL build showed one test failure. It occurs on Nightly and Stable builds and does not occur on Jenkins (on Linux). It clearly has nothing to do with Bug 538600, and it is unclear how long it has been a problem.

Running org.eclipse.ocl.examples.test.xtext.AllXtextTests\
10219 ERROR StandaloneProjectMap$MapToFirstConflictHandlerWithLog - Conflicting access to 'http://www.eclipse.org/emf/2002/Ecore' already accessed as 'platform:/plugin/org.eclipse.emf.ecore/model/Ecore.ecore'\
testCompleteOCLRoundTrip_UML skipped since 'platform:/resource/UML-2.5/XMI-5-Jan-2012/Semanticed UML.ocl' is missing.\
Tests run: 686, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 278.226 sec <<< FAILURE! - in org.eclipse.ocl.examples.test.xtext.AllXtextTests\
testPivot_oclstdlib_oclstdlib (org.eclipse.ocl.examples.test.xtext.PivotTests) Time elapsed: 5.391 sec <<< ERROR!\
org.eclipse.emf.ecore.resource.impl.ResourceSetImpl$1DiagnosticWrappedException: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.\
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:701)\
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:435)\
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1895)\
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanLiteral(XMLEntityScanner.java:1187)\
at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanAttributeValue(XMLScanner.java:987)\
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanAttribute(XMLDocumentFragmentScannerImpl.java:1548)\
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanStartElement(XMLDocumentFragmentScannerImpl.java:1315)\
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2784)\
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)\
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)\
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)\
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)\
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)\
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)\
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)\
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)\
at org.eclipse.emf.ecore.xmi.impl.XMLLoadImpl.load(XMLLoadImpl.java:175)\
at org.eclipse.emf.ecore.xmi.impl.XMLResourceImpl.doLoad(XMLResourceImpl.java:261)\
at org.eclipse.emf.ecore.resource.impl.ResourceImpl.load(ResourceImpl.java:1563)\
at org.eclipse.emf.ecore.resource.impl.ResourceImpl.load(ResourceImpl.java:1342)\
at org.eclipse.emf.ecore.resource.impl.ResourceSetImpl.demandLoad(ResourceSetImpl.java:259)\
at org.eclipse.emf.ecore.resource.impl.ResourceSetImpl.demandLoadHelper(ResourceSetImpl.java:274)\
at org.eclipse.emf.ecore.resource.impl.ResourceSetImpl.getResource(ResourceSetImpl.java:406)\
at org.eclipse.ocl.examples.xtext.tests.XtextTestCase.assertPivotIsValid(XtextTestCase.java:295)\
at org.eclipse.ocl.examples.test.xtext.PivotTests.doPivotTestOCLstdlib(PivotTests.java:259)\
at org.eclipse.ocl.examples.test.xtext.PivotTests.testPivot_oclstdlib_oclstdlib(PivotTests.java:361)

Results :

Tests in error:\
PivotTests.testPivot_oclstdlib_oclstdlib:361->doPivotTestOCLstdlib:259->XtextTestCase.assertPivotIsValid:295 » DiagnosticWrapped

Tests run: 686, Failures: 0, Errors: 1, Skipped: 0

By Ed Willink on Sep 07, 2018 05:43

Running Tycho under debug shows that the problem occurs at the Unicode quotes in the "word ‘result’ is" comment in /_OCL_PivotTests__testPivot_oclstdlib_oclstdlib/oclstdlib.oclas. A hex dump shows correct 0x2018/0x2019 coding in the file, but the 0xe2, 0x80, 0x98 UTF-8 tri-byte has been corrupted to 0xe2, 0x80, 0x3F, which is indeed invalid.

Ah! Looking at the properties of /_OCL_PivotTests__testPivot_oclstdlib_oclstdlib rather than /org.eclipse.ocl.examples.xtext.tests/models/oclstdlib/oclstdlib.oclas shows that we are in a Cp1252/Windows project. While the file's encoding may be deduced as UTF-8 from its content, that may not necessarily match the Linux behaviour.
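As a quick cross-check of the byte values quoted above, the following is a minimal sketch (the class and method names are illustrative, not part of the OCL code base). It prints the UTF-8 encoding of the two quote characters and asks a strict decoder whether the corrupted 0xe2, 0x80, 0x3F sequence is well formed. It only confirms why the parser rejects the bytes; it does not try to reproduce how the corruption arose.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative helper, not taken from the OCL sources.
class Utf8QuoteCheck {
    // Render bytes as lower-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    // Strict check: report malformed input instead of silently replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+2018 and U+2019 are the left and right single quotation marks.
        System.out.println(hex("\u2018".getBytes(StandardCharsets.UTF_8))); // e2 80 98
        System.out.println(hex("\u2019".getBytes(StandardCharsets.UTF_8))); // e2 80 99

        // 0x3f is ASCII '?', which is not a UTF-8 continuation byte (10xxxxxx).
        byte[] corrupted = { (byte) 0xe2, (byte) 0x80, (byte) 0x3f };
        System.out.println(isValidUtf8(corrupted)); // false
    }
}
```

The third byte of a three-byte UTF-8 sequence must lie in the range 0x80 to 0xBF, so 0x3F in that position matches the "Invalid byte 3 of 3-byte UTF-8 sequence" message.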

This failure may have existed since test projects were introduced for Tycho.

By Ed Willink on Sep 07, 2018 11:19

Adding the UTF-8, Windows-LF files to the test projects makes no difference, although it makes subsequent interactive use of the test projects a little easier.

Strangely, and possibly a clue or a confounding issue, the debugger insists on showing the Java 5 version of UTF8Reader.class even though the Java 8 version has consistent line numbering.

Debugging UTF8Reader.read(ch[], offset, length) more carefully, the ch[] contains the correctly decoded text, and it is a repeat call with the same buffer starting at offset 0 that is causing the problem.

A bit higher up the stack, XMLEntityScanner.scanLiteral warns about the short-term validity of buffers, so we might look for a change in some parser code. However, the problem code all seems to be part of Java 8, and an OCL Stable build of the "6.4.0RC4" tag does not show the problem despite using the same Java 8 and the same host Eclipse workspace.

Examining where the breakage occurred, it is all a bit simpler.

commit 48deb5dab10499b3da5550952390a589f2241dd8 (M0.65) ok\
commit 9cb5db3e5cd57b12afcd03a56a40ef823a2b0d68\
The new test testLoad_Bug535712_ocl fails with an HTTP 405 demonstrating Bug 535712\
commit 7e0e55545bca39f6c334c91ba8b1a2b7d3932e5b\
Fixes Bug 535712 but gives a FNFE on the oclstdlib input\
commit b3e2f0c1780d0bb7a13a7f463fdc9f81d0fce2ec\
Fixes the FNFE but we now have the undetected Windows/Linux maven-surefire inconsistency

We 'just' need to see whether the inconsistency is in the XMLParser, EMF config thereof, or Pivot/Test Harness config thereof.

By Ed Willink on Sep 07, 2018 12:44

Oops. From the FileWriter Javadoc:

Convenience class for writing character files. The constructors of this
class assume that the default character encoding and the default byte-buffer
size are acceptable. To specify these values yourself, construct an
OutputStreamWriter on a FileOutputStream.

JUnitStandaloneTestProject.getOutputFile(@NonNull String testFilePath, @NonNull InputStream inputStream)

copies from an InputStream, correctly using an InputStreamReader but without specifying an encoding for the FileWriter so we get a default.

Replacing

Writer writer = new FileWriter(...);

with

Writer writer = new OutputStreamWriter(new FileOutputStream(...), "UTF-8");

solves the problem.
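A minimal sketch of the corrected copy, assuming the input stream carries UTF-8 text: the names Utf8Copy, copyAsUtf8 and roundTrip are hypothetical and this is not the actual body of JUnitStandaloneTestProject.getOutputFile. Both the reader and the writer name their charset explicitly, so the platform default (Cp1252 on the Windows setup described here) is never consulted.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical helper illustrating the fix; not the OCL implementation.
class Utf8Copy {
    // Copy character data, decoding and encoding as UTF-8 on every platform.
    static void copyAsUtf8(InputStream in, File outputFile) throws IOException {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
             Writer writer = new OutputStreamWriter(
                 new FileOutputStream(outputFile), StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    // Write the input through copyAsUtf8 to a temp file and return the file's bytes.
    static byte[] roundTrip(byte[] input) {
        try {
            File tmp = File.createTempFile("utf8copy", ".txt");
            try {
                copyAsUtf8(new ByteArrayInputStream(input), tmp);
                return Files.readAllBytes(tmp.toPath());
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = "word \u2018result\u2019 is".getBytes(StandardCharsets.UTF_8);
        // The copied file should be byte-for-byte identical to the input.
        System.out.println(Arrays.equals(input, roundTrip(input))); // true
    }
}
```

Passing a Charset constant rather than the string "UTF-8" avoids the checked UnsupportedEncodingException; either form selects the same encoding.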

Policy. For the test files all our test projects are UTF-8 so imposing UTF-8 is not wrong.

Reviewing all "new FileWriter(" calls.

Numerous script output files, e.g. GenerateLaTeXForASModel generating a "*.tex" file. There is a very strong expectation that all normal output will be simple ASCII, but there might be arbitrary comments. If we were really clever we could analyze the container project and specify a weird charset. Since we generate to 'our' project, we 'know' that it is "UTF-8". It is therefore better to always impose UTF-8 for generated files.

When does it matter?

a) if we edit a *.ocl file with an Xtext editor we should respect the user's preference - Xtext's problem.

b) if we code generate e.g. a *.java file, we should respect the user's preference - ouch.

c) if we save as e.g. *.oclas, we should respect the user's preference - ouch.

This is similar to the new-line policy problem, for which EMF sometimes tries to inherit something, but not consistently. Imposing UTF-8 and Unix-NL on all generators seems much closer to always right and certainly avoids OS / user-workspace dependencies.

Long term, replacing new FileWriter by new UTF8FileWriter would be nice, but we cannot really introduce a new class at RC2. This is only an interactive test bug, we do not need to fix it for 2018-09. Add the new class in 2018-12.
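One possible shape for such a class, as a sketch only: the name UTF8FileWriter comes from the comment above, but the constructors and the demo class below are assumptions about what it might look like, not code that exists in the repository.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical drop-in for FileWriter that always encodes as UTF-8.
class UTF8FileWriter extends OutputStreamWriter {
    UTF8FileWriter(File file) throws FileNotFoundException {
        super(new FileOutputStream(file), StandardCharsets.UTF_8);
    }

    UTF8FileWriter(String fileName) throws FileNotFoundException {
        super(new FileOutputStream(fileName), StandardCharsets.UTF_8);
    }
}

// Illustrative usage; the file prefix and suffix are arbitrary.
class UTF8FileWriterDemo {
    // Write text through UTF8FileWriter and return the bytes that reached disk.
    static byte[] writeAndReadBack(String text) {
        try {
            File tmp = File.createTempFile("utf8writer", ".tex");
            try {
                try (Writer writer = new UTF8FileWriter(tmp)) {
                    writer.write(text);
                }
                return Files.readAllBytes(tmp.toPath());
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeAndReadBack("\u2018");
        System.out.println(bytes.length); // 3
    }
}
```

With a class of this shape, each reviewed "new FileWriter(" call site would change only the class name, which keeps the sweep mechanical.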

eclipse-ocl / org.eclipse.ocl

[releng] Windows-only test failure : com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence. #2000
