eclipse-ocl / org.eclipse.ocl

Eclipse Public License 2.0
0 stars 0 forks source link

[tests] Intermittent Xtext test non-failure failure #1948

Closed eclipse-ocl-bot closed 2 hours ago

eclipse-ocl-bot commented 2 hours ago

| --- | --- | | Bugzilla Link | 529797 | | Status | RESOLVED FIXED | | Importance | P3 normal | | Reported | Jan 14, 2018 06:19 EDT | | Modified | Apr 03, 2018 14:33 EDT | | Depends on | 529928, 529929 | | Blocks | 526252 | | See also | 529886 | | Reporter | Ed Willink |

Description

Since adding the OCL Nature, the Tycho Xtext tests often fail. All tests pass but the overall verdict is a failure with a -1073740771 exit code, which is probably 0xc0020001 User Callback Exception after application Close.

Initial investigation identified a "The Plug-in Development Environment requires the PlatformAdmin service to operate. Please install the compatibility fragment 'org.eclipse.osgi.compatibility.state'." message in the target/work/data/.metadata/.log. This outstanding problem, Bug 526252, was fixed but no change.

Suspicion: some worker thread is failing to shut down in a tidy fashion inhibiting an overall test success. The new OCL Builder/MultiValidationJob is the obvious suspect, but no tests should actually cause them to start....

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 14, 2018 06:44

Introducing a dummy final test conforms that no MultiValidationJob was created and after a 10 second idle loop allowing the UI to thrash, the end result is still intermittent (ok once, failed once).

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 14, 2018 07:51

Using the 10 second idle to interact with the workspace to put up a Progress View, it is clear that there is no unexpected mega-UML validation or similar that remains active.

Performing a normal JUnit plugin launch shows that the workspace exits with unsaved changes because of the many editor tests that just stop, but if this is the problem it should be a consistent fail.

Repeating:\ Nightly local launch is flaky, usually fails\ Stable local launch is 'solid'; four successive successes

Need to isolate the confounding dependency

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 14, 2018 11:15

(In reply to Ed Willink from comment #2)

Need to isolate the confounding dependency

Platform photon/2017092210004/I-builds - no difference\ EMF - photon/201709221000/nightly - no difference\ UML2 - consistently M0\ Xtext - photon/201709221000 / nightly - repeated failures with nightly

Looks like an Xtext issue. Trimming all non-Xtext tests and the build succeeds. How can a different test group influence the outcome?

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 14, 2018 13:51

Eliminating just the optional org.eclipse.ocl.examples.uml25 test plugin is sufficient to make the test works.

When the OCL nature is enabled, org.eclipse.ocl.examples.uml25 has 62 errors and can take about 10 seconds to OCL validate.

6 of the errors are in UML.ecore that Xtext might look at without authorization.

*.ocl should no longer be analyzed by Xtext.

*.uml should never be analyzed by Xtext.

Why is Xtext 2.14 worse?

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 15, 2018 10:39

Trimmin, we can eliminate all non-XText test activations, and the maven activations and all classic/IA plugins, and nearly all the XText-tests.

Once we keep just the FileNewWizardTest as the 'only' intersting UI test we get a stuck "Do you want to convert to an OCL project" and the following smelly stack trace

Running org.eclipse.ocl.examples.test.xtext.AllXtextTests\ 0 [main] DEBUG eclipse.xtext.ui.internal.Activator - Starting Xtext UI Core bundle.\ 2164 [main] DEBUG eclipse.xtext.ui.editor.XtextEditor - Creating Xtext Editor. Instance: [org.eclipse.ocl.xtext.oclinecore.ui.OCLinEcoreEditor@5e572b08]\ 2342 [main] DEBUG eclipse.xtext.ui.editor.XtextEditor - init:org.eclipse.ui.part.FileEditorInput(/test-project/Test.oclinecore)\ 2342 [main] DEBUG eclipse.xtext.ui.editor.XtextEditor - doSetInput:org.eclipse.ui.part.FileEditorInput(/test-project/Test.oclinecore)\ 2342 [main] DEBUG eclipse.xtext.ui.editor.XtextEditor - Editor instance is [org.eclipse.ocl.xtext.oclinecore.ui.OCLinEcoreEditor@5e572b08]\ 2608 [main] DEBUG r.antlr.AbstractInternalAntlrParser - Parsing took: 47 ms\ 2608 [main] DEBUG linking.impl.AbstractCleaningLinker - beforeModelLinked took: 0ms\ 2623 [main] DEBUG linking.impl.AbstractCleaningLinker - doLinkModel took: 15ms\ 2858 [main] DEBUG se.xtext.util.OnChangeEvictingCache - Clear 1 cache entries for resource Test.oclinecore after 0 hits and 1 misses (quota: 0%)\ 2905 [main] DEBUG se.xtext.util.OnChangeEvictingCache - Clear 1 cache entries for resource Test.oclinecore after 0 hits and 1 misses (quota: 0%)\ 2905 [main] DEBUG se.xtext.util.OnChangeEvictingCache - Clear 1 cache entries for resource Test.oclinecore after 0 hits and 1 misses (quota: 0%)\ 2920 [main] DEBUG se.xtext.util.OnChangeEvictingCache - Clear 1 cache entries for resource Test.oclinecore after 1 hits and 1 misses (quota: 50%)\ 2936 [main] DEBUG se.xtext.util.OnChangeEvictingCache - Clear 1 cache entries for resource Test.oclinecore after 0 hits and 1 misses (quota: 0%)\ 2936 [main] DEBUG se.xtext.util.OnChangeEvictingCache - Clear 1 cache entries for resource Test.oclinecore after 0 hits and 1 misses (quota: 0%)\ 2936 [main] DEBUG se.xtext.util.OnChangeEvictingCache - Clear 1 cache entries for resource Test.oclinecore after 0 hits and 1 misses (quota: 0%)\ 2967 [main] DEBUG linking.impl.AbstractCleaningLinker - afterModelLinked took: 344ms\ 2998 [main] DEBUG eclipse.xtext.ui.editor.XtextEditor - Creating Xtext source viewer.\ 3389 [main] DEBUG xtext.ui.editor.model.XtextDocument - read - main\ 3411 [Worker-4] DEBUG xtext.ui.editor.model.XtextDocument - read - Worker-4\ 3411 [main] DEBUG xtext.ui.editor.model.XtextDocument - read - main\ 3426 [Worker-6] DEBUG xtext.ui.editor.model.XtextDocument - read - Worker-6\ 3426 [Worker-6] DEBUG se.utilities.PivotResourceValidator - Syntax check OK! Resource: platform:/resource/test-project/Test.oclinecore

Why are there Test.oclinecore cache entries and why are they missing?

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 15, 2018 11:24

(In reply to Ed Willink from comment #5)

Once we keep just the FileNewWizardTest as the 'only' intersting UI test we get a stuck "Do you want to convert to an OCL project" and the following smelly stack trace

Fixing the dialog by adding the nature during the test does not change the stack tarce.

The OnChangeEvictingCache is a mandatory overhead of the LazyLinker inherited by CS2ASLinker. Need to understand what Xtext transactions are to eliminate it / use a NullImpl.

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 15, 2018 18:09

With just the test_OCLstdlib_NewFileCreation to execute, the Tycho run tends to fails. Going in wuth the debugger suggests that the main thread is stalled waiting for the systemBundle to stop.

Exceptions such as NPE, IAE, ISE, AE do not seem to occur.

Somtimes a ValidationJob reports\ PivotResourceValidator - Syntax check FAIL! Resource: platform:/resource/test-project/Test.oclstdlib\ which shouldn't happen. The OCLNature/MultiValidationJob makes the ValidationJob obsolete.

But after a forced Eclipse restart after a cannot-delete lockup everything seems much better! Generally many of the trims to just one test can be reverted provided Eclipse is clean. ? Is some failure causing a stale Java thread to then cause everything to fail thereafter ?

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 16, 2018 04:33

Very confusing, trimming to a bad test (DebuggerTests) then expanding falls foul of intermittency.

Problem only shows on Xtext 2.14 => Xtext 2.14 does something nasty\ Problem seems to require org.eclipse.ocl.examples.uml2 => suggests something nasty may involve building/validating bad models\ Problem seems fairly solid with DebuggerTests => a build probably wakes Xtext up to behave badly\ Intermittemcy seems to be compounded by the possibility that a preceding failure leaves a thread locking up access to some test resource.

Try again. Expand AllXtexts to as many tests as possible that run solidly, after a restart with no other Java tasks active. Then each test that cannot be added is something that provokes Xtext 2.14. Select the simplest, instrument and debug.

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 16, 2018 08:23

(In reply to Ed Willink from comment #8)

Expand AllXtexts to as many tests as possible that run solidly

Proceding slowly after about 10 steps all are solid. The final step adding DebuggerTests is the problem. Examining the thread dump with VisualVM during the 60 second 'final' test does not show anything obviously bad. There is a "Press Finish" for the FileNewWizardTest. Perhaps Xtext 2.14 has made validation so slow that "Finish" is pressed too soon?

Runnimg just the DebuggerTest(s), oops, there is a "Convert to OCL" dialog. This could lock up. Eliminating it makes no difference. Running just DebuggerTests on Xtext 2.13 also stalls - maybe a different issue but needs fixing to make progress.

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 16, 2018 10:10

(In reply to Ed Willink from comment #9)

DebuggerTests

Oops this was added on 4-May-2014 in commented out form and has NEVER been uncommented. It is therefore not surprising that it doesn't work. Bug 529886 raised.

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 16, 2018 11:07

(In reply to Ed Willink from comment #8)

Intermittemcy seems to be compounded by the possibility that a preceding failure leaves a thread locking up access to some test resource.

After testing for a while with Xtext 2.13, similar phenomena occurred. It is therefore far from clear that there is any relevant difference between Xtext 2.13 and 2.14. All observations could be the result of a Tycho failure fouling all subsequent local Tycho runs until Eclipse is restarted. Individual failures may be the result of bad dialog pop-ups, dirty projects, ...

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 16, 2018 13:56

Aargh! There is something hysterical.

After settling on the solution, suddenly it fails. Checking out a couple of commits that previously worked this morning and they fail. Not even restarting Windows fixes it. Reverting even earlier and it's ok. Onward again to the failure and it's still ok.

If something is using file/commit dates to selectively rebuild class files hysteresis might be possible, but Tycho is a full re-build. If a GIT checkout doesn't fetch all the files, that could explain it. (Currently using EGIT 4.9.2 of 201712040753.)

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 17, 2018 05:02

(In reply to Ed Willink from comment #12)

If something is using file/commit dates to selectively rebuild class files hysteresis might be possible, but Tycho is a full re-build.

A cherry-pick can result in a negative transition on author-date, but actual file dates seem to steadily advance to match actual event time.

The hysteresis defeats me. Bug 529928 and Bug 529929 raised.

In order to make progress, perhaps we can identify a subset of the Xtext Tycho tests that pass solidly.

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Jan 17, 2018 17:09

Most of the UI Xtext tests are disabled for M5. Must be manually tested by JUnit plugin launches.

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Mar 21, 2018 09:42

(In reply to Ed Willink from comment #12)

Aargh! There is something hysterical.

Trying again, hopefully with some fixes in place.

The following could be successfully re-enabled, one at a time cumulatively

So maybe just CompletionProposalTests are not Tycho-=compatible.

So if there is an inexplicable exit, disable tests, with clean restarts before retrying to see if the right ones have been disabled.

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Mar 21, 2018 10:35

(In reply to Ed Willink from comment #15)

Disabling CompletionProposalTests again and we still get the inexplicable exit. Hysteresis. Restart Eclipse. Clean Rerun ok again.

A theory, partly based on earlier observations.

If the test code is badly coded so that for instance a "Add OCL Nature" popup stalls the test, which nonetheless is diagnosed via timeouts as a pass, following the test some thread may be stalled. If Tycho is also badly coded to fail to diagnose thread/access failures.

Initial failure, due to bad test, 'crashes' Tycho and leaves a stale thread holding a lock on some important file, e.g. surefire-reports/XXX.xml.

Thereafter, despite no bad tests remaining, the locked file 'crashes' Tycho once again during its write attempt giving no intelligible diagnostic.

The problem persists until some action allows/forces the stalled thread to exit.

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Mar 30, 2018 08:37

(In reply to Ed Willink from comment #15)

So maybe just CompletionProposalTests are not Tycho-=compatible.

Suspending the problem test seems to solve this problem, so we have a combination of a bad test, Bug 526252, and bad diagnosis, Bug 529928 and Bug 529929. This bug can therefore be closed.

eclipse-ocl-bot commented 2 hours ago

By Ed Willink on Apr 03, 2018 14:33

The need for a brutal kill threads for the OCL DebuggerTests seems to confirm the rogue thread lock theory.