eclipse / xtext

Eclipse Xtext™ is a language development framework
http://www.eclipse.org/Xtext
Eclipse Public License 2.0
767 stars 321 forks source link

Timeout on Tycho 4/5 builds #2895

Closed cdietrich closed 8 months ago

cdietrich commented 9 months ago

Tycho 4/5 builds regularly get killed due to hanging tests it seems needs to be investigated

cc @LorenzoBettini @szarnekow

https://ci.eclipse.org/xtext/job/xtext/job/cd_tycho40/ https://ci.eclipse.org/xtext/job/xtext/job/cd_tycho50/

cdietrich commented 9 months ago

SeveralEditorsQueuedBuildTest seems to be one culprit, SaveWithReconciliationQueuedBuildDataTest another one

LorenzoBettini commented 9 months ago

Only in Jenkins? Or also on GitHub Actions?

cdietrich commented 9 months ago

does this run on actions daily at all. not to MY knowledge

LorenzoBettini commented 9 months ago

I meant the workflows that ran the first time you pushed those branches.

cdietrich commented 9 months ago

this branches are month old and updated quite seldom only. but i usually dont check github action results for them

cdietrich commented 9 months ago

wonder if there is an option to get a jstack / thread dump on timeout

cdietrich commented 9 months ago

found https://lists.apache.org/thread/mc276n4z31kqomk2zdfm7ybw21k7bz4v

cdietrich commented 9 months ago

we might also have to set a test timeout. right now its the jenkins job that times out

LorenzoBettini commented 9 months ago

I don't know if that's the case, but the last time I saw a stuck job on Xtext GitHub Actions, which I had to cancel, was due to a small bug I introduced in my fork that caused Xtend renaming, throwing an exception. Of course, I could reproduce the bug locally: everything was stuck due to the Eclipse dialog showing the uncaught exception. With that dialog on the way, the tests did not proceed at all...

cdietrich commented 9 months ago

hang can be reproduced locally @szarnekow any idea a2.log what are we waiting for there

cdietrich commented 9 months ago

maybe org.eclipse.xtend.ide.tests.builder.JavaEditorExtension.waitForElementChangedEvent(int, ()=>void) should get a timeout. but that would lead to flaky tests i guess

szarnekow commented 9 months ago

I'm pretty sure there should be an event since we change the content of the editor. Very surprising.

LorenzoBettini commented 9 months ago

If I understand correctly this problem shows up only since Tycho 4, right? And it is a race condition? I wonder if in Tycho they introduce some default configuration for parallel tests that break these tests.

cdietrich commented 9 months ago

saw it run as plugin test.

LorenzoBettini commented 9 months ago

saw it run as plugin test.

What do you mean? It runs fine when executed from Eclipse?

cdietrich commented 9 months ago

No it hang on the 10th attempt or so

LorenzoBettini commented 9 months ago

Maybe trying with org.eclipse.xtend.ide.tests.builder.JavaEditorExtension.VERBOSE set to true and see what happens? I cannot reproduce it locally.

By the way,

    def waitForElementChangedEvent(int eventMask, =>void producer) {
        if (VERBOSE) {
            println('''start waiting for an element changed event: «eventMask»''')
        }
        val changed = new AtomicBoolean(false)
        JavaCore.addElementChangedListener(
            [
                JavaCore.removeElementChangedListener(self)
                if (!changed.get) {
                    changed.set(true)
                    if (VERBOSE) {
                        println(it)
                    }
                }
            ], eventMask)
        producer.apply
        while (!changed.get) {
            if (Display.getCurrent() !== null) {
                while (Display.getDefault().readAndDispatch()) {
                    // process queued ui events
                }
            }
        }
        if (VERBOSE) {
            println('''end waiting for an element changed event: «eventMask»''')
        }
    }

maybe it should never be the case but if Display.getCurrent() is null we never exit that while loop.

cdietrich commented 9 months ago

log says start waiting for an element changed event: 1 with verbose. 15th attempt this time which tp do you use?

cdietrich commented 9 months ago

grafik

cdietrich commented 9 months ago

maybe in recent jdt something about reconciliation has changed that it never happens

LorenzoBettini commented 9 months ago

which tp do you use?

The latest. Reloaded this morning.

cdietrich commented 9 months ago

am also using latest

LorenzoBettini commented 9 months ago

What does the screenshot want to show? Is the program stuck there?

cdietrich commented 9 months ago

yes. so the event is never produced

LorenzoBettini commented 9 months ago

I cannot reproduce it on macOS m1 nor on a slower laptop with Linux...

On the screenshot, the console shows that the event has been caught and printed on the screen.

Could you please try this?

        while (!changed.get && Display.getDefault().readAndDispatch()) {
            if (VERBOSE) {
                println("readAndDispatch")
            }
        }
cdietrich commented 9 months ago

leads to test errors with

/*******************************************************************************
 * Copyright (c) 2024 itemis AG (http://www.itemis.eu) and others.
 * This program and the accompanying materials are made available under the
 * terms of the Eclipse Public License 2.0 which is available at
 * http://www.eclipse.org/legal/epl-2.0.
 * 
 * SPDX-License-Identifier: EPL-2.0
 *******************************************************************************/
package org.eclipse.xtend.ide.tests.builder;

import org.junit.runner.RunWith;
import org.junit.runners.Suite;
import org.junit.runners.Suite.SuiteClasses;

/**
 * @author dietrich - Initial contribution and API
 */
@RunWith(Suite.class)
@SuiteClasses({
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
    SeveralEditorsQueuedBuildTest.class,    
})
public class DullySuite {

}
LorenzoBettini commented 9 months ago

Yes, on this line (assertThereAreNotDeltas)

        barEditor.save
        queuedBuildDataContribution.unconfirmedDeltas.assertThereAreDeltas("mypackage.Bar")
        queuedBuildData.andRemovePendingDeltas.assertThereAreNotDeltas
LorenzoBettini commented 9 months ago

...but only sometimes on my computer (though I'm not using your suite)

cdietrich commented 9 months ago

see the hang also with the suite sometimes

LorenzoBettini commented 9 months ago

I managed to see the hang with your suite. Twice, it hangs in closeEditorWithChanges. If I close the welcome view at that point, the editor is not closed. If I manually close the editor, the suite goes on with that test failed at:

java.lang.AssertionError: There are not deltas
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.assertTrue(Assert.java:42)
    at org.eclipse.xtend.ide.tests.builder.AbstractQueuedBuildDataTest.assertThereAreDeltas(AbstractQueuedBuildDataTest.java:99)
    at org.eclipse.xtend.ide.tests.builder.SeveralEditorsQueuedBuildTest.closeEditorWithChanges(SeveralEditorsQueuedBuildTest.java:130)

I can't seem to be able to reproduce it if I manually close the welcome screen when the suite starts; could you please try as well?

cdietrich commented 9 months ago

thought we have a close welcome screen somewhere in a base class

cdietrich commented 9 months ago

does not help grafik

LorenzoBettini commented 9 months ago

@cdietrich I pushed my branch based on yours and merged with main.

I think it's better to do like what I was proposing (https://github.com/eclipse/xtext/commit/245619f63b88708a308657b0be020ddab1963426): the nested while is prone to infinite loop, I would say it's wrong from the concurrency/synchronization point of view. It's better to exit the while and let the test fail; we can rely on our configuration for re-running flaky tests. Let's see how the build is going.

I'm not familiar with those tests, but it looks like it's due to some platform/jdt synchronizations.

cdietrich commented 9 months ago

yes i hope for some input from @szarnekow

LorenzoBettini commented 9 months ago

On GitHub Actions with my change a flake is detected (https://pipelinesghubeus7.actions.githubusercontent.com/xwd7D2dTH7KjkIQFqrWM6jPOzIKPN7cIWrsK5HktZyQ1MMY4jS/_apis/pipelines/1/runs/1292/signedlogcontent/5?urlExpires=2024-01-09T17%3A27%3A18.7778195Z&urlSigningMethod=HMACV1&urlSignature=HdvChU33DWHSe%2F5nHRm4MK91p%2BI%2BSlUjBbrcbFMkphI%3D):

Flakes: 
org.eclipse.xtend.ide.tests.builder.SeveralEditorsQueuedBuildTest.saveSeveralEditorsOneByOne
Run 1: SeveralEditorsQueuedBuildTest.saveSeveralEditorsOneByOne:66->AbstractQueuedBuildDataTest.assertThereAreDeltas:99->Assert.assertTrue:42->Assert.fail:89 There are not deltas
Run 2: PASS

On Jenkins with your branch with Tycho 5 everything runs smoothly... did you change anything on that branch?

szarnekow commented 9 months ago

Note that the test was sane and should succeed in any case since we certainly do want to see a JDT event within a reasonable time budget. I'm not aware of any JDT changes that would explain why this should no longer be the case.

szarnekow commented 9 months ago

I see that the linked commit https://github.com/eclipse/xtext/issues/2895#:~:text=I%20was%20proposing%20(-,245619f,-)%3A%20the%20nested%20while uses Display.getDefault yet it's still failing as flaky. So it's unlikely that Tycho 4 uses a different thread for the display.

LorenzoBettini commented 9 months ago

On Jenkins it succeeds: https://ci.eclipse.org/xtext/job/xtext/job/lb_tycho40_2895/1/console

The nested while though can lead to an infinite loop as was happening in a few CI runs. I mean, if there's a problem in JDT, with the current shape, if we don't get a JDT event the code loops forever. With a single while, the test fails without looping forever. If I understand it correctly.

LorenzoBettini commented 9 months ago

I believe that this is really related to a bug in platform, not our fault.

As I said, when I see the test hang, or fail with my commit https://github.com/eclipse/xtext/commit/245619f63b88708a308657b0be020ddab1963426, the dirty editor is not effectively closes, despite the call to close editor returned correctly; the editor is still there and usable. The same holds when the hanging or failing test programmatically saves an editor: you can see that the editor still appears dirty.

I've been hit by the same problem in my Xtext development workspace with 2023-04 M1: I open an editor, make some changes, and save, and the editor is still dirty. If I close the editor, I'm asked to save the changes, and I accept. I reopen the file, and the file has not been saved. Unfortunately, I constantly see that every day on the first opened file after I started Eclipse :(

In any case, I think we should protect our UI tests from such bugs, avoiding the test to hang the CI. Note that when it hangs, it is not simply in a deadlock: it's eating all the resources since it continuously executes the while loop, so it's a critical problem.

My original proposal makes the test too flaky, I agree; but I can propose an alternative solution that at least does not hang and fails after some attempts. When we hit the problem, the test becomes flaky, but it's less likely, and at least the build will try to re-execute that.

LorenzoBettini commented 9 months ago

Besides what I wrote above, it's not a Tycho 4/5 problem anymore: it officially appears also on the main branch: https://ci.eclipse.org/xtext/job/xtext/job/main/820/console:

08:08:22  Running org.eclipse.xtend.ide.tests.builder.SeveralEditorsQueuedBuildTest
12:44:40  Cancelling nested steps due to timeout
12:44:40  Sending interrupt signal to process