bndtools / bnd

Bnd/Bndtools. Tooling to build OSGi bundles including Eclipse, Maven, and Gradle plugins.
https://bndtools.org
Other
532 stars 305 forks source link

[quickfix] Integration test is flaky #4253

Closed kriegfrj closed 3 years ago

kriegfrj commented 4 years ago

There appears to be a race condition in the startup of the integration test that causes the beforeAll() method of the integration test to fail to get a local repo:

https://github.com/bndtools/bnd/blob/b4cddedfe8f233e1005828d00f3c83a1345611bc/bndtools.core.test/src/bndtools/core/test/editors/quickfix/BuildpathQuickFixProcessorTest.java#L77-L91

This needs to be fixed.

bjhargrave commented 4 years ago

This is still happening. https://github.com/bndtools/bnd/pull/4278/checks?check_run_id=1020131734#step:5:587

bjhargrave commented 4 years ago

https://github.com/bndtools/bnd/runs/1031734002?check_suite_focus=true#step:5:743

kriegfrj commented 4 years ago

Yes, I've been keeping an eye on it while working on #4279. Recording some observations to aid further investigation:

  1. It (or similar problems) happened occasionally even while continuous testing. When continuous testing the quick fix processor, the core Eclipse and Bndtools don't restart. So whatever the problem is, it probably doesn't have to do with Bndtools/JDT startup issues.
  2. I managed to dump a snapshot of the workspace resource hierarchy at one point when it happened - it looked normal.

So I think the issue lies somewhere in the code that initialises the Eclipse projects. The WorkbenchExtension creates a temp directory to use as an empty workspace at startup, and then the WorkspaceImporter is used to import the template workspace from the resources folder into the temporary workspace to work on. It seems that sometimes, for some reason, Bndtools is not picking up the changes to the workspace. Sometimes this results in bndProject coming back null, and sometimes it results in the LocalIndexedRepo coming back null.

One theory is that despite the synchronisation efforts, the attempt to fetch the Bnd config is a bit too early and the workspace isn't ready. Bndtools caches the config and then doesn't notice when the workspace is updated.

Any other suggested avenues of investigation welcomed.

bjhargrave commented 4 years ago

https://github.com/bndtools/bnd/pull/4285/checks?check_run_id=1049644892#step:5:621

bjhargrave commented 4 years ago

https://github.com/bndtools/bnd/runs/1102242709#step:5:613

kriegfrj commented 4 years ago

Has this disappeared? Or have you simply given up reporting when it happens?

bjhargrave commented 4 years ago

Has this disappeared?

No.

Or have you simply given up reporting when it happens?

I forget. You can look in https://github.com/bndtools/bnd/actions?query=workflow%3A%22CI+Build%22 and most of the failed builds are because of this issue. But sometimes, I rerun the jobs since I need a clean build to publish the artifacts.

bjhargrave commented 4 years ago

https://github.com/bndtools/bnd/pull/4342/checks?check_run_id=1215372978#step:5:596

kriegfrj commented 4 years ago

Wishful thinking then... :) I've been thinking about this one and i think i might know where to start looking...

bjhargrave commented 4 years ago

I updated the testOSGi task to ignore failures as this test case is failing too much and driving me mad!

bjhargrave commented 3 years ago

I disabled the testOSGi task in the CI build.

kriegfrj commented 3 years ago

The latest failures are interesting:

Execution failed for task ':bndtools.core.test:testOSGi'.

java.io.EOFException: No input provided for project

This looks like a meta-failure rather than a failure of the test cases themselves. It would be nice if we had a stack trace of that exception.

bjhargrave commented 3 years ago

That exception is from inside Gradle. Run the command with -s and it will show the stacktrace. It is caused because the test vm fails to send any output. I think that perhaps my adding ignoreFailures = true has allowed gradle to continue to the point where it then triggers the eof exception.

wborn commented 3 years ago

java.io.EOFException: No input provided for project

It would be nice if we had a stack trace of that exception.

We often run into these exceptions when running integrations tests with parallel Maven builds on Jenkins, see this job.

I just ran a build with debug logging -X and this is the stacktrace showing:

[ERROR] Failed to execute goal biz.aQute.bnd:bnd-testing-maven-plugin:5.2.0:testing (default) on project org.openhab.core.model.item.tests: No input provided for project -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal biz.aQute.bnd:bnd-testing-maven-plugin:5.2.0:testing (default) on project org.openhab.core.model.item.tests: No input provided for project
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:834)
Caused by: org.apache.maven.plugin.MojoExecutionException: No input provided for project
    at aQute.bnd.maven.testing.plugin.TestingMojo.execute (TestingMojo.java:160)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:834)
Caused by: java.io.EOFException: No input provided for project
    at org.apache.tools.ant.Project.defaultInput (Project.java:1293)
    at org.apache.tools.ant.Project.demuxInput (Project.java:1315)
    at org.apache.tools.ant.DemuxInputStream.read (DemuxInputStream.java:73)
    at aQute.lib.io.IO.copy (IO.java:183)
    at aQute.libg.command.Command.execute (Command.java:167)
    at aQute.bnd.build.ProjectLauncher.launch (ProjectLauncher.java:374)
    at aQute.launcher.plugin.ProjectLauncherImpl.launch (ProjectLauncherImpl.java:197)
    at aQute.tester.junit.platform.plugin.ProjectTesterImpl.test (ProjectTesterImpl.java:99)
    at aQute.bnd.build.Project.test (Project.java:2520)
    at aQute.bnd.maven.testing.plugin.TestingMojo.lambda$getOperation$1 (TestingMojo.java:207)
    at aQute.bnd.maven.lib.resolve.BndrunContainer.execute (BndrunContainer.java:179)
    at aQute.bnd.maven.testing.plugin.TestingMojo.execute (TestingMojo.java:157)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:834)
bjhargrave commented 3 years ago

This is caused by some kind-of-broken code in ant:

https://github.com/apache/ant/blob/635ccfb5e4316be952241aae26cc36f3aa38f79b/src/main/org/apache/tools/ant/Project.java#L1302-L1304

InputStream.read should return -1 at end of stream, not EOFException. And Bnd is not expecting EOFExcepion for the purpose of reaching end of stream.

wborn commented 3 years ago

I did some more debugging of this annoying exception and it seems that in our case System.in is often (not always) assigned a DemuxInputStream after running the spotbugs-maven-plugin:spotbugs goal in our Maven reactor. Normally System.in is a java.io.BufferedInputStream. I'll see if it is still an issue with the latest version of that plugin or if I can workaround it.

wborn commented 3 years ago

The aQute.lib.io.IO.copy also sometimes has issues with java.io.BufferedInputStream in multi threaded Maven builds:

[ERROR] Failed to execute goal biz.aQute.bnd:bnd-testing-maven-plugin:5.2.0:testing (default) on project org.openhab.core.model.item.tests: Stream closed -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal biz.aQute.bnd:bnd-testing-maven-plugin:5.2.0:testing (default) on project org.openhab.core.model.item.tests: Stream closed
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:834)
Caused by: org.apache.maven.plugin.MojoExecutionException: Stream closed
    at aQute.bnd.maven.testing.plugin.TestingMojo.execute (TestingMojo.java:160)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:834)
Caused by: java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getBufIfOpen (BufferedInputStream.java:176)
    at java.io.BufferedInputStream.read (BufferedInputStream.java:342)
    at aQute.lib.io.IO.copy (IO.java:183)
    at aQute.libg.command.Command.execute (Command.java:167)
    at aQute.bnd.build.ProjectLauncher.launch (ProjectLauncher.java:374)
    at aQute.launcher.plugin.ProjectLauncherImpl.launch (ProjectLauncherImpl.java:197)
    at aQute.tester.junit.platform.plugin.ProjectTesterImpl.test (ProjectTesterImpl.java:99)
    at aQute.bnd.build.Project.test (Project.java:2520)
    at aQute.bnd.maven.testing.plugin.TestingMojo.lambda$getOperation$1 (TestingMojo.java:207)
    at aQute.bnd.maven.lib.resolve.BndrunContainer.execute (BndrunContainer.java:179)
    at aQute.bnd.maven.testing.plugin.TestingMojo.execute (TestingMojo.java:157)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:834)

It might be an idea to catch these exceptions in IO.copy when they occur?

wborn commented 3 years ago

I've injected my own stream in System.in to see what code is sometimes closing it and it seems to be bnd itself:

java.lang.RuntimeException: !!! CLOSED !!!
    at org.openhab.tools.analysis.report.MyBufferedInputStream.close(MyBufferedInputStream.java:16)
    at aQute.lib.io.IO.close(IO.java:951)
    at aQute.lib.io.IO.close(IO.java:960)
    at aQute.libg.command.Command.execute(Command.java:188)
    at aQute.bnd.build.ProjectLauncher.launch(ProjectLauncher.java:374)
    at aQute.launcher.plugin.ProjectLauncherImpl.launch(ProjectLauncherImpl.java:197)
    at aQute.tester.junit.platform.plugin.ProjectTesterImpl.test(ProjectTesterImpl.java:99)
    at aQute.bnd.build.Project.test(Project.java:2520)
    at aQute.bnd.maven.testing.plugin.TestingMojo.lambda$getOperation$1(TestingMojo.java:207)
    at aQute.bnd.maven.lib.resolve.BndrunContainer.execute(BndrunContainer.java:179)
    at aQute.bnd.maven.testing.plugin.TestingMojo.execute(TestingMojo.java:157)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:186)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

This code might not be thread safe when other builder threads change System.in:

https://github.com/bndtools/bnd/blob/a211ec952529d2cf70ef7155f379064cd5c0457f/aQute.libg/src/aQute/libg/command/Command.java#L188-L189

bjhargrave commented 3 years ago

This code might not be thread safe when other builder threads change System.in

Lets move this discussion to #4509

bjhargrave commented 3 years ago

It is possible the fix in https://github.com/bndtools/bnd/pull/4503 may have solved the flakiness issue. I have not seen the test fail on my system in a while. Lets still leave this open for now.

kriegfrj commented 3 years ago

I had noticed that this had improved, but i doubt it was #4503 - the old flakiness was happening at startup before any tests start running, while the fix in #4503 doesn't kick in until after the tests start running. But i did make some other changes that may have helped this issue to settle. I'd been meaning to ask has it been fixed for a while?

kriegfrj commented 3 years ago

Have you seen any failures recently, @bjhargrave ?

bjhargrave commented 3 years ago

Have you seen any failures recently, @bjhargrave ?

It has been stable on my local system for a while. I think that after we finish 5.3.0, we can re-enable this test in the CI build.

kriegfrj commented 3 years ago

It has been stable on my local system for a while. I think that after we finish 5.3.0, we can re-enable this test in the CI build.

I'm pretty sure it was #4404 that stabilized this issue. There was a race condition with the importing of the workspace that happens during @BeforeAll - if the builder started to build before the cnf directory was successfully imported the build would have NPE'd and not finished. But usually the import of cnf would finish before the build got that far and so it would proceed as normal.

bjhargrave commented 3 years ago

I enabled the test in the CI build and it failed. https://github.com/bndtools/bnd/actions/runs/549787921

kriegfrj commented 3 years ago

Ahh, failed with a different issue - it seems to be a variant of #4503 has re-appeared.

kriegfrj commented 3 years ago

While working on #4553 I noticed this old failure re-appear again. 😞 At least #4404 seems to have significantly reduced its frequency, so perhaps the root cause is related or along those lines.

https://github.com/kriegfrj/bnd/runs/1906814721#step:5:1177

kriegfrj commented 3 years ago

Ok, thanks to the intensive debugging I did to uncover #4573, I think I have also uncovered the source of this intermittent test startup failure.

Test operation - clean workspace

Ideally, in order to have each tests start from a clean slate, you would set up a new Eclipse workspace for each test class run. Unfortunately, Eclipse cannot dynamically change its workspace directory at run time, so the only way to do this would be to restart Eclipse for each test class. This would slow down the test significantly.

So instead, I elected to do the next closest thing that I could: the same workspace directory is re-used for all tests during the run, in generated/workspace, but at the start of each test class it is wiped clean by the Jupiter extension WorkbenchExtension: https://github.com/bndtools/bnd/blob/b53c3b8568d7bc6800a74a05685c523c303d20c1/bndtools.core.test/src/bndtools/core/test/utils/WorkbenchExtension.java#L101-L108

The extension then imports the projects from the template bnd workspace (in the resources directory) into the active workspace in generated/workspace, before handing off to the main test class to do its thing. The template workspace includes the cnf directory.

So the important thing about this for the current issue is: when the CI first starts the test, the generated/workspace directory will not have a cnf directory.

CnfWatcher

The startup race condition

There are two possible startup orders that can yield the observed results:

  1. Workspace initialization when there is a cnf project: in this scenario, Central.getWorkspace() is called for the first time when the cnf project has already been imported and has not yet been deleted. The tester thread is normally starts early and so this is what normally happens. In this case, the memoized Workspace object will correspond to configuration in the cnf project that existed at that point in time.
  2. Workspace initialization when there is no cnf project: in this scenario, Central.getWorkspace() is called for the first time when there is no cnf project in the workspace. In this scenario:
    • When Workspace is first memoized, it reflects a missing cnf project (eg, no repositories).
    • Because I'm relying on autobuild, the first build often starts after the WorkbenchExtension finishes importing the cnf project.
    • This means that BndtoolsBuilder has not yet been loaded, and so CnfWatcher has not been installed.
    • This means that when the cnf project is subsequently imported, there is no watcher installed yet to react to the change.
    • Hence nothing triggers Central to re-initialize the Workspace object once the cnf project is finished importing.
    • Hence, the Workspace object continues to reflect a missing cnf project.

This also explains why I didn't see it often when testing locally (either from within Eclipse or from Gradle) - I didn't make a habit of cleaning my generated/workspace directory when running locally, so at startup there was usually a cnf project already in the workspace left over from the last run. The window where you may have been missing a cnf project (between deleting the pre-existing cnf project and importing the template again) was substantially shorter than if you started from an empty workspace. In CI, you're invariably starting from an empty workspace, so that window where there is no cnf project is substantially wider and more likely to trigger this race condition.

Having identified the problem, I think I can solve it simply by ensuring that BndtoolsBuilder is loaded before the cnf project is imported. Or by another (possibly more elegant) means.

kriegfrj commented 3 years ago

Having identified the problem, I think I can solve it simply by ensuring that BndtoolsBuilder is loaded before the cnf project is imported. Or by another (possibly more elegant) means.

Ok, my "more elegant means" is to manually call clear(), forceRefresh() and getPlugins() on the bnd Workspace object after the import is complete. And by batching this along with all of the import operations into a WorkspaceJob with a scheduling rule that locks the entire Eclipse workspace, I can fix #4573 at the same time as this import/refresh job will not be able to run concurrently with FlushJob, avoiding the possibility of deadlock.

I have this fix ready to submit as a PR, pending the merge of #4569.

pkriens commented 3 years ago

Nice work. I started to work on rewriting the builder and refresh in Eclipse since it is a mess written by people that had no clue (which prominently includes me). It will be nice to create test cases for this. For the first time :-)

bjhargrave commented 3 years ago

I have this fix ready to submit as a PR, pending the merge of #4569.

Merged.

kriegfrj commented 3 years ago

Having identified the problem, I think I can solve it simply by ensuring that BndtoolsBuilder is loaded before the cnf project is imported. Or by another (possibly more elegant) means.

Ok, my "more elegant means" is to manually call clear(), forceRefresh() and getPlugins() on the bnd Workspace object after the import is complete. And by batching this along with all of the import operations into a WorkspaceJob with a scheduling rule that locks the entire Eclipse workspace, I can fix #4573 at the same time as this import/refresh job will not be able to run concurrently with FlushJob, avoiding the possibility of deadlock.

I have this fix ready to submit as a PR, pending the merge of #4569.

Just for the record - the fix above hit similar deadlock issues. The fix that I settled on (currently awaiting merge in #4579) was slightly different:

By waiting for the workspace object to be ready before handing off to the test code, we're guaranteed that the LocalIndexedRepo will also be ready when we try and fetch it.

Seems to be stable now, so perhaps this issue can finally be put to bed and we can start working on other integration tests...

kriegfrj commented 3 years ago

Ok, as @bjhargrave reported here: https://github.com/bndtools/bnd/pull/4579#issuecomment-824351582, this issue is still not quite resolved.

I've had a look at the code again, and I think the issue might be here: https://github.com/bndtools/bnd/blob/6fc2e52edccfe28fde1c8a7f4d36caac1dd28359/bndtools.core.test/src/bndtools/core/test/utils/WorkspaceImporter.java#L94-L97

The WorkspaceImporter first deletes all projects in the workspace, then imports all the projects from the new workspace in preparation for the next test. This could actually result in onCnfWorkspace() being called twice - once for the delete of the old cnf project, and once for the import of the new cnf project. The intent is for this code to block until the import has finished and the Workspace object is fully initialised, but it may only end up blocking until the delete has finished. This leaves a race condition if the import doesn't manage to finish before the attempt to access the Local Index. Need to think of a way around this.

bjhargrave commented 3 years ago

This appears to be a consequent of reusing the running Eclipse instance/workspace across tests cases. If each test case started with its own non-shared eclipse instance and workspace, then no such issues could occur.