Closed merks closed 2 years ago
Duplicate of #84
You can find some insights here:
Yes, look similar but there are lots of mergeUses
that don't appear blocked. Maybe the additional details here are helpful for the other issue...
Actually this is not blocking at all, its just a lots of heavy computation going on, so the actual outcome of the dump depends on when it was taken.
Yes, sometimes its doing this:
"Start Level: Equinox Container: b75ed7a6-8ac6-4973-b0db-b15f20689f61" #44 daemon prio=5 os_prio=0 cpu=2597312.50ms elapsed=2751.45s tid=0x000001f6bb22b8f0 nid=0x9974 runnable [0x00000012618fd000]
java.lang.Thread.State: RUNNABLE
at org.apache.felix.resolver.util.OpenHashMapList.deepClone(OpenHashMapList.java:37)
at org.apache.felix.resolver.Candidates.copy(Candidates.java:1143)
at org.apache.felix.resolver.ResolverImpl.permuteUsedBlames(ResolverImpl.java:1598)
at org.apache.felix.resolver.ResolverImpl.checkPackageSpaceConsistency(ResolverImpl.java:1464)
at org.apache.felix.resolver.ResolverImpl.checkPackageSpaceConsistency(ResolverImpl.java:1524)
at org.apache.felix.resolver.ResolverImpl.checkConsistency(ResolverImpl.java:621)
at org.apache.felix.resolver.ResolverImpl.findValidCandidates(ResolverImpl.java:574)
at org.apache.felix.resolver.ResolverImpl.doResolve(ResolverImpl.java:437)
at org.apache.felix.resolver.ResolverImpl.resolve(ResolverImpl.java:420)
at org.apache.felix.resolver.ResolverImpl.resolve(ResolverImpl.java:374)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolveRevisions(ModuleResolver.java:1038)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolveRevisionsInBatch(ModuleResolver.java:992)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolve(ModuleResolver.java:909)
at org.eclipse.osgi.container.ModuleResolver.resolveDelta(ModuleResolver.java:172)
at org.eclipse.osgi.container.ModuleContainer.resolveAndApply(ModuleContainer.java:548)
at org.eclipse.osgi.container.ModuleContainer.resolve(ModuleContainer.java:503)
at org.eclipse.osgi.container.ModuleContainer.resolve(ModuleContainer.java:492)
at org.eclipse.osgi.container.Module.start(Module.java:446)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel$2.run(ModuleContainer.java:1847)
at org.eclipse.osgi.internal.framework.EquinoxContainerAdaptor$1$1.execute(EquinoxContainerAdaptor.java:136)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.incStartLevel(ModuleContainer.java:1840)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.incStartLevel(ModuleContainer.java:1781)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.doContainerStartLevel(ModuleContainer.java:1745)
- locked <0x0000000700383580> (a java.lang.Object)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.dispatchEvent(ModuleContainer.java:1667)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.dispatchEvent(ModuleContainer.java:1)
at org.eclipse.osgi.framework.eventmgr.EventManager.dispatchEvent(EventManager.java:234)
at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.run(EventManager.java:345)
But after waiting 45 minutes so far, it feels like a non-terminating NP-complete process.
I guess only that one thread is doing something
"Start Level: Equinox Container: b75ed7a6-8ac6-4973-b0db-b15f20689f61" #44 daemon prio=5 os_prio=0 cpu=2798765.62ms elapsed=2961.00s tid=0x000001f6bb22b8f0 nid=0x9974 runnable [0x00000012618fd000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap$HashIterator.nextNode(java.base@17.0.1/HashMap.java:1601)
at java.util.HashMap$KeyIterator.next(java.base@17.0.1/HashMap.java:1620)
at org.apache.felix.resolver.ResolverImpl.mergeUses(ResolverImpl.java:1040)
at org.apache.felix.resolver.ResolverImpl.computeUses(ResolverImpl.java:887)
at org.apache.felix.resolver.ResolverImpl.access$6(ResolverImpl.java:810)
at org.apache.felix.resolver.ResolverImpl$6.run(ResolverImpl.java:1237)
at org.apache.felix.resolver.ResolverImpl$EnhancedExecutor$1.run(ResolverImpl.java:2490)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.1/Executors.java:539)
at java.util.concurrent.FutureTask.run(java.base@17.0.1/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(java.base@17.0.1/ThreadPoolExecutor.java:2037)
at java.util.concurrent.ThreadPoolExecutor.reject(java.base@17.0.1/ThreadPoolExecutor.java:833)
at java.util.concurrent.ThreadPoolExecutor.execute(java.base@17.0.1/ThreadPoolExecutor.java:1365)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.execute(ModuleResolver.java:1553)
at org.apache.felix.resolver.ResolverImpl$EnhancedExecutor.execute(ResolverImpl.java:2502)
at org.apache.felix.resolver.ResolverImpl.calculatePackageSpaces(ResolverImpl.java:1233)
at org.apache.felix.resolver.ResolverImpl.checkConsistency(ResolverImpl.java:614)
at org.apache.felix.resolver.ResolverImpl.findValidCandidates(ResolverImpl.java:574)
at org.apache.felix.resolver.ResolverImpl.doResolve(ResolverImpl.java:437)
at org.apache.felix.resolver.ResolverImpl.resolve(ResolverImpl.java:420)
at org.apache.felix.resolver.ResolverImpl.resolve(ResolverImpl.java:374)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolveRevisions(ModuleResolver.java:1038)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolveRevisionsInBatch(ModuleResolver.java:992)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolve(ModuleResolver.java:909)
at org.eclipse.osgi.container.ModuleResolver.resolveDelta(ModuleResolver.java:172)
at org.eclipse.osgi.container.ModuleContainer.resolveAndApply(ModuleContainer.java:548)
at org.eclipse.osgi.container.ModuleContainer.resolve(ModuleContainer.java:503)
at org.eclipse.osgi.container.ModuleContainer.resolve(ModuleContainer.java:492)
at org.eclipse.osgi.container.Module.start(Module.java:446)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel$2.run(ModuleContainer.java:1847)
at org.eclipse.osgi.internal.framework.EquinoxContainerAdaptor$1$1.execute(EquinoxContainerAdaptor.java:136)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.incStartLevel(ModuleContainer.java:1840)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.incStartLevel(ModuleContainer.java:1781)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.doContainerStartLevel(ModuleContainer.java:1745)
- locked <0x0000000700383580> (a java.lang.Object)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.dispatchEvent(ModuleContainer.java:1667)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.dispatchEvent(ModuleContainer.java:1)
at org.eclipse.osgi.framework.eventmgr.EventManager.dispatchEvent(EventManager.java:234)
at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.run(EventManager.java:345)
But after waiting 45 minutes so far, it feels like a non-terminating NP-complete process.
NP problems are still terminating but might take very long, and what OSGi needs to compute here is the Boolean satisfiability problem what is NP-hard as per todays knowledge.
cc @jukzi
After 1 hour of thread dumps, it's still here:
"Start Level: Equinox Container: b75ed7a6-8ac6-4973-b0db-b15f20689f61" #44 daemon prio=5 os_prio=0 cpu=5011156.25ms elapsed=5276.92s tid=0x000001f6bb22b8f0 nid=0x9974 waiting on condition [0x00000012618fe000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.1/Native Method)
- parking to wait for <0x00000007b4af1830> (a java.util.concurrent.FutureTask)
at java.util.concurrent.locks.LockSupport.park(java.base@17.0.1/LockSupport.java:211)
at java.util.concurrent.FutureTask.awaitDone(java.base@17.0.1/FutureTask.java:447)
at java.util.concurrent.FutureTask.get(java.base@17.0.1/FutureTask.java:190)
at org.apache.felix.resolver.ResolverImpl$EnhancedExecutor.await(ResolverImpl.java:2522)
at org.apache.felix.resolver.ResolverImpl.calculatePackageSpaces(ResolverImpl.java:1196)
at org.apache.felix.resolver.ResolverImpl.checkConsistency(ResolverImpl.java:614)
at org.apache.felix.resolver.ResolverImpl.findValidCandidates(ResolverImpl.java:574)
at org.apache.felix.resolver.ResolverImpl.doResolve(ResolverImpl.java:437)
at org.apache.felix.resolver.ResolverImpl.resolve(ResolverImpl.java:420)
at org.apache.felix.resolver.ResolverImpl.resolve(ResolverImpl.java:374)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolveRevisions(ModuleResolver.java:1038)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolveRevisionsInBatch(ModuleResolver.java:992)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolve(ModuleResolver.java:909)
at org.eclipse.osgi.container.ModuleResolver.resolveDelta(ModuleResolver.java:172)
at org.eclipse.osgi.container.ModuleContainer.resolveAndApply(ModuleContainer.java:548)
at org.eclipse.osgi.container.ModuleContainer.resolve(ModuleContainer.java:503)
at org.eclipse.osgi.container.ModuleContainer.resolve(ModuleContainer.java:492)
at org.eclipse.osgi.container.Module.start(Module.java:446)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel$2.run(ModuleContainer.java:1847)
at org.eclipse.osgi.internal.framework.EquinoxContainerAdaptor$1$1.execute(EquinoxContainerAdaptor.java:136)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.incStartLevel(ModuleContainer.java:1840)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.incStartLevel(ModuleContainer.java:1781)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.doContainerStartLevel(ModuleContainer.java:1745)
- locked <0x0000000700383580> (a java.lang.Object)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.dispatchEvent(ModuleContainer.java:1667)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.dispatchEvent(ModuleContainer.java:1)
at org.eclipse.osgi.framework.eventmgr.EventManager.dispatchEvent(EventManager.java:234)
at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.run(EventManager.java:345)
I have seen similar stacktraces whenever i update my eclipse IDE too (but never longer then 10min), but do not know how to reproduce it in the IDE. If someone provides a reproducer that i can debug then i could try to make it faster.
But after waiting 45 minutes so far, it feels like a non-terminating NP-complete process.
NP problems are still terminating but might take very long, and what OSGi needs to compute here is the Boolean satisfiability problem what is NP-hard as per todays knowledge.
I wonder if it could be useful to use some simpler heuristic to compute an approcimative solution first and check if it is worth to compute an exact solution using an exact algorithm, for cases that are expected to run long.
Depending in the problem one can be lucky and have good-natured problem-instances that can be solved (almost) optimal with heuristic algorithms that terminatr in polynomial time.
But all that has to be done in the Felix resolver. For now In think the quickest fix ist what Christoph suggested in https://github.com/eclipse-equinox/equinox/issues/84#issuecomment-1191254395.
Just in case someone is interested in the details some can be found here: https://github.com/eclipse/tycho/issues/213#issuecomment-920008047
I wonder if it could be useful to use some simpler heuristic to compute an approcimative solution first and check if it is worth to compute an exact solution using an exact algorithm, for cases that are expected to run long.
Solving the NP problem is considered a millennium problem rewarded by 1 million US $ so please let us know when you find a solution this might solve the "e have not enough developer resources" problem as well :-)
But all that has to be done in the Felix resolver.
Exactly, or Equinox needs to implement an own resolver, I have recommended a while back already to think about merging P2+Equinox resolver but that's nothing to solve in a few hours.
At least both (P2 and Felix) can compute a solution fast when you disable use-clauses, so it might we worth to first compute a solution that is not unique and then start with the non-unique solution.
The problem in general is, that for all "simple" problems this already terminates fast, and the "problematic ones" run slow, so one might end with a similar long runtime anyways. I already noted that LSP4J/LSP4E are giving a hard time to the resolver so it heavily depends on what you are using, there was one example with birt a while back, but I can't reproduce it with the current master (e.g. with -Dtycho.version=2.7.4 -Dtycho.equinox.resolver.uses=true -Dtycho.equinox.resolver.batch.size=10000
)
Four hours later, and still running:
"Start Level: Equinox Container: b75ed7a6-8ac6-4973-b0db-b15f20689f61" #44 daemon prio=5 os_prio=0 cpu=21477562.50ms elapsed=22382.75s tid=0x000001f6bb22b8f0 nid=0x9974 runnable [0x00000012618fd000]
java.lang.Thread.State: RUNNABLE
at org.apache.felix.resolver.util.OpenHashMapList.deepClone(OpenHashMapList.java:35)
at org.apache.felix.resolver.Candidates.copy(Candidates.java:1143)
at org.apache.felix.resolver.ResolverImpl.permuteUsedBlames(ResolverImpl.java:1598)
at org.apache.felix.resolver.ResolverImpl.checkPackageSpaceConsistency(ResolverImpl.java:1464)
at org.apache.felix.resolver.ResolverImpl.checkPackageSpaceConsistency(ResolverImpl.java:1524)
at org.apache.felix.resolver.ResolverImpl.checkConsistency(ResolverImpl.java:621)
at org.apache.felix.resolver.ResolverImpl.findValidCandidates(ResolverImpl.java:574)
at org.apache.felix.resolver.ResolverImpl.doResolve(ResolverImpl.java:437)
at org.apache.felix.resolver.ResolverImpl.resolve(ResolverImpl.java:420)
at org.apache.felix.resolver.ResolverImpl.resolve(ResolverImpl.java:374)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolveRevisions(ModuleResolver.java:1038)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolveRevisionsInBatch(ModuleResolver.java:992)
at org.eclipse.osgi.container.ModuleResolver$ResolveProcess.resolve(ModuleResolver.java:909)
at org.eclipse.osgi.container.ModuleResolver.resolveDelta(ModuleResolver.java:172)
at org.eclipse.osgi.container.ModuleContainer.resolveAndApply(ModuleContainer.java:548)
at org.eclipse.osgi.container.ModuleContainer.resolve(ModuleContainer.java:503)
at org.eclipse.osgi.container.ModuleContainer.resolve(ModuleContainer.java:492)
at org.eclipse.osgi.container.Module.start(Module.java:446)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel$2.run(ModuleContainer.java:1847)
at org.eclipse.osgi.internal.framework.EquinoxContainerAdaptor$1$1.execute(EquinoxContainerAdaptor.java:136)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.incStartLevel(ModuleContainer.java:1840)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.incStartLevel(ModuleContainer.java:1781)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.doContainerStartLevel(ModuleContainer.java:1745)
- locked <0x0000000700383580> (a java.lang.Object)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.dispatchEvent(ModuleContainer.java:1667)
at org.eclipse.osgi.container.ModuleContainer$ContainerStartLevel.dispatchEvent(ModuleContainer.java:1)
at org.eclipse.osgi.framework.eventmgr.EventManager.dispatchEvent(EventManager.java:234)
at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.run(EventManager.java:345)
Here's what I did...
Use the installer's advanced mode to install Committer 2022-06 and on the second/Project page, include CDO's setup:
Let the whole thing finish creating and provisioning the IDE.
Use Navigate -> Open Setup -> Installation to open the Installation.setup. Double click the root object and change the product version to the 2022-09 committer version:
'
Be sure to save the file:
Use Help -> Perform Setup Tasks. You only need to run the p2 task:
This is the equivalent of updating 2022-06 to 2022-09 (M1 at the moment).
When it's done, the status area will flash a warning icon. Then you can click that icon and click finish to restart:
I'm not sure if such an IDE ever finishes starting, I'm letting it run further.
In other similar scenarios it "only" took 10-20 minutes to restart, but this one seems especially complicated/slow...
@merks I think @tjwatson maybe could tell while the timeout of two minutes do not apply here, but have you tried e.g.
-Dequinox.resolver.revision.batch.size=1
or
-Dequinox.resolver.batch.timeout=5000
Of course I never tried those and never would have thought to try such things. 😢
I tried this just now, and it starts immediately:
./eclipse.exe -clean -vmargs -Dequinox.resolver.revision.batch.size=1
I added the -clean to force it to do a lot of work. So goodness only knows what the one process has been doing for the past 5 hours that apparently can be done in a few seconds depending on the "batch size"...
Of course I never tried those and never would have thought to try such things. cry
They are mentioned in the link I mentioned jut right after your report :-)
So goodness only knows what the one process has been doing for the past 5 hours that apparently can be done in a few seconds depending on the "batch size"...
@tjwatson can better explain the details (see https://github.com/eclipse/tycho/issues/213#issuecomment-920008047 for some explanation), but strictly speaking it is about resolving "all at once" versus "resolve one after the other", sadly it seems there is no isolated test-case for this so investigate what exactly going on (maybe one can zip such a "wait for ever install" but it would still be huge), so maybe:
@laeubi In my defense I was in a hurry to get to the lake. It's 33C here today! 😱
I can empathize that it's annoying when the problem is complex and hard to reproduce. I have had a similar issue now in a number of installations already, so it seems not an isolated problem. One can imaging that if this happens to a naive user, they will be very unimpressed. If it happens to lots of users (after the 2022-09 release) that is just very bad for Eclipse as a whole. I've never seen this problem before so it seems that something has changed to make it happen at all or to make it much worse than ever was before. Any theories about what's different now?
Any theories about what's different now?
Something has changed :-)
But it is not "new" in that i is only happen now see for example mentioned here back in 2017 and here at 2015, the only thing I found out until now is that it seem to happen when there are some very narrow version requirements and many choices, maybe rexport and split packages makes it even worse then ...
Beside that I think @tjwatson might be able to give better guidance, according to him there should be a timeout of 2 minutes (!) before Equinox automatically switch to the batchsize=1
so even if thats a long time for a human being and decades for your computer, there is obviously something here that prevents this emergency stop switch to kick in ...
Do you happen to have a log where logs of resolution issues were logged during this process? The fact that you are seeing this happen during org.eclipse.osgi.container.Module.start(Module.java:446)
indicates to me that we got past the "big" resolves that either the EclipseStarter
did against the initial bundle set (typically only includes org.eclipse.equinox.simpleconfigurator
). But we do not see org.eclipse.core.runtime.adaptor.EclipseStarter.refreshPackages(Bundle[])
on the call stack so it seems we got past trying to resolve the newly installed simpleconfigurator.
What should happen next is the start of simpleconfigurator. It should uninstall/install all the necessary bundles for the update and then eventually call org.eclipse.equinox.internal.simpleconfigurator.ConfigApplier.refreshAllBundles()
to force all the bundles to be refreshed and resolved "at once". This approach was added some time ago to mimic what resolution would happen if you did a -clean
. But we get passed this resolution. But now we must be left with some bundles that could not resolve with that process because now when we try to increment the start-level to the final active start-level it will try to start all the bundle (most lazily). But even for lazy activated bundles the framework first checks if the bundle is resolved, if not it tries to resolve it (again!). It is here that we appear to be "taking forever".
If you have a state you can connect with a debugger it would be helpful to know if you did enter/exit org.eclipse.equinox.internal.simpleconfigurator.ConfigApplier.refreshAllBundles()
and are now continuing on to starting up to the final start-level.
I'm not sure if this is the case here, but the Felix resolver seems to have an issue with deciding which is the correct Provider to select among a list of candidates whose only differentiating attribute is version and in the micro range. Further, I believe that as the number of potential candidates grows the complexity becomes exponentially more difficult for the resolver to eliminate them because particularly for packages, the micro version range is essentially not a differentiation which means there's no criteria for eliminating those chains...
I'm not sure if this is the case here, but the Felix resolver seems to have an issue with deciding which is the correct Provider to select among a list of candidates whose only differentiating attribute is version and in the micro range. Further, I believe that as the number of potential candidates grows the complexity becomes exponentially more difficult for the resolver to eliminate them because particularly for packages, the micro version range is essentially not a differentiation which means there's no criteria for eliminating those chains...
It is true that if we have many requirements that can match many of the same candidate suppliers then the combinations that can be tried grows exponentially. If there is no solution then the algorithm will explode. I thought we had safe guards that would reduce this with a timeout. But here I fear we are in a situation where 100s of bundles are not resolvable and the framework keeps trying to re-resolve them each time we move on to try and start the next bundle in the list of bundles to start during initial start.
This reminds me of an idea I had implemented in the old Equinox resolver for reducing combinations by merging candidate lists when they had identical lists of candidates (https://bugs.eclipse.org/bugs/show_bug.cgi?id=181327)
Today if 100 bundles Import-Package: foo
and there are 3 different foo candidates that can match the requirement from each 100 bundles, but all 3 candidates always result in a conflict that is not solvable the resolver has 100^3 combinations to try before giving up. My idea in https://bugs.eclipse.org/bugs/show_bug.cgi?id=181327 was to instead treat the 100 requirements that all have the same candidates list as one list of 3 to permute over instead of 100 lists of 3 to permute over.
It has been a long time, but if I recall correctly this approach does throw out some valid combinations that could hold the secret solution that is consistent. But I also don't recall a single real situation where that caused an actual problem that did not resolve.
One reason I have not looked into this approach is because I thought we had a timeout "solution" that would back off after some timeout and try resolving a much smaller group of bundles such that the number of permutations to try was manageable. I suspect this is not getting exercised properly because we keep trying to resolve over and over as we try to start each unresolveable bundle.
I'll try to find time to remote debug tomorrow. I just killed the process now after one of the nearby nuclear power plants got overloaded...
@tjwatson
Yes we got past org.eclipse.equinox.internal.simpleconfigurator.ConfigApplier.refreshAllBundles() and then called org.eclipse.equinox.internal.simpleconfigurator.ConfigApplier.startBundles(Bundle[]).
It looks like this after a few minutes, so quite large maps involved.
I'm just not sure what useful information I should try to collect via the debug session.
Are there any logs in configuration/
folder with resolution errors? The log will be in configuration/
with some timestamp.log
name because we don't have a workspace yet.
One possible improvement could be to stash away a ResolutionReport
in the framework and keep track of the ModuleDatabase stamp. If resolution failed in one pass (I think from when ConfigApplier.refreshAllBundles()
got called) we just keep trying again and again each time an unresolved bundle is started later. If you have a large number of bundles that failed resolution then this will just happen over and over, likely with the exact same outcome from a ResolutionReport
. We could check the ModuleDatabase
timestamp and avoid another resolution if nothing has changed to add new capabilities for resolution.
One reason we cannot do this though is I think that it would violate the contract with ResolverHook
services. The specification for starting a bundle indicates that an attempt to resolve must be made if you try to start a bundle that is not resolved. This is observable from a ResolverHook
. So it is unclear we can just ignore that for each unresolved bundle that gets started.
No, there is no log:
After 1 hour, there is only one thread doing useful/interesting work:
Maybe there is useful detail in the variables the ResolverImpl.checkPackageSpaceConsistency(ResolveSession, Resource, Candidates, boolean, Map<Resource, Packages>, Map<Resource, Object>) method?
Maybe this is useful:
I'm not sure it ever leaves this method but rather stays in the loop "forever":
@merks it will breaks if allCandidates = null, but this might never be reached?
There is also the while condition. In any case, the point it seemly never gets the breakpoint on the return statement...
We see a similar delay in a client RCP application in the same code path. @merks did you solve this issue for you?
I've seen the same issue again yesterday creating a brand new 2022-09-base installation. The only workaround is using this option:
-Dequinox.resolver.revision.batch.size=1
But I really don't know the implication of that and it seems to me this problem is new to 2022-09. Hence I've already asked Any theories about what's different now?
New resolver implementation? More use of package imports? Changes to the TP that result in more different bundles but with the same exported packages leading to more complicated wiring problems to resolve? I don't know but no one seems to have a theory of what's new and different now. I'm happy to help debug things but I have no good knowledge of what's going on here...
I've seen the same issue again yesterday creating a brand new 2022-09-base installation. The only workaround is using this option:
-Dequinox.resolver.revision.batch.size=1
But I really don't know the implication of that and it seems to me this problem is new to 2022-09. Hence I've already asked
Any theories about what's different now?
New resolver implementation? More use of package imports? Changes to the TP that result in more different bundles but with the same exported packages leading to more complicated wiring problems to resolve? I don't know but no one seems to have a theory of what's new and different now. I'm happy to help debug things but I have no good knowledge of what's going on here...
I have seem multiple commits moving require-bundle to import packages (which is the recommended way according to the OSGi spec), so I would say, yes we have more import package statements.
@vogella Yes, I expect we do have more package imports. We've removed bundle includes from features which allows more flexibility of bundle versions. But whether that's a cause or not is just a theory backed up by zero facts.
I have a self-contained installation (using the steps above, but without a bundle pool) that reproduces the problem. Someone more knowledgeable could remote debug the problem. Unfortunately that zip is 783MB. I wonder how best to share such a very large zip?
@merks if you have a comparable installation, it might be good to simply diff the plugins folder to see whats new.
You also might reduce the size by strip of JustJ (if its included), and it might be possible to get a (temporary) FTP account on download.eclipse.org to updload the file? At least if such an install could be created in some automated/scripted way you can upload is using a custom build job on the jenkins CI ....
@tjwatson
I've temporarily put this eclipse.zip with the installation on a Google drive:
https://drive.google.com/file/d/1uwD9v71k0ygfiLj5MDXLrESIynT1LwpN/view?usp=sharing
Anyone who wants to try to help debug this problem, please download the zip sooner rather than later.
I cannot promise I will have the necessary time to debug this issue. If this is determined to be a blocking issue for release I would consider temporarily changing the default of the equinox.resolver.revision.batch.size
. When you set that to 1
does it result in no unresolved bundles?
@tjwatson I fully understand. I'm really not sure how prevalent this problem is nor how likely users are to encounter this updating from 2022-06 to 2022-09. I think @jonahgraham quite often tests that for EPP packages...
Good question about unresolved bundles because when I run like this:
$eclipse/eclipse.exe -debug -consolelog -data ws -vmargs -Dequinox.resolver.revision.batch.size=1 2>&1 | tee log
A whole whack of things appear to be failing!
I think @jonahgraham quite often tests that for EPP packages...
I do a test update of (typically) one EPP package from the last R release to the current build to make sure it upgrades cleanly. I haven't seen such startup issues.
By the way might be related Felix issue: https://issues.apache.org/jira/browse/FELIX-5648
@tjwatson
I've temporarily put this eclipse.zip with the installation on a Google drive:
https://drive.google.com/file/d/1uwD9v71k0ygfiLj5MDXLrESIynT1LwpN/view?usp=sharing
Anyone who wants to try to help debug this problem, please download the zip sooner rather than later.
I don't have windows available to try that out on, but I followed the steps you provided to reproduce and I think I have it in an endless resolve now.
I found one issue with the way Equinox uses the Felix Resolver implementation. We always try to resolve all the bundles (optionally) when calling something like Bundle.start
on an unresolved bundle. I don't think this is necessary and there is a bug in the way we do this because we don't let the second call to the resolver for the optional unresolved bundles (the ones not having Bundle.start called on them) to timeout. This is why we see cases where the algorithm explodes and never times-out.
I have a draft PR #95 but I didn't have time to test it out in the "Ed" scenario. @merks if you manage to have time to build this and try it out in your scenario that would be good. I don't think you will be altogether happy with the results though because I think you likely are in an unresolvable state here.
@tjwatson (or anyone else interested in this problem) if you like to easier debug this (no p2, no simpleconfigurator, no restarts...), I created a test-runner for this using the great connect framework facility:
test.ProductToTestCaseTransformer
and pass it the folder (e.g. /tmp/resolver-bug/eclipse)This blocks "forever" at the start (!) of the framework that is no user bundle is marked for start nor is it started at all (I removed all bundle activators and all DS components from the manifest declaration), all my 16 CPUs are getting nearly 100% CPU load and I killed it after a few minute to not overheat them ;-)
I did not set any special properties, just the framework location is set to the users tmp folder + "resolver-test"
@tjwatson I think you hit the nail on the head! Even with remote debug one can often use hot replace to make changes so I tried your change in my SDK IDE and then the remote debugged IDE started in just seconds (after the hot replace) with this in the log:
As I have seen this a lot in the logs:
because it is exposed to package 'org.w3c.dom.events' from resources org.eclipse.osgi [osgi.identity; osgi.identity="org.eclipse.osgi"; type="osgi.bundle"; version:Version="3.18.100.v20220726-1348"; singleton:="true"] and org.w3c.dom.events [osgi.identity; osgi.identity="org.w3c.dom.events"; type="osgi.bundle"; version:Version="3.0.0.draft20060413_v201105210656"] via two dependency chains
I also has noticed this in the past with Tycho and one can "fix" the problem by also importing org.w3c.dom
I've noticed lately that when I update a 2022-06 installation to 2022-09 (M1) that startup takes seemly forever.
Here's what the threads are doing.
resolver-threads.txt
At this point, only the splash screen shows.
Here's a later version of the data inline:
Any update to that installation takes forever or restart; we're talking more than 1/2 hour where you get the impression it's hung or deadlocked.
Did something change in the resolver implementation? Are we using so many more detailed OSGi dependencies (package imports with uses) that causes this to becomes super expensive?