carlobaldassi closed this 4 weeks ago
Just a couple of passing comments:
@IanButterworth
- Should we add the reproducer as a test? Maybe we need to check out General at a fixed commit to do so? or do we have a smaller MWE?
I have now added tests in a new commit, but I had to work around the difficulty of finding an MWE that I mentioned in the OP. So the tests are based on a reduced version, which would actually work fine even before this patch, except that it would take a couple of minutes or so; with the patch, it takes less than a second. So the test uses a timeout with a limit of 10 seconds, after which it interrupts the resolver.
In order to do this, I had to implement a timeout macro and add an explicit `yield` call in the middle of the resolver code (here) to make it interruptible. From what I understand, this is basically unavoidable, but I am no expert, and thus I'm not very confident that my approach is entirely correct or the best one. I also don't know whether it has drawbacks. Probably someone more familiar with scheduling, tasks, async calls, etc. should have a look.
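For reference, a minimal sketch of what a timeout macro along these lines could look like (this is not the code from the PR; it's an illustration of the pattern, assuming the timed expression cooperates by calling `yield()` periodically):

```julia
# Run `expr` in a task and interrupt it with an InterruptException if it
# exceeds `seconds`. The expression must call `yield()` now and then to be
# interruptible — which is exactly why the resolver needed an explicit yield.
macro timeout(seconds, expr)
    quote
        task = @task $(esc(expr))
        schedule(task)
        timer = Timer($(esc(seconds))) do _
            istaskdone(task) || schedule(task, InterruptException(); error=true)
        end
        try
            fetch(task)   # returns the result, or throws if the task failed
        finally
            close(timer)
        end
    end
end
```

With this, a test can do `@timeout 10 resolve(...)` and fail cleanly instead of hanging if the resolver takes too long.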
- Some of these changes feel adjacent to "Initialy display of possible versions doesn't take into account yanked versions" (#3817), just in case the fix to that is obvious in the context of this
I don't think that issue can be addressed in this context.
For the timeout stuff: instead of `yield` and `throwto` (which AFAIK wouldn't be robust), what about replacing the `yield` with a check of the elapsed time since the resolve started, against a global timeout variable that's high for normal usage and that tests could reduce for speed?
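As I understand it, the suggestion amounts to something like the following sketch (names are made up for illustration, not actual Pkg internals):

```julia
# Compare elapsed time against a global limit at convenient points inside the
# main loop, instead of interrupting the task from the outside.
const RESOLVE_MAX_TIME = Ref(300.0)   # seconds; a test could lower this

function run_with_deadline(f, items; max_time::Float64 = RESOLVE_MAX_TIME[])
    t0 = time()
    for item in items
        time() - t0 > max_time && error("resolver exceeded time limit ($max_time s)")
        f(item)   # one unit of resolver work
    end
    return nothing
end
```

The check is cheap (one `time()` call per iteration) and needs no task machinery, at the cost of only being noticed at loop boundaries.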
Is this good to go?
It might be good to go, but then there was @IanButterworth's comment about setting a timeout in the resolver itself. I have been thinking about how to do it but didn't get around to giving it a try due to lack of time, sorry. I'll give it a go as soon as I can.
I just pushed a different commit, superseding the previous one. Now I have added a timeout to the solver (actually just to the maxsum part), with a default value of 300 seconds, which can be set with `ENV["JULIA_PKG_RESOLVE_MAX_TIME"]`. In the new tests, the maximum time is set to 10 seconds so that we check that the changes have the desired effect. Then there is also a test with a ridiculously short time, so as to test the timeout mechanism itself.
One slightly annoying side effect is that I had to introduce a new kind of error, `ResolverTimeoutError`, so as to distinguish unsatisfiable constraints from timed-out runs. The error message suggests changing the setting.
The implementation is not particularly sophisticated or precise: it's based on a `Timer` that gets checked before attempting to set any new batch of package versions. I still had to introduce a `yield` call just before checking the `Timer`, in order to give it a chance to update.
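In sketch form, the mechanism described above looks roughly like this (the names and loop structure are illustrative, not the actual maxsum internals):

```julia
# A flag flipped by a Timer, checked at the top of each batch of work,
# with a yield() so the timer callback gets a chance to run.
struct ResolverTimeoutError <: Exception
    max_time::Float64
end

function maxsum_sketch(batches;
        max_time = parse(Float64, get(ENV, "JULIA_PKG_RESOLVE_MAX_TIME", "300")))
    timed_out = Ref(false)
    timer = Timer(_ -> (timed_out[] = true), max_time)
    try
        for batch in batches
            yield()                                   # let the Timer fire
            timed_out[] && throw(ResolverTimeoutError(max_time))
            # ... process this batch of package versions ...
        end
    finally
        close(timer)
    end
    return nothing
end
```

Since the flag is only inspected between batches, the actual run can overshoot the limit by up to one batch's worth of work, which matches the "not particularly precise" caveat.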
If this looks ok then it should be good to go.
Of course I'm open to suggestions; in particular, I picked the default timeout value quite arbitrarily.
Sounds good to me!
This fixes #3878 by adding an extra step during graph simplification, which I'm calling "version validation".
The validation consists of going through each version of each package (including the "uninstalled" pseudo-version), temporarily pinning it, and checking whether constraint propagation would lead to impossible scenarios. Versions that do can be disabled, further simplifying the graph. Most importantly, it may turn out that all versions of a package are invalid, in which case the resolver has proved that the problem is unsatisfiable and it stops with an error.
To make this a little faster, I'm using a heuristic in which we don't actually check all versions of a package: we stop as soon as a valid version is found (a proper one, not "uninstalled"), starting from higher versions and going down. This retains the property that impossible packages are still detected.
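The heuristic above can be sketched as follows (a toy stand-in: `propagate_ok(v)` here abstracts the real pin-and-propagate check on the graph, and the uninstalled pseudo-version and graph updates are omitted):

```julia
# Scan versions from highest to lowest; collect invalid ones (which the real
# code can disable in the graph) and stop at the first valid version.
# If no version is valid, the package is impossible and the whole
# problem is unsatisfiable.
function validate_versions(versions::Vector{VersionNumber}, propagate_ok)
    invalid = VersionNumber[]
    for v in sort(versions; rev = true)       # higher versions first
        propagate_ok(v) && return invalid     # early stop: found a valid one
        push!(invalid, v)                     # this version can be disabled
    end
    error("no valid version found: the requirements are unsatisfiable")
end
```

The early stop is what keeps the overhead low: in the common case only the top one or two versions of each package are actually propagated.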
In my experiments, the overhead seems acceptable (nearly unnoticeable most of the time; sometimes it even makes things faster by passing a simpler graph to the solver). However, I would urge other people to give this a try, especially anyone with projects involving many dependencies.
The main commit is the last one, the others are largely independent (please don't squash them); I may create a separate PR for each if deemed necessary. It's all stuff that came out while I was working on the main issue.
Besides speed concerns, there are two more minor issues:
- I haven't added new tests specific to the problem in #3878. The problem is that in order to trigger the issue we need such a big graph that just parsing it takes minutes (locally, I have been using a 15Mb text data file). (Edit: now there are tests.)
- Versions removed by the validation step are reported in the resolution log (e.g. "Optimization [7f7a1694]: restricted by version validation to versions: 3.5.0 - 3.19.3 (7/22 versions removed)"). If all versions of a package are removed, then we say so, and trigger a failure by providing an example log of what would happen when installing one of the versions (example below).

Advice/comments about these points would also be welcome.
Below is an example of the validator catching an impossible situation (the same one described here) before the actual optimizer runs (this is on julia master; resolving takes 2 or 3 seconds on my laptop). Here it was found that `RigidBodyDynamics` cannot be installed. So, for the sake of providing a log, its version was pinned at 2.4.0, and then propagating the constraints triggered an issue with `ModelingToolkit`. If you look closely, you'll see that in the `RigidBodyDynamics` sub-tree there is the sentence "pinned to version 2.4.0 during version validation".