petroseskinder opened this issue 7 years ago
The desire to enforce consistency/reproducibility through a policy is reasonable. However, I am skeptical whether taking a greedy approach (choosing the highest version whenever possible) is the best route to handle versioning conflicts.
The problem of finding an installable package configuration (i.e. one without any package conflicts) is known to be NP-complete; the reduction is from 3-SAT.
Greedy, as suggested above, fails prematurely. If there is a versioning conflict (even one induced by its own selections), it simply halts and reports the conflict. This makes the most sense when conflicts are rare. However, in large code bases with messy transitive dependency graphs, versioning conflicts are inevitable, and from what I understand, Bazel is targeting exactly these larger code bases.
Alternatively, one can handle versioning conflicts by either (1) augmenting the greedy approach with backtracking, or (2) modeling the dependency graph as a satisfiability statement and handing it to an external SAT solver. As for the pros and cons of each approach: performance-wise, they are comparable when there are few conflicting constraints, but as conflicts multiply, backtracking's performance degrades badly and SAT solving is the clear winner. That said, backtracking is significantly more predictable: one can explicitly state its heuristics (e.g. prefer the most recent package) and form expectations about what it will select. SAT solvers are more of a black box on this front.
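To make option (1) concrete, here is a minimal backtracking sketch. This is purely illustrative (the names, the exact-version constraint model, and the resolver itself are all mine, not Bazel's or Maven's): try the newest candidate version for each artifact, and undo a choice when it contradicts one already made.

```java
import java.util.*;

// Toy backtracking version resolver (illustrative only).
// candidates: artifact -> versions, ordered newest first (the heuristic).
// constraints: "artifact:version" -> exact versions it requires of other artifacts.
class BacktrackingResolver {
    static Map<String, String> resolve(Map<String, List<String>> candidates,
                                       Map<String, Map<String, String>> constraints) {
        return search(new ArrayList<>(candidates.keySet()), 0,
                      candidates, constraints, new HashMap<>());
    }

    private static Map<String, String> search(List<String> artifacts, int i,
            Map<String, List<String>> candidates,
            Map<String, Map<String, String>> constraints,
            Map<String, String> chosen) {
        if (i == artifacts.size()) return new HashMap<>(chosen);
        String artifact = artifacts.get(i);
        for (String version : candidates.get(artifact)) {  // newest-first heuristic
            if (!compatible(artifact, version, constraints, chosen)) continue;
            chosen.put(artifact, version);
            Map<String, String> result =
                search(artifacts, i + 1, candidates, constraints, chosen);
            if (result != null) return result;
            chosen.remove(artifact);  // backtrack: undo and try the next version
        }
        return null;  // no candidate works under the current partial assignment
    }

    private static boolean compatible(String artifact, String version,
            Map<String, Map<String, String>> constraints, Map<String, String> chosen) {
        // Earlier picks must not demand a different version of this artifact...
        for (Map.Entry<String, String> pick : chosen.entrySet()) {
            Map<String, String> reqs = constraints.getOrDefault(
                pick.getKey() + ":" + pick.getValue(), Collections.emptyMap());
            if (reqs.containsKey(artifact) && !reqs.get(artifact).equals(version))
                return false;
        }
        // ...and this pick's own requirements must not contradict earlier picks.
        Map<String, String> own =
            constraints.getOrDefault(artifact + ":" + version, Collections.emptyMap());
        for (Map.Entry<String, String> req : own.entrySet()) {
            if (chosen.containsKey(req.getKey())
                    && !chosen.get(req.getKey()).equals(req.getValue()))
                return false;
        }
        return true;
    }
}
```

Note how a greedy resolver would pick the newest version of everything and halt on the first contradiction, while this version retreats one decision and tries the next candidate.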
The pip community had a similar discussion [0], as did the go community [1].
Between the backtracking approach and SAT solving, I am leaning towards using an external SAT solver like SAT4j to handle versioning conflicts. This is the approach used by OPAM (OCaml's package manager) [2], Dart [3], Maven, and Eclipse [4].
You're probably familiar with the problem. However, if you need a refresher, this article [5] gives a pretty good overview on the topic.
[0] pip community discussion
[1] Go community discussion
[2] OPAM documentation on its SAT solver
[3] Dart documentation on its version solver
[4] Eclipse announcement of using SAT4j to handle external dependencies
[5] Overview article on the dependency-resolution problem
@hhclam, thoughts?
I'm concerned about this eagerly downloading an exponential # of pom files. How does that work with SAT4j? Can you hook in "callbacks" to get more poms when needed?
AFAICT, most Maven packages use "anything above X" for versions, so I think you could start with just tracking what versions packages are actually requesting (e.g., 1.2.3+ vs 1.2.3) and seeing if that's satisfiable without more info. `generate_workspace` isn't even doing that yet.
> I'm concerned about this eagerly downloading an exponential # of pom files.
That is a valid concern: we could unnecessarily download pom files. For example, if we had `group:artifact:3.4` and `group:artifact:3.5`, we could fetch transitive dependencies for both artifacts.
Before I respond to the rest of your comment, let me clarify what I had in my mind for how we integrate the solver. I imagine the sequence of steps would be to:

1. construct the dependency graph from the provided artifacts/pom files (including transitive dependencies)
2. transform the dependency graph into a boolean satisfiability statement
3. hand the statement to the SAT solver and translate its solution back into pinned versions
The first step is similar to how we currently do it, with one major distinction. Currently, if we see the same artifact with multiple versions, i.e. `group:artifact:3.4` and `group:artifact:3.5`, we only download the first version's pom:
```java
@Override
public ModelSource resolveModel(String groupId, String artifactId, String version)
    throws UnresolvableModelException {
  String ruleName = Rule.name(groupId, artifactId);
  // <---- notice how the version is not included in the key identifier
  if (ruleNameToModelSource.containsKey(ruleName)) {
    return ruleNameToModelSource.get(ruleName);
  }
  ...
}
```
The approach I am proposing would download pom files for each version of an artifact we encounter. I agree this can definitely be overzealous with simpler projects.
That said, we could cache the pom files after the first build, since AFAIK the pom file for a given version of an artifact should not change.
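A simplified sketch of that cache idea (hypothetical names, not the real resolver): key lookups on the full `group:artifact:version` coordinate, so distinct versions each get their own pom while repeated requests for the same version are served from memory. The safety of the cache rests on the assumption stated above, that a released version's pom never changes.

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical per-coordinate pom cache (illustrative only).
class PomCache {
    private final Map<String, String> cache = new HashMap<>();
    private int fetches = 0;  // counts actual downloads, for illustration

    String resolve(String groupId, String artifactId, String version,
                   Function<String, String> fetcher) {
        // Unlike the snippet above, the version IS part of the key.
        String key = groupId + ":" + artifactId + ":" + version;
        return cache.computeIfAbsent(key, k -> { fetches++; return fetcher.apply(k); });
    }

    int fetchCount() { return fetches; }
}
```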
> How does that work with SAT4j? Can you hook in "callbacks" to get more poms when needed?
From what I understand, SAT4j is a domain-independent SAT solver: it operates on the satisfiability/constraint statement rather than on our dependency graph.
If we lazily fetched pom files, we would be introducing new constraints mid-solve, and AFAIK SAT solvers operate on static statements, which makes lazy fetching a no-go.
Both Maven and Eclipse use SAT4j to handle transitive dependencies, so I can investigate how they go about it.
> AFAICT, most Maven packages use "anything above X" for versions, so I think you could start with just tracking what versions packages are actually requesting (e.g., 1.2.3+ vs 1.2.3) and seeing if that's satisfiable without more info.
I see. A few comments on this front.
This would definitely reduce the size of our precomputed dependency graph. If we found a node for 1.2.5, we would not need to construct 1.2.3's dependency subgraph. However, here are some concerns:

- The result becomes order-dependent. Suppose we have `X:2.3` and `b:1.2`, and suppose `X:2.3` depends on `b:1.4`. Depending on whether we traverse `X:2.3` or `b:1.2` first, the version of b varies between 1.2 and 1.4.
- One way to ensure consistency would be to use semantic versioning, e.g. "1.2.3" actually means [1.2.3, 2.0.0)? Maven packages are supposed to version based on MAJOR.MINOR.PATCH where only MAJOR changes are breaking. However, in practice, a nontrivial chunk of MINOR (1/3) and PATCH (1/4) changes cause breaks.

If we trust in semver and aren't paranoid, then we could treat all packages versioned 1.x.x equivalently: we just fetch the newest package below version 2.0.0. Again, this requires making a number of assumptions (or leaps of faith, for the more cynical) about how things are packaged.
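The "trust semver" reading can be sketched like this (all names are illustrative; this is not generate_workspace code): a request for 1.2.3 becomes the range [1.2.3, 2.0.0), and multiple requests for the same artifact are intersected, which simply keeps the tighter lower bound when the majors agree.

```java
import java.util.*;

// Illustrative semver-style range: "1.2.3" read as [1.2.3, 2.0.0).
class SemverRange {
    final int[] low;   // inclusive lower bound
    final int[] high;  // exclusive upper bound: the next major version

    SemverRange(String version) {
        low = parse(version);
        high = new int[] { low[0] + 1, 0, 0 };
    }

    static int[] parse(String v) {
        String[] p = v.split("\\.");
        return new int[] { Integer.parseInt(p[0]),
                           p.length > 1 ? Integer.parseInt(p[1]) : 0,
                           p.length > 2 ? Integer.parseInt(p[2]) : 0 };
    }

    static int compare(int[] a, int[] b) {
        for (int i = 0; i < 3; i++)
            if (a[i] != b[i]) return Integer.compare(a[i], b[i]);
        return 0;
    }

    boolean contains(String version) {
        int[] v = parse(version);
        return compare(low, v) <= 0 && compare(v, high) < 0;
    }

    // Intersection keeps the tighter lower bound; null if the majors
    // disagree (e.g. 1.x and 2.x can never both be satisfied).
    static SemverRange intersect(SemverRange a, SemverRange b) {
        if (a.low[0] != b.low[0]) return null;
        return compare(a.low, b.low) >= 0 ? a : b;
    }
}
```

The `null` case is exactly where this scheme reports a genuine conflict rather than silently picking a winner.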
cc: @hhclam
> 1. construct the dependency graph from provided artifacts/pom files (including transitive dependencies)
> 2. transform the dependency graph into a boolean satisfiability statement
The problem is I don't think you can make it to 2 without potentially downloading the universe.
Say we're depending on `foo:bar` at *any version* and there are 43 versions available, all with different sets of dependencies at different versions. We'd have to download them all and add them all to our equation before passing it to the SAT solver. That is a no-go, or at least a last resort.
AFAICT Maven just uses "nearest wins" for this (https://maven.apache.org/plugins/maven-dependency-plugin/examples/resolving-conflicts-using-the-dependency-tree.html). I don't know about Eclipse, if they're doing something smart, feel free to look into it.
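My reading of "nearest wins" from the linked doc, as a sketch (illustrative code, not Maven's implementation): walk the dependency tree breadth-first, and the first occurrence of each artifact, i.e. the one closest to the root, fixes its version; deeper occurrences are ignored.

```java
import java.util.*;

// Toy "nearest wins" mediation via breadth-first traversal.
class NearestWins {
    // deps: "group:artifact:version" -> its direct dependencies, in declaration order
    static Map<String, String> mediate(List<String> roots,
                                       Map<String, List<String>> deps) {
        Map<String, String> winner = new LinkedHashMap<>();  // "group:artifact" -> version
        Deque<String> queue = new ArrayDeque<>(roots);
        while (!queue.isEmpty()) {
            String coord = queue.removeFirst();
            int cut = coord.lastIndexOf(':');
            String artifact = coord.substring(0, cut);
            String version = coord.substring(cut + 1);
            if (winner.containsKey(artifact)) continue;  // a nearer occurrence already won
            winner.put(artifact, version);
            queue.addAll(deps.getOrDefault(coord, Collections.emptyList()));
        }
        return winner;
    }
}
```

Note this is exactly the order-dependent, depth-dependent behavior discussed above: a shallow old version beats a deep new one.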
> That said, we could cache the pom files after the first build, since AFAIK the pom file for a given version of an artifact should not change.
This tool does not have a cache.
> One way to ensure consistency would be to use semantic versioning, e.g. "1.2.3" actually means [1.2.3, 2.0.0)?
1.2.3 actually means 1.2.3+ in Maven (see http://maven.apache.org/enforcer/enforcer-rules/versionRanges.html). `generate_workspace` currently doesn't respect that, but it should.
We cannot enforce semantic versioning on packages.
> The problem is I don't think you can make it to 2 without potentially downloading the universe.
I am in agreement on this front.
> Say we're depending on `foo:bar` at *any version* and there are 43 versions available, all with different sets of dependencies at different versions. We'd have to download them all and add them all to our equation before passing it to the SAT solver. That is a no-go, or at least a last resort.
Ah, I see. Thanks for the concrete example; that is the logical conclusion.

Returning to your suggestion of lazily determining versions: one route we could take is to leave `foo:bar` at *any version* unresolved. Then, if at any point we find a constrained version of `foo:bar`, e.g. `goo:car --> foo:bar:3.0.1`, we reduce the universe of possible `foo:bar` versions to 3.0.1+.
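A minimal sketch of that narrowing (hypothetical names, not existing generate_workspace code): each artifact starts fully unconstrained, and every observed requirement such as `foo:bar:3.0.1`, read as 3.0.1+ per Maven's soft-requirement semantics, can only raise the artifact's lower bound.

```java
import java.util.*;

// Lazily narrowed version universe: track only a lower bound per artifact.
class VersionUniverse {
    private final Map<String, String> lowerBound = new HashMap<>(); // artifact -> min version

    void require(String artifact, String version) {
        lowerBound.merge(artifact, version,
            (old, v) -> compare(v, old) > 0 ? v : old);  // keep the tighter bound
    }

    // True if 'version' is still admissible (no bound yet, or >= the bound).
    boolean admits(String artifact, String version) {
        String min = lowerBound.get(artifact);
        return min == null || compare(version, min) >= 0;
    }

    // Dotted numeric comparison, e.g. "1.10" > "1.9".
    static int compare(String a, String b) {
        String[] xs = a.split("\\."), ys = b.split("\\.");
        for (int i = 0; i < Math.max(xs.length, ys.length); i++) {
            int x = i < xs.length ? Integer.parseInt(xs[i]) : 0;
            int y = i < ys.length ? Integer.parseInt(ys[i]) : 0;
            if (x != y) return Integer.compare(x, y);
        }
        return 0;
    }
}
```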
> AFAICT Maven just uses "nearest wins" for this (https://maven.apache.org/plugins/maven-dependency-plugin/examples/resolving-conflicts-using-the-dependency-tree.html). I don't know about Eclipse, if they're doing something smart, feel free to look into it.
Interesting. According to this research paper circa 2013, "Maven, decided recently to use SAT technologies to resolve their package dependencies"
Conflicting info, but I'll lean towards trusting the official Maven docs.
> This tool does not have a cache.
Would it be a bad or wasteful move to add one?
> 1.2.3 actually means 1.2.3+ in Maven (see http://maven.apache.org/enforcer/enforcer-rules/versionRanges.html). `generate_workspace` currently doesn't respect that, but it should.
> We cannot enforce semantic versioning on packages.
Fair. I was trying to enforce things that I shouldn't.
> Returning to your suggestion of lazily determining versions: one route we could take is to leave `foo:bar` at *any version* unresolved. Then, if at any point we find a constrained version of `foo:bar`, e.g. `goo:car --> foo:bar:3.0.1`, we reduce the universe of possible `foo:bar` versions to 3.0.1+.
SGTM, feel free to start sending PRs :)
> Conflicting info, but I'll lean towards trusting the official Maven docs.
Interesting, maybe Maven was going to use it on the Maven Central servers or something. That would make more sense to me.
> This tool does not have a cache.
>
> Would it be a bad or wasteful move to add one?
I'm not against it per se, it's just that there are a lot of features currently missing and I'd rather focus on those than get it working well with a cache.
This issue persists despite #45.
Migrated from issue 1573 in main repository
If `generate_workspace` is run with artifacts that depend on a common artifact, it will choose the version from whichever it encounters first. It would be better (I think) if it instead chose the highest version. For example:
Generates:
But if we reverse the artifacts:
Now we get:
Choosing the highest would also ensure the output is the same regardless of which ordering is used for the artifacts.
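The order-independence argument amounts to this: "keep the highest version seen per artifact" is a max, and max is commutative, so the listing order cannot change the result. A small illustrative sketch (not generate_workspace code):

```java
import java.util.*;

// Toy "highest version wins" selection over a list of requested coordinates.
class HighestWins {
    // coords: each entry is {group, artifact, version}
    static Map<String, String> resolve(List<String[]> coords) {
        Map<String, String> chosen = new HashMap<>();  // "group:artifact" -> version
        for (String[] c : coords) {
            // merge keeps the higher of the existing and the incoming version
            chosen.merge(c[0] + ":" + c[1], c[2],
                         (a, b) -> compare(a, b) >= 0 ? a : b);
        }
        return chosen;
    }

    // Dotted numeric comparison, e.g. "1.10" > "1.9".
    static int compare(String a, String b) {
        String[] xs = a.split("\\."), ys = b.split("\\.");
        for (int i = 0; i < Math.max(xs.length, ys.length); i++) {
            int x = i < xs.length ? Integer.parseInt(xs[i]) : 0;
            int y = i < ys.length ? Integer.parseInt(ys[i]) : 0;
            if (x != y) return Integer.compare(x, y);
        }
        return 0;
    }
}
```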
For context, this is coming out of a desire to manage our dependencies with a simple list of maven artifacts, and then generate the `WORKSPACE` and `BUILD` files automatically. Currently, it seems like either we have to be careful with the ordering of our list of artifacts, or we have to modify the generated `WORKSPACE` after the fact.