bazelbuild / migration-tooling

Migration tools for Bazel
Apache License 2.0
45 stars 30 forks

generate_workspace should/could choose highest maven version number #16

Open petroseskinder opened 7 years ago

petroseskinder commented 7 years ago

Migrated from issue 1573 in main repository

If generate_workspace is run with artifacts that depend on a common artifact, it will choose the version from whichever artifact it encounters first. It would be better (I think) if it instead chose the highest version. For example:

% bazel run //src/tools/generate_workspace -- \
  --artifact=com.amazonaws:aws-java-sdk-core:jar:1.11.10 \
  --artifact=io.dropwizard:dropwizard-core:jar:1.0.0

Generates:

 # com.fasterxml.jackson.datatype:jackson-datatype-guava:bundle:2.7.6 wanted version 2.7.6
 # io.dropwizard:dropwizard-jackson:jar:1.0.0 wanted version 2.7.6
 # com.fasterxml.jackson.datatype:jackson-datatype-jsr310:bundle:2.7.6 wanted version 2.7.6
 # com.fasterxml.jackson.module:jackson-module-afterburner:bundle:2.7.6 wanted version 2.7.6
 # com.amazonaws:aws-java-sdk-core:jar:1.11.10
 # com.fasterxml.jackson.datatype:jackson-datatype-jdk8:bundle:2.7.6 wanted version 2.7.6
 maven_jar(
     name = "com_fasterxml_jackson_core_jackson_databind",
     artifact = "com.fasterxml.jackson.core:jackson-databind:2.6.6",
     sha1 = "5108dde6049374ba980b360e1ecff49847baba4a",
 )

But if we reverse the artifacts:

% bazel run //src/tools/generate_workspace -- \
  --artifact=io.dropwizard:dropwizard-core:jar:1.0.0 \
  --artifact=com.amazonaws:aws-java-sdk-core:jar:1.11.10

Now we get:

# com.fasterxml.jackson.datatype:jackson-datatype-guava:bundle:2.7.6
# com.fasterxml.jackson.module:jackson-module-afterburner:bundle:2.7.6
# com.amazonaws:aws-java-sdk-core:jar:1.11.10 wanted version 2.6.6
# io.dropwizard:dropwizard-jackson:jar:1.0.0
# com.fasterxml.jackson.datatype:jackson-datatype-joda:bundle:2.7.6
# com.fasterxml.jackson.datatype:jackson-datatype-jsr310:bundle:2.7.6
# com.fasterxml.jackson.datatype:jackson-datatype-jdk8:bundle:2.7.6
maven_jar(
    name = "com_fasterxml_jackson_core_jackson_databind",
    artifact = "com.fasterxml.jackson.core:jackson-databind:2.7.6",
    sha1 = "c7012a59b4f36843489753a2027e169c3b8586f8",
)

Choosing the highest would also ensure the output is the same regardless of which ordering is used for the artifacts.
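As a rough illustration of a "highest version wins" policy, versions could be compared component by component; a minimal sketch (helper names are hypothetical, qualifiers like `-SNAPSHOT` are ignored, and Maven's real version ordering is more involved):

```java
// Hypothetical helper: compare dotted numeric versions (e.g. "2.6.6" vs
// "2.7.6") and keep the highest one seen for each artifact.
public class HighestVersion {
    static int compare(String a, String b) {
        String[] pa = a.split("\\."), pb = b.split("\\.");
        int n = Math.max(pa.length, pb.length);
        for (int i = 0; i < n; i++) {
            int xa = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int xb = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (xa != xb) return Integer.compare(xa, xb);
        }
        return 0;
    }

    // Keep whichever of the two versions is higher.
    static String pick(String current, String candidate) {
        return compare(candidate, current) > 0 ? candidate : current;
    }

    public static void main(String[] args) {
        // Regardless of encounter order, jackson-databind resolves to 2.7.6.
        System.out.println(pick("2.6.6", "2.7.6")); // 2.7.6
        System.out.println(pick("2.7.6", "2.6.6")); // 2.7.6
    }
}
```

With a rule like this, both invocations above would emit jackson-databind 2.7.6.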

For context, this is coming out of a desire to manage our dependencies with a simple list of maven artifacts, and then generating the WORKSPACE and BUILD files automatically. Currently, it seems like either we have to be careful with the ordering of our list of artifacts, or we have to modify the generated WORKSPACE after the fact.

petroseskinder commented 7 years ago

The desire to enforce consistency/reproducibility through a policy is reasonable. However, I am skeptical whether taking a greedy approach (choosing the highest version whenever possible) is the best route to handle versioning conflicts.

The problem of finding an installable package configuration (i.e. one without any package conflicts) has been shown to be NP-complete, via a reduction from 3-SAT.

Greedy Approach

Greedy, as suggested above, will prematurely fail. If there is a versioning conflict (even if it's induced by its own selection), then it simply halts and spits out the conflict. This makes most sense when the occurrence of conflicts is low. However, with large code bases possessing messy transitive dependency graphs, versioning conflicts will be inevitable. From what I understand, Bazel is targeting these larger code bases.

Alternative Approaches

Alternatively, one can handle versioning conflicts by either (1) augmenting the greedy approach with backtracking, or (2) modeling the dependency graph as a satisfiability statement and using an external SAT solver. As for the pros and cons of each approach: performance-wise, the two are comparable when there are few conflicting constraints. With more conflicts, however, backtracking's performance degrades badly, and SAT solving is the clear winner. That said, backtracking is significantly more predictable: one can explicitly state its heuristics (e.g. prefer the most recent package) and then have clear expectations about what it will select. SAT solvers are more of a black box on this front.

The pip community had a similar discussion [0], as did the go community [1].

Between the backtracking approach and SAT solving, I am leaning towards using an external SAT solver like SAT4j to handle versioning conflicts. This is the approach used by OPAM (OCaml's package manager) [2], Dart [3], Maven, and Eclipse [4].

You're probably familiar with the problem. However, if you need a refresher, this article [5] gives a pretty good overview on the topic.

[0] pip community discussion [1] go community discussion [2] OPAM documentation on SAT solver [3] Dart documentation on versioning solver [4] Eclipse announcement of using SAT4j for handling external dependencies [5] Overview article on the problem

@hhclam, thoughts?

kchodorow commented 7 years ago

I'm concerned about this eagerly downloading an exponential # of pom files. How does that work with SAT4j? Can you hook in "callbacks" to fetch more poms when needed?

AFAICT, most Maven packages use "anything above X" for versions, so I think you could start by just tracking which versions packages actually request (e.g., 1.2.3+ vs. 1.2.3) and seeing whether that's satisfiable without more info. generate_workspace isn't even doing that yet.
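To make the tracking idea concrete, here is a purely illustrative sketch (none of this is generate_workspace code): each request is recorded as either a soft lower bound ("1.2.3" meaning 1.2.3+) or a hard pin, and satisfiability reduces to checking any pin against the highest lower bound.

```java
import java.util.*;

// Illustrative sketch: record each requested version as either a soft lower
// bound ("1.2.3" meaning 1.2.3+) or a hard pin, then check whether a single
// version can satisfy all requests for this artifact.
public class RequestTracker {
    final List<String> softMins = new ArrayList<>();    // "1.2.3+" style requests
    final Set<String> hardPins = new LinkedHashSet<>(); // exact-version requests

    static int cmp(String a, String b) {
        String[] pa = a.split("\\."), pb = b.split("\\.");
        for (int i = 0; i < Math.max(pa.length, pb.length); i++) {
            int xa = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int xb = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (xa != xb) return Integer.compare(xa, xb);
        }
        return 0;
    }

    void requestAtLeast(String v) { softMins.add(v); }
    void pin(String v) { hardPins.add(v); }

    // Returns a satisfying version, or null if the requests conflict.
    String resolve() {
        if (hardPins.size() > 1) return null; // two different exact pins
        String floor = "0";
        for (String v : softMins) if (cmp(v, floor) > 0) floor = v;
        if (hardPins.isEmpty()) return floor;
        String pin = hardPins.iterator().next();
        return cmp(pin, floor) >= 0 ? pin : null; // pin below a lower bound
    }

    public static void main(String[] args) {
        RequestTracker t = new RequestTracker();
        t.requestAtLeast("2.6.6"); // one artifact wants 2.6.6+
        t.requestAtLeast("2.7.6"); // another wants 2.7.6+
        System.out.println(t.resolve()); // 2.7.6, independent of request order
    }
}
```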

petroseskinder commented 7 years ago

I'm concerned about this eagerly downloading an exponential # of pom files.

That is a valid concern. We could end up downloading pom files unnecessarily. For example, if we had group:artifact:3.4 and group:artifact:3.5, we could fetch transitive dependencies for both artifacts.

How we would integrate solver

Before I respond to the rest of your comment, let me clarify what I had in my mind for how we integrate the solver. I imagine the sequence of steps would be to:

  1. construct the dependency graph from provided artifacts/pom files (including transitive dependencies)
  2. transform the dependency graph into a boolean satisfiability statement
  3. input the statement into SAT4j to find an assignment or state that none exists
  4. recover the specific packages from this assignment

The first step is similar to how we currently do it, with one major distinction. Currently, if we see the same artifact with multiple versions, e.g. group:artifact:3.4 and group:artifact:3.5, we only download the first version's pom.

  @Override
  public ModelSource resolveModel(String groupId, String artifactId, String version)
      throws UnresolvableModelException {
    String ruleName = Rule.name(groupId, artifactId);
    // Note: the version is not included in the key, so only the first
    // version of each group:artifact is ever resolved.
    if (ruleNameToModelSource.containsKey(ruleName)) {
      return ruleNameToModelSource.get(ruleName);
    }
    ...
  }

The approach I am proposing would download pom files for each version of an artifact we encounter. I agree this can definitely be overzealous with simpler projects.
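The change amounts to keying the cache on the full coordinate rather than group:artifact. A simplified sketch (types are stand-ins, not the real ModelSource plumbing):

```java
import java.util.*;

// Sketch of the proposed change: cache resolved models per full coordinate
// (group:artifact:version) rather than per group:artifact, so both 3.4 and
// 3.5 of an artifact get their poms fetched. Types simplified for illustration.
public class VersionedCache {
    final Map<String, String> coordToModel = new HashMap<>();

    String resolve(String groupId, String artifactId, String version) {
        // Unlike the current Rule.name(groupId, artifactId), the version
        // is part of the key.
        String key = groupId + ":" + artifactId + ":" + version;
        return coordToModel.computeIfAbsent(key, k -> "pom-for-" + k); // stand-in for a fetch
    }

    public static void main(String[] args) {
        VersionedCache c = new VersionedCache();
        // Both versions now resolve independently instead of the first one winning.
        System.out.println(c.resolve("group", "artifact", "3.4"));
        System.out.println(c.resolve("group", "artifact", "3.5"));
    }
}
```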

That said, we could cache the pom files after the first build, since AFAIK the pom file for a given version of an artifact should not change.

SAT4j callbacks and lazily fetching pom files

How does that work with SAT4j? Can you hook in "callbacks" to fetch more poms when needed?

From what I understand, SAT4j is a domain independent SAT solver. It operates on the satisfiability/constraint statement rather than our dependency graph.

If we lazily fetch pom files, we would be introducing new constraints mid-solve, and AFAIK SAT solvers operate on a static statement. That makes lazy fetching a no-go.

Maven and Eclipse

Both Maven and Eclipse use SAT4j to handle transitive dependencies, so I can investigate how they go about it.

"Anything above X" e.g. 1.2.3 --> 1.2.3+

AFAICT, most Maven packages use "anything above X" for version so I think you could start with just tracking what version packages are actually requesting (e.g., 1.2.3+ vs 1.2.3) and seeing if that's satisfiable without more info.

I see. A few comments on this front.

This would definitely reduce the size of our precomputed dependency graph. If we found a node for 1.2.5, we would not need to construct 1.2.3's dependency subgraph. However, here are some concerns:

  1. Our dependency graph would be highly variable, even more so than now. The graph would depend on the order in which we traverse the pom files, as well as on the dependencies within each pom file. For example, say our project depends on X:2.3 and b:1.2, and suppose X:2.3 depends on b:1.4. Depending on whether we traverse X:2.3 or b:1.2 first, the recorded version of b is either 1.4 or 1.2.
  2. More importantly, so long as we use a SAT solver, we would still have to precompute the entire graph before we identify satisfiability. This change would simply reduce the size of the graph.

Semantic Versioning

One way to ensure consistency would be to use semantic versioning, e.g. "1.2.3" actually meaning [1.2.3, 2.0.0). Maven packages are supposed to be versioned as MAJOR.MINOR.PATCH, where only MAJOR changes are breaking. In practice, however, a nontrivial chunk of MINOR (about a third) and PATCH (about a quarter) changes cause breaks.

If we trust in semver and aren't paranoid, then we could treat all packages versioned 1.x.x as equivalent: we just fetch the newest package before version 2.0.0. Again, this requires making a number of assumptions (or leaps of faith, for the more cynical) about how things are packaged.
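Under that optimistic reading, a request for "1.2.3" accepts anything in [1.2.3, 2.0.0), and resolution just keeps the newest candidate in the range. A sketch, assuming we do trust semver (all names illustrative):

```java
import java.util.*;

// Sketch of the optimistic semver reading: a request for "1.2.3" accepts any
// candidate in [1.2.3, 2.0.0), and we keep the newest one. Purely illustrative.
public class SemverRange {
    static int[] parse(String v) {
        String[] p = v.split("\\.");
        return new int[]{Integer.parseInt(p[0]), Integer.parseInt(p[1]), Integer.parseInt(p[2])};
    }

    static int cmp(int[] a, int[] b) {
        for (int i = 0; i < 3; i++) if (a[i] != b[i]) return Integer.compare(a[i], b[i]);
        return 0;
    }

    // Newest candidate in [request, nextMajor.0.0), or null if none fits.
    static String pick(String request, List<String> candidates) {
        int[] lo = parse(request);
        int[] hi = {lo[0] + 1, 0, 0}; // next MAJOR is assumed breaking
        String best = null;
        for (String c : candidates) {
            int[] v = parse(c);
            if (cmp(v, lo) >= 0 && cmp(v, hi) < 0 && (best == null || cmp(v, parse(best)) > 0))
                best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> avail = Arrays.asList("1.2.3", "1.9.0", "2.0.0");
        System.out.println(pick("1.2.3", avail)); // 1.9.0: newest before 2.0.0
    }
}
```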

cc: @hhclam

kchodorow commented 7 years ago
  1. construct the dependency graph from provided artifacts/pom files (including transitive dependencies)
  2. transform the dependency graph into a boolean satisfiability statement

The problem is that I don't think you can make it to step 2 without potentially downloading the universe.

Say we're depending on foo:bar:_any version_ and there are 43 versions available, all with different sets of dependencies at different versions. We'd have to download them all and add them all to our equation before passing it to the SAT solver. That is a no-go, or at least a last resort.

AFAICT Maven just uses "nearest wins" for this (https://maven.apache.org/plugins/maven-dependency-plugin/examples/resolving-conflicts-using-the-dependency-tree.html). I don't know about Eclipse, if they're doing something smart, feel free to look into it.

That said, we could cache the pom files after the first build, since AFAIK the pom file for a given version of an artifact should not change.

This tool does not have a cache.

One way to ensure consistency would be to use semantic versioning, e.g. "1.2.3" actually means [1.2.3, 2.0.0)?

1.2.3 actually means 1.2.3+ in Maven (see http://maven.apache.org/enforcer/enforcer-rules/versionRanges.html). generate_workspace currently doesn't respect that, but it should.

We cannot enforce semantic versioning on packages.

petroseskinder commented 7 years ago

The problem is I don't think you can make it to 2 without potentially downloading the universe.

I am in agreement on this front.

Say we're depending on foo:bar:_any version_ and there are 43 versions available, all with different sets of dependencies at different versions. We'd have to download them all and add them all to our equation before passing it to the SAT solver. That is a no-go, or at least a last resort.

Ah I see. Thanks for the concrete example. That would be a logical conclusion.

Returning to your suggestion of lazily determining versions: one potential route is to leave foo:bar:_any version_ unresolved. Then, if at any point we find a constrained version of foo:bar, e.g. goo:car --> foo:bar:3.0.1, we narrow the universe of possible foo:bar versions to 3.0.1+.
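The narrowing could be as simple as tracking a per-artifact lower bound that only ever rises; a minimal sketch (names illustrative, not generate_workspace code):

```java
import java.util.*;

// Sketch of the lazy narrowing idea: foo:bar starts unconstrained; each
// versioned edge we discover (e.g. goo:car -> foo:bar:3.0.1) raises the lower
// bound, so only poms at or above it ever need fetching.
public class LazyUniverse {
    final Map<String, String> lowerBound = new HashMap<>(); // artifact -> min version

    static int cmp(String a, String b) {
        String[] pa = a.split("\\."), pb = b.split("\\.");
        for (int i = 0; i < Math.max(pa.length, pb.length); i++) {
            int xa = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int xb = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (xa != xb) return Integer.compare(xa, xb);
        }
        return 0;
    }

    // Record a discovered constraint; the bound only ever rises.
    void constrain(String artifact, String version) {
        lowerBound.merge(artifact, version, (old, v) -> cmp(v, old) > 0 ? v : old);
    }

    // Skip fetching poms for versions below the current bound.
    boolean needsFetch(String artifact, String version) {
        String min = lowerBound.get(artifact);
        return min == null || cmp(version, min) >= 0;
    }

    public static void main(String[] args) {
        LazyUniverse u = new LazyUniverse();
        u.constrain("foo:bar", "3.0.1"); // discovered via goo:car's pom
        System.out.println(u.needsFetch("foo:bar", "2.9.0")); // false
        System.out.println(u.needsFetch("foo:bar", "3.1.0")); // true
    }
}
```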

AFAICT Maven just uses "nearest wins" for this (https://maven.apache.org/plugins/maven-dependency-plugin/examples/resolving-conflicts-using-the-dependency-tree.html). I don't know about Eclipse, if they're doing something smart, feel free to look into it.

Interesting. According to this research paper circa 2013, "Maven, decided recently to use SAT technologies to resolve their package dependencies".

Conflicting info, but I'll lean towards trusting the official maven docs.

This tool does not have a cache.

Would it be a bad or wasteful move to add one?

1.2.3 actually means 1.2.3+ in Maven (see http://maven.apache.org/enforcer/enforcer-rules/versionRanges.html). generate_workspace currently doesn't respect that, but it should.

We cannot enforce semantic versioning on packages.

Fair. I was trying to enforce things that I shouldn't.

kchodorow commented 7 years ago

Returning to your suggestion of lazily determining versions: one potential route is to leave foo:bar:any version unresolved. Then, if at any point we find a constrained version of foo:bar, e.g. goo:car --> foo:bar:3.0.1, we narrow the universe of possible foo:bar versions to 3.0.1+.

SGTM, feel free to start sending PRs :)

Conflicting info, but I'll lean towards trusting the official maven docs.

Interesting, maybe Maven was going to use it on the Maven Central servers or something. That would make more sense to me.

This tool does not have a cache. Would it be a bad or wasteful move to add one?

I'm not against it per se, it's just that there are a lot of features currently missing and I'd rather focus on those than get it working well with a cache.

petroseskinder commented 7 years ago

This issue persists despite #45.