Open sschuberth opened 7 years ago
So this is a very timely question of yours! Eventually the main schema definition for dependencies should be here but in practice its primary user is ScanCode. The initial schema is there https://github.com/nexB/scancode-toolkit/blob/865d24874fa39acaca40e690ea111efb5edaa8ff/src/packagedcode/models.py#L470 and is NOT something I want to further. In particular the data structure is a list of mappings keyed by dependency group which does not make sense.
Instead I think things should be much simpler: a list of deps where each dep has just a name and version. The version can be either a resolved version or a version constraint depending on the context.
Using a YAML representation this would then look like this:
dependencies:
  - group: test
    name: junit
    version: '>=3'
  - group: test
    name: dbunit
    version: '2.53'
The only other thing that may be needed is a flag that would indicate in a more explicit way whether a dependency is required or not... this would end up looking like this:
dependencies:
  - group: test
    name: junit
    required: yes
    version: '<3.8'
  - group: test
    name: dbunit
    required: no
    version: '2.53'
  - group: runtime
    name: apache-commons
    required: yes
    version: 1.0.2
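To make the flat shape concrete, here is a minimal Python sketch of the same three entries. The `Dependency` class and the `by_group` helper are hypothetical names used only for illustration, not part of ScanCode or ABCD. It assumes that grouping, when needed, can be derived from the flat list rather than stored as the top-level structure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Dependency:
    """One entry of the flat list; field names mirror the YAML keys above."""
    group: str
    name: str
    version: str  # a resolved version or a constraint, depending on context
    required: bool = True


def by_group(deps: List[Dependency]) -> Dict[str, List[Dependency]]:
    """Regroup the flat list on demand instead of storing it pre-grouped."""
    grouped: Dict[str, List[Dependency]] = {}
    for dep in deps:
        grouped.setdefault(dep.group, []).append(dep)
    return grouped


DEPENDENCIES = [
    Dependency("test", "junit", "<3.8", required=True),
    Dependency("test", "dbunit", "2.53", required=False),
    Dependency("runtime", "apache-commons", "1.0.2", required=True),
]

# Optional dependencies can be filtered without walking any nested mapping.
optional_names = [dep.name for dep in DEPENDENCIES if not dep.required]
```

Calling `by_group(DEPENDENCIES)` gives back a mapping keyed by group, so consumers that prefer the grouped view can still get it from the flat list.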
Some noteworthy things:
/cc @mnonnenmacher who's working on our trial-adoption of ABCD as an output format of what you would call dependentcode ;-)
@mnonnenmacher any feedback on your side?
Hi @pombredanne,
let me show you how we have interpreted the ABCD format.
Our dependency analyzer creates two object types: project and package. A project represents a software package including its resolved dependencies, while a package contains only information about the software package itself, not about its dependencies. We excluded dependencies from package because we are currently only interested in resolved dependencies, not declared ones, and most package managers automatically change versions of transitive dependencies or let users manipulate them. For example, Maven and Gradle allow excluding transitive dependencies and try to resolve version conflicts.
I'll use YAML for the examples below, because it is easier to read.
Package
An example for a random NPM project:
package_manager: "NPM"
namespace: ""
name: "yallist"
description: "Yet Another Linked List"
version: "2.1.2"
homepage_url: "https://github.com/isaacs/yallist#readme"
download_url: "https://registry.npmjs.org/yallist/-/yallist-2.1.2.tgz"
hash: "1c11f9218f076089a47dd512f93c6699a6a81d52"
hashAlgorithm: ""
vcs_path: ""
vcs_provider: "git"
vcs_url: "git+https://github.com/isaacs/yallist.git"
vcs_revision: "566cd4cd1e2ce57ffa84e295981cd9aa72319391"
namespace is what you call group above. We decided on the more generic namespace because group is only used in the Maven/Gradle world. It is used for all package managers that have this concept, currently for the group ID of Maven and Gradle and for the scope (like "@types") of NPM.
As a unique identifier to reference a package we use the quadruple of package_manager, namespace, name, and version. We have added package_manager to make it more unique, because in theory there could be e.g. NPM and Maven artifacts that share the same namespace, name, and version.
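As a rough illustration of why the fourth element matters, the quadruple can be modelled as a hashable tuple. This is only a sketch: `PackageId` is a hypothetical name, and the Maven entry is made up to show the collision case.

```python
from typing import NamedTuple


class PackageId(NamedTuple):
    """Hypothetical name for the identifying quadruple described above."""
    package_manager: str
    namespace: str
    name: str
    version: str


npm_pkg = PackageId("NPM", "", "yallist", "2.1.2")
# A made-up Maven artifact that happens to share namespace, name, and version.
maven_pkg = PackageId("Maven", "", "yallist", "2.1.2")

# Both can live in the same index because package_manager tells them apart.
index = {npm_pkg: "npm metadata", maven_pkg: "maven metadata"}
```

Without `package_manager`, the two keys would be equal and the second entry would silently overwrite the first.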
vcs_path is used to specify a subdirectory in the repository where the package is located, as sometimes repositories contain multiple independent projects.
The other properties should be self-explanatory.
Also we plan to add:
- source_download_url: Reference to a source package, e.g. source JARs for Java.
- source_hash: Hash for the field above.
- licenses: List of licenses found in the package.
Project
package_manager: "NPM"
namespace: ""
name: "jquery"
aliases: []
version: "3.2.2-pre"
vcs_path: ""
vcs_provider: "git"
vcs_url: "https://github.com/jquery/jquery.git"
vcs_revision: "7037facc2243ec24c2b36b770236c05d300aa513"
homepage_url: "https://jquery.com"
scopes: []
This is mostly the same as package; the main difference is the scopes property. It contains a tree of resolved dependencies for each scope, e.g. for NPM the scopes are "dependencies", "devDependencies", "peerDependencies", and so on. For Maven/Gradle this would be "compile", "test", and so on.
The scopes for jQuery look like this:
scopes:
  - name: "dependencies"
    delivered: true
    dependencies: []
  - name: "devDependencies"
    delivered: false
    dependencies:
      - name: "babel-core"
        namespace: ""
        version: "7.0.0-beta.0"
        package_hash: "843582d0de9181585dc2991573c3e165a89eaed4"
        dependencies:
          - name: "babel-code-frame"
            namespace: ""
            version: "7.0.0-beta.0"
            package_hash: "418a7b5f3f7dc9a4670e61b1158b4c5661bec98d"
            dependencies:
              - ...
The delivered property is a draft; it defines whether dependencies from this scope are included in the product (test dependencies usually are not). All elements in dependencies are package references, including the resolved transitive dependencies. Note that we don't need the package_manager for the reference here, as it is already defined by the project. In the beginning we used package_hash for the reference, but then decided on the triple of namespace, name, and version instead because of human readability, so this property will likely be removed.
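A consumer of this structure would typically walk the tree to collect the (namespace, name, version) references, optionally skipping scopes that are not delivered. The sketch below assumes the YAML has already been loaded into plain dicts and lists. `SCOPES` is a hand-written excerpt of the example above (without the elided subtree and the package_hash property), and `walk` and `references` are hypothetical helper names.

```python
from typing import Any, Dict, Iterator, List, Tuple

Ref = Tuple[str, str, str]  # (namespace, name, version)

# Hand-written excerpt in the shape of the scopes example, already parsed.
SCOPES: List[Dict[str, Any]] = [
    {"name": "dependencies", "delivered": True, "dependencies": []},
    {
        "name": "devDependencies",
        "delivered": False,
        "dependencies": [
            {
                "name": "babel-core",
                "namespace": "",
                "version": "7.0.0-beta.0",
                "dependencies": [
                    {
                        "name": "babel-code-frame",
                        "namespace": "",
                        "version": "7.0.0-beta.0",
                        "dependencies": [],
                    }
                ],
            }
        ],
    },
]


def walk(deps: List[Dict[str, Any]]) -> Iterator[Ref]:
    """Yield each package reference depth-first, parents before children."""
    for dep in deps:
        yield (dep["namespace"], dep["name"], dep["version"])
        yield from walk(dep.get("dependencies", []))


def references(scopes: List[Dict[str, Any]], delivered_only: bool = False) -> List[Ref]:
    """Collect references from all scopes, or only from delivered ones."""
    refs: List[Ref] = []
    for scope in scopes:
        if delivered_only and not scope["delivered"]:
            continue
        refs.extend(walk(scope["dependencies"]))
    return refs
```

With this excerpt, `references(SCOPES, delivered_only=True)` is empty, because the only populated scope is devDependencies and it is marked as not delivered.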
I have put a full example for JQuery on a Gist: https://gist.github.com/mnonnenmacher/c0dcb5a41ba3d1b646c6425dd2edfce9
It would be great to get some feedback from you about our usage of ABCD, and how close it is to what you have envisioned for the format.
@mnonnenmacher this is very well done! It matches the vision very well. I may come back with a few comments on the details of certain names that we could refine together.
Advantages of my Approach
Flat Structure: By keeping dependencies flat, you avoid the complexity of nested structures. This makes it easier to process and understand the dependencies in a linear fashion. It also simplifies data handling and manipulation.
Consistent Representation: Treating resolved and unresolved dependencies the same way, with the only difference being in the context (e.g., version constraints vs. resolved versions), ensures that your data structure is consistent and easier to work with.
Clear Indication of Requirement: Adding a required flag makes it explicit whether a dependency is essential or optional. This can help with prioritizing dependencies and managing their installation or resolution.
Simplified Schema: Your schema’s simplicity ensures that it’s easy to generate, read, and maintain. It avoids unnecessary complexity and keeps the focus on the core attributes: name, version, group, and requirement status.
Example YAML Structure
Here’s a refined version of your YAML schema based on your description:
dependencies:
Dependency Groups: The group attribute is useful for categorizing dependencies. Make sure that the groups are well-defined and that your system correctly handles dependencies across different groups.
Flag Handling: The required flag should be consistently interpreted in your system. Decide whether yes and no are the best values, or if a boolean (true/false) might be simpler.
Extensibility: Think about whether you might need additional attributes in the future. If you do, ensure your schema can be extended without disrupting existing data.
Implementation Considerations
Data Validation: Implement validation to ensure that each dependency has valid attributes and that the versions and constraints are correctly formatted.
Integration with Tools: Ensure that this schema integrates well with existing tools like ScanCode. If you’re moving away from an existing schema, consider how you’ll transition or map the old schema to the new one.
Documentation: Provide clear documentation for how to use this schema, including examples and explanations of each attribute. This will help users understand and implement it correctly.
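The data-validation and flag-handling points above could be sketched roughly as follows. This is only an illustration under assumptions: `parse_required` and `validate` are hypothetical names, the accepted spellings of the flag are a guess, and the entries are assumed to be plain dicts produced by a YAML loader.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("group", "name", "version")


def parse_required(value: Any) -> bool:
    """Accept a real boolean or an assumed yes/no/true/false spelling."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.lower() in ("yes", "true"):
        return True
    if isinstance(value, str) and value.lower() in ("no", "false"):
        return False
    raise ValueError(f"required: expected yes/no or a boolean, got {value!r}")


def validate(dep: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one dependency entry."""
    errors: List[str] = []
    for key in REQUIRED_KEYS:
        value = dep.get(key)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{key}: expected a non-empty string")
    if "required" in dep:
        try:
            parse_required(dep["required"])
        except ValueError as exc:
            errors.append(str(exc))
    return errors
```

One practical reason to check that version is a string: an unquoted value such as 2.53 is usually loaded from YAML as a float, which loses the distinction between, say, 2.50 and 2.5.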
Your proposed schema is a solid step towards a more manageable and understandable representation of dependencies. It should facilitate easier manipulation and integration of dependency data in various contexts.
When describing software packages, like Java libraries, it's quite essential to also capture any dependencies / relationships between packages. The current ABCD spec seems to be quite loosely defined in this regard. A bit too loose for my taste probably. Could you give a concrete example how e.g. the dependency of mockito-core on junit would be represented in ABCD in YAML format?