clearlydefined / crawler

A service that crawls projects and packages for information relevant to ClearlyDefined
MIT License

conda crawler implementation #532

Closed lamarrr closed 2 weeks ago

lamarrr commented 6 months ago

closes #535

This pull request tracks the work on implementing the crawler for conda source packages.

Background

Conda exposes packages in a different format from other Python repositories such as PyPI. A conda environment is locked to a specific Python version, and the packages in a given version of a channel are pinned to specific versions. Because the packages are managed for compatibility, they do not break due to one incompatibility or another, similar to how you'd ship a Docker container. The primary consumption point is the packages themselves. They are accompanied by scripts that modify the environment and set up the packages and their dependencies, and these are consumed by the setup application. The packages may also contain DLLs, scripts, compiled Python bytecode (.pyc), Python code, etc. The structure of conda repositories and their indexing process is described here: https://docs.conda.io/projects/conda-build/en/stable/concepts/generating-index.html

Conda has three main channels: anaconda-main, anaconda-r, and conda-forge, which is more geared toward business uses.

We crawl both the packages and the source code (not always specified) for the licensing metadata and other metadata about the package.

The source from which a conda package is created is often, but not always, provided via a URL that links to a compressed source file hosted externally, sometimes on GitHub and sometimes on another website. Note that this is a file and not a git repository. The main conda package is hosted on the conda channels themselves. It is compressed and contains the licensing information, compilers, environment configuration scripts, dependencies, etc. that are needed to make the package work.

The crawler uses coordinates with the following syntax:

type: conda | condasource
provider: conda-forge | anaconda-main | anaconda-r
namespace: ${architecture}
name: any
revision:  (${version} |  )-(${buildversion} |  )

e.g.

conda/conda-forge/linux-aarch64/numpy/1.13.0
condasource/conda-forge/linux-aarch64/numpy/1.13.0
conda/conda-forge/-/numpy/1.13.0/
conda/conda-forge/linux-aarch64/numpy/-py36

where:

type (required): conda or condasource
namespace (optional): architecture and OS of the package to be crawled, e.g. win64 or linux-aarch64. If no architecture is specified, any architecture is chosen.
package name: name of the package
provider (required): channel from which the package will be crawled: conda-forge, anaconda-main, or anaconda-r
revision (optional): package version and optional build version, e.g. 0.3.0 or 0.3.0-py36hffe2fc. For the conda coordinate type, the build version is usually a conda-specific representation of the build tools, the environment configuration, and the build iteration of the package; for a Python 3.9 environment, this could be py39H443E. If no revision is specified, the latest one is selected using the package's timestamp.
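
To make the syntax above concrete, here is a minimal Python sketch of how such a coordinate string could be split into its parts. It is only an illustration, not the crawler's actual code (the crawler itself is not written in Python); the names CondaCoordinate and parse_coordinate are hypothetical, and treating "-" as "no namespace" follows the examples above.

```python
from typing import NamedTuple, Optional

# Values taken from the syntax described in this pull request.
TYPES = {"conda", "condasource"}
PROVIDERS = {"conda-forge", "anaconda-main", "anaconda-r"}


class CondaCoordinate(NamedTuple):  # hypothetical name, for illustration only
    type: str
    provider: str
    namespace: Optional[str]  # architecture/OS; None means "any"
    name: str
    version: Optional[str]    # None means "not specified"
    build: Optional[str]      # None means "not specified"


def parse_coordinate(text: str) -> CondaCoordinate:
    """Split 'type/provider/namespace/name[/revision]' into its parts."""
    parts = text.strip("/").split("/")
    if len(parts) not in (4, 5):
        raise ValueError(f"expected 4 or 5 segments, got {len(parts)}: {text!r}")
    type_, provider, namespace, name = parts[:4]
    revision = parts[4] if len(parts) == 5 else ""
    if type_ not in TYPES:
        raise ValueError(f"unknown type: {type_!r}")
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    # The revision is '<version>-<build>'; either side may be empty.
    version, _, build = revision.partition("-")
    return CondaCoordinate(
        type_,
        provider,
        None if namespace == "-" else namespace,
        name,
        version or None,
        build or None,
    )
```

For example, parse_coordinate("conda/conda-forge/linux-aarch64/numpy/-py36") yields a coordinate with no version and the build py36.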

Conda-forge is a community effort and packages are published by opening PRs on their GitHub repository as described here https://conda-forge.org/docs/maintainer/adding_pkgs.html

qtomlinson commented 4 months ago

It is exciting to see that a new harvester is being implemented! This pull request provides a solid foundation for future enhancements. A discussion is needed on the proposed coordinates, e.g. conda/conda-forge/-/numpy/1.13.0_linux-aarch64/py36. In the above proposal, toolVersion is used to represent the build (string) in conda. Points to consider:

One possible alternative is to mirror the package search standard specification in the CD's coordinates.

The mapping would be as follows:

channel -> provider
subdir -> namespace
name -> name
`${version}-${build}` -> revision 

Both version and build may not contain "-" (see https://conda.io/projects/conda/en/latest/user-guide/concepts/pkg-specs.html#info-index-json). So using "-" as separator works here.
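
As a rough sketch of this mapping in Python: the function names and the sample record below are hypothetical, and the record fields (name, version, build, subdir) are assumed to be shaped like the entries of a package's index.json. Because neither version nor build contains "-", splitting the revision on "-" recovers both parts, even for a version that contains underscores.

```python
from typing import Dict, Tuple


def to_coordinate(channel: str, record: Dict[str, str], type_: str = "conda") -> str:
    """Map channel -> provider, subdir -> namespace, name -> name,
    and '<version>-<build>' -> revision."""
    revision = f"{record['version']}-{record['build']}"
    return "/".join([type_, channel, record["subdir"], record["name"], revision])


def split_revision(revision: str) -> Tuple[str, str]:
    """Recover (version, build); valid only while neither part contains '-'."""
    version, build = revision.split("-")
    return version, build


# Hypothetical record, made up for illustration.
record = {"name": "numpy", "version": "1.13.0", "build": "py36_0", "subdir": "linux-aarch64"}
coordinate = to_coordinate("conda-forge", record)
```

Here coordinate is "conda/conda-forge/linux-aarch64/numpy/1.13.0-py36_0", and split_revision("1.30.0_2018_09_30-0") returns ("1.30.0_2018_09_30", "0").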

Additionally, when the architecture/platform is not specified, should 'noarch' be considered the default value (https://docs.anaconda.com/anaconda-repository/user-guide/tasks/pkgs/use-noarch-pkgs/)? This would likely produce more predictable results.

@elrayle @capfei @jeffwilcox @mpcen @pombredanne @bduranc I am not particularly knowledgeable in the conda ecosystem, and would very much appreciate other experts' input.

bduranc commented 4 months ago

I'm okay with both proposals. Qing's is in line with Conda's own standards, which I assume are applicable to all three channels.

I also assume the type coordinate in both proposals would still be "conda" or "condasource". Also, if there is a GH or GitLab source location present in the channel / package metadata, is the intent also to populate the same in the definition?

In the past, I believe we've tried our best to keep new provider coordinate formats consistent with the others (for example, when we added Debian/debsrc support a few years ago, I believe that took some influence from the Maven implementation). It would be best to follow that practice as much as we can here too.

lamarrr commented 4 months ago

toolVersion is already used internally by ClearlyDefined as the harvest tool versioning. Using spec.toolVersion will cause conflicts.

I had the impression that toolVersion referred to the tool the package was built with, not to the tooling that scans the licensing info. I feel that would have been less ambiguous if it were at the beginning of the package coordinate.

Adding /py36 adds complexity to service APIs. For instance, harvest data api expects /harvest/{type}/{provider}/{namespace}/{name}/{revision}/{tool}.

Agreed, I'll make it an optional parameter appended to the revision instead

Concatenating version and architecture with _ does not handle versions that contain _, e.g. "version": "1.30.0_2018_09_30"

Agreed, but I couldn't find a better delimiter to use. I performed a regex search on some of the channels and none of them had that kind of versioning (with '_' in it). It's always semantic versioning (numbers and hyphens only, with alpha/beta; see https://semver.org/).

The new revision can be {architecture}--{version}-{build}.

since architecture can be linux-64.

I feel noarch isn't the right default. noarch is for platform-agnostic packages, which may or may not be present. For example, 7zip isn't platform-agnostic; it is architecture- and OS-dependent, so it is not in the noarch list, and fetching it without specifying the architecture would fail. We presently select randomly from any architecture the package is available on (just as is done in the debian fetcher), which I feel is a much more reliable method than using noarch by default. It might be better to make the subdir (architecture and OS) required than to default to noarch.

As an aside, subdir isn't really a namespace; it's just a folder grouping of the packages by architecture and OS (i.e. linux x64 packages -> /linux-64, windows x64 packages -> /windows-64).
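
The selection behaviour described in this thread (pick any architecture the package is available on, then take the latest build by timestamp) could be sketched in Python roughly as follows. This is not the crawler's code: pick_package and the index layout (subdir mapped to a list of records with name, version, build, and timestamp fields) are assumptions made for illustration, as is matching the build as a prefix.

```python
import random
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def pick_package(
    index: Dict[str, List[Record]],  # assumed layout: subdir -> package records
    name: str,
    subdir: Optional[str] = None,
    version: Optional[str] = None,
    build: Optional[str] = None,
    rng: Any = random,               # injectable so the choice can be seeded
) -> Optional[Tuple[str, Record]]:
    """Return (subdir, record) for the newest matching build, or None."""

    def matches(record: Record) -> bool:
        return (
            record["name"] == name
            and (version is None or record["version"] == version)
            # Assumption: a requested build such as 'py36' matches as a prefix.
            and (build is None or record["build"].startswith(build))
        )

    candidates: Dict[str, List[Record]] = {}
    for current_subdir, records in index.items():
        if subdir is not None and current_subdir != subdir:
            continue
        matching = [r for r in records if matches(r)]
        if matching:
            candidates[current_subdir] = matching
    if not candidates:
        return None
    # No subdir requested: choose arbitrarily among those that carry the package.
    chosen = subdir if subdir is not None else rng.choice(sorted(candidates))
    return chosen, max(candidates[chosen], key=lambda r: r["timestamp"])
```

With this approach, a package that exists only under linux-64 is still found when no subdir is given, whereas a lookup restricted to noarch returns nothing.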

lamarrr commented 4 months ago

I also assume the type coordinate in both proposals would still be "conda" or "condasource". Also, if there is a GH or GitLab source location present in the channel / package metadata, is the intent also to populate the same in the definition?

I don't get you, what definition? If it is a condasource type, it is sourced from whatever source URL or source git URL (git is NOT always the source) is provided in the package's channel index. If it is a conda type (architecture-dependent), it is sourced from conda's server source URL.

bduranc commented 4 months ago

I also assume the type coordinate in both proposals would still be "conda" or "condasource". Also, if there is a GH or GitLab source location present in the channel / package metadata, is the intent also to populate the same in the definition?

I don't get you, what definition? If it is a condasource type, it is sourced from whatever source URL or source git URL (git is NOT always the source) is provided in the package's channel index. If it is a conda type (architecture-dependent), it is sourced from conda's server source URL.

Basically, a "definition" == component in ClearlyDefined.

Using Maven as an example: https://clearlydefined.io/definitions/maven/mavencentral/com.googlecode.openbox/maventools/2.0.1

and its corresponding "source" definition (Maven sourcearchive): https://clearlydefined.io/definitions/sourcearchive/mavencentral/com.googlecode.openbox/maventools/2.0.1

Or another example that has a GitHub repo as its source location field instead of the Maven sourcearchive: https://clearlydefined.io/definitions/maven/mavencentral/io.eliez/mavenJava/2.0.1

lamarrr commented 4 months ago

I have now changed the delimiters and coordinate specification to:

{type: conda|condasource}/{provider: anaconda-main|anaconda-r|conda-forge}/-/{package name}/[{architecture | _}:][{version | _}]-[{build version | _}]/[{tool version}]
conda/conda-forge/-/numpy/linux-aarch64:1.13.0-py36/ - complete coordinate
conda/conda-forge/-/numpy/-py36/ -- any version with build hash py36*
conda/conda-forge/-/numpy/1.13.0-py36/ -- version with build hash
conda/conda-forge/-/numpy/linux-aarch64:_-py36/ -- architecture and build hash
conda/conda-forge/-/numpy/linux-aarch64:1.13.0/ -- architecture and version
conda/conda-forge/-/numpy/ - any
conda/conda-forge/-/numpy/_:_-_ - any

lamarrr commented 4 months ago

I have now changed the delimiters and coordinate specification to:

{type: conda|condasource}/{provider: anaconda-main|anaconda-r|conda-forge}/-/{package name}/[{architecture | _}:][{version | _}]-[{build version | _}]/[{tool version}]
conda/conda-forge/-/numpy/linux-aarch64:1.13.0-py36/ - complete coordinate
conda/conda-forge/-/numpy/-py36/ -- any version with build hash py36*
conda/conda-forge/-/numpy/1.13.0-py36/ -- version with build hash
conda/conda-forge/-/numpy/linux-aarch64:_-py36/ -- architecture and build hash
conda/conda-forge/-/numpy/linux-aarch64:1.13.0/ -- architecture and version
conda/conda-forge/-/numpy/ - any
conda/conda-forge/-/numpy/_:_-_ - any

It seems the file indexer builds file paths directly from the coordinate spec, meaning that coordinates containing : might not work. I have changed the spec to this:

{type: conda|condasource}/{provider: anaconda-main|anaconda-r|conda-forge}/-/{package name}/[{architecture | _}--][{version | _}]-[{build version | _}]/[{tool version}]
conda/conda-forge/-/numpy/linux-aarch64--1.13.0-py36/ - complete coordinate
conda/conda-forge/-/numpy/-py36/ -- any version with build hash py36*
conda/conda-forge/-/numpy/1.13.0-py36/ -- version with build hash
conda/conda-forge/-/numpy/linux-aarch64--_-py36/ -- architecture and build hash
conda/conda-forge/-/numpy/linux-aarch64--1.13.0/ -- architecture and version
conda/conda-forge/-/numpy/ - any
conda/conda-forge/-/numpy/_--_-_ - any
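
For reference, the revision segment of this final spec can be parsed along the following lines. This is a hedged Python sketch rather than the code in this pull request; Revision and parse_revision are hypothetical names. It relies on "--" separating the architecture from the rest (since an architecture such as linux-64 contains a single "-") and treats both an empty part and "_" as "any".

```python
from typing import NamedTuple, Optional


class Revision(NamedTuple):  # hypothetical name, for illustration only
    architecture: Optional[str]
    version: Optional[str]
    build: Optional[str]


def _value(part: str) -> Optional[str]:
    """An empty part and the '_' placeholder both mean 'any'."""
    return None if part in ("", "_") else part


def parse_revision(revision: str) -> Revision:
    """Parse '[{architecture | _}--][{version | _}]-[{build version | _}]'."""
    if "--" in revision:
        # Split on the first '--' only; the architecture may contain '-'.
        architecture, rest = revision.split("--", 1)
    else:
        architecture, rest = "", revision
    version, _, build = rest.partition("-")
    return Revision(_value(architecture), _value(version), _value(build))
```

For instance, parse_revision("linux-aarch64--_-py36") gives the architecture linux-aarch64, no version, and the build py36, while parse_revision("_--_-_") and parse_revision("") both mean "any".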

capfei commented 3 months ago

I don't have much knowledge of the crawler or conda, but I was able to get it running locally. This looks good to me, and I'm fine with the naming convention.