clearlydefined / crawler

A service that crawls projects and packages for information relevant to ClearlyDefined
MIT License

Conda Crawler Support #535

Closed lamarrr closed 2 weeks ago

lamarrr commented 5 months ago

Background

Conda exposes packages in a different format from other Python repositories such as PyPI. A conda environment is locked to a specific Python version, and each channel manages packages pinned to specific versions so that they stay compatible with one another, similar to how you would ship a Docker container. The primary unit of consumption is the package itself, which is accompanied by scripts that modify the environment and set up the package and its dependencies; these are then consumed by the setup application. Packages may also contain DLLs, scripts, compiled Python files (.pyc), Python code, etc. The structure of conda repositories and their indexing process are described here: https://docs.conda.io/projects/conda-build/en/stable/concepts/generating-index.html

Conda has three main channels: anaconda-main and anaconda-r, which are more geared towards business use, and conda-forge, which is community-maintained.

We crawl both the packages and the source code (not always specified) for the licensing metadata and other metadata about the package.

The source from which a conda package is built is often, but not always, provided via a URL that links to a compressed source file hosted externally, sometimes on GitHub and sometimes on another website. Note that this is a file and not a git repository. The main conda package is hosted on the conda channel itself; it is compressed and contains the licensing information, compilers, environment configuration scripts, dependencies, etc. that are needed to make the package work.

The crawler uses coordinates with the following syntax:

type: conda | condasource
provider: conda-forge | anaconda-main | anaconda-r
namespace: -
name: any
revision: (((${version}|-)_(${architecture}|-))|-)
toolVersion: (${toolVersion}|-)

For example:

conda/conda-forge/-/numpy/1.13.0_linux-aarch64/py36
condasource/conda-forge/-/numpy/_
conda/conda-forge/-/numpy/-/py36
conda/conda-forge/-/numpy/1.13.0_/py36
conda/conda-forge/-/numpy/_linux-aarch64/py36
conda/anaconda-main/-/numpy/_/py27
conda/anaconda-main/-/numpy/_/-

where:

type (required): conda or condasource.
package name: name of the package.
provider (required): channel from which the package is to be crawled: conda-forge, anaconda-main, or anaconda-r.
revision (optional): package version and architecture, e.g. 0.3.0_win64. For the conda type, if no architecture is specified, any architecture is chosen. condasource packages do not need the architecture part of the revision, as they are not architecture-specific.
toolVersion (optional): the build version of the package. This is usually a conda-specific representation of the build tools, environment configuration, and build iteration of the package. For example, for a Python 3.9 environment this could be py39H443E. If none is specified, the latest one is selected using lexicographical order.
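
To make the coordinate syntax above concrete, here is a minimal Python sketch of how such a string could be split into its fields. It is an illustration only, not the crawler's actual code: the names `CondaCoordinates`, `parse_coordinates`, and `select_tool_version` are hypothetical, and it assumes that an empty segment means the same as `-`, that trailing segments may be omitted, and that the revision is split at its last underscore (architecture names are assumed to contain none).

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Values taken from the coordinate syntax described in this issue.
VALID_TYPES = {"conda", "condasource"}
VALID_PROVIDERS = {"conda-forge", "anaconda-main", "anaconda-r"}


def _optional(value: str) -> Optional[str]:
    """Assumption: both '-' and an empty string mean 'not specified'."""
    return None if value in ("", "-") else value


@dataclass(frozen=True)
class CondaCoordinates:  # hypothetical container, not part of the crawler
    type: str
    provider: str
    name: str
    version: Optional[str] = None
    architecture: Optional[str] = None
    tool_version: Optional[str] = None


def parse_coordinates(text: str) -> CondaCoordinates:
    """Split 'type/provider/namespace/name[/revision[/toolVersion]]'."""
    parts = text.split("/")
    if not 4 <= len(parts) <= 6:
        raise ValueError(f"expected 4 to 6 segments, got {len(parts)}: {text!r}")
    # Assumption: omitted trailing segments behave like '-'.
    parts += ["-"] * (6 - len(parts))
    pkg_type, provider, _namespace, name, revision, tool_version = parts

    if pkg_type not in VALID_TYPES:
        raise ValueError(f"unknown type: {pkg_type!r}")
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")

    # Assumption: split the revision at the LAST underscore.
    if "_" in revision:
        version, _, architecture = revision.rpartition("_")
    else:
        version, architecture = revision, "-"

    return CondaCoordinates(
        type=pkg_type,
        provider=provider,
        name=name,
        version=_optional(version),
        architecture=_optional(architecture),
        tool_version=_optional(tool_version),
    )


def select_tool_version(requested: Optional[str], available: Iterable[str]) -> str:
    """Return the requested build, or the lexicographically greatest one."""
    if requested is not None:
        return requested
    candidates = list(available)
    if not candidates:
        raise ValueError("no builds available")
    return max(candidates)  # plain string comparison


coords = parse_coordinates("conda/conda-forge/-/numpy/1.13.0_linux-aarch64/py36")
print(coords)
```

Note that plain lexicographical ordering compares strings character by character, so a build string starting with `py39` sorts after one starting with `py310`.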

Conda-forge is a community effort, and packages are published by opening PRs on its GitHub repository, as described here: https://conda-forge.org/docs/maintainer/adding_pkgs.html