This is the service side of clearlydefined.io. The service mainly manages curations, that is, human inputs and corrections to harvested data. The ClearlyDefined crawler does the bulk of the harvesting, so here we manage the open source, crowd-sourced part of ClearlyDefined. Users come together to add data, review data, and propose upstream changes to clarify the state of a project.
Like other open source projects, ClearlyDefined works with contributions coming as pull requests on a GitHub repo. In our case, curations are data changes contributed to the ClearlyDefined curated-data repo. Those PRs are reviewed, discussed, and ultimately merged into the curation repo. From there, this service builds a database that merges automatically harvested data with the newly curated data and makes it available via REST APIs.
In effect, the curated data for a project is a fork of the project. As with most forks, we don't want to maintain changes long term, because they quickly rot and need constant care and attention. Besides, the stated goal of ClearlyDefined is to help projects become more successful through clear data about the projects. The best way to do that is to work with the upstream projects to include the data directly in the projects themselves.
You should not need to run this service yourself unless you are working on it. Rather, you can use https://dev-api.clearlydefined.io for experimental work or https://api.clearlydefined.io for working with production data.
If you do want to run the service locally, follow these steps.
The quickest way to get a fully functional local ClearlyDefined set up (including the service) is to use the Dockerized ClearlyDefined environment setup. This runs all services locally and does not require access to the ClearlyDefined Azure account.
Some parts of this set up may require access to the ClearlyDefined Azure Account.
Copy the minimal.env.json file to the parent directory of the repo, rename it to env.json, and set any property values you need. See below for simple, local setup and the Configuration section for more details. If this repo is colocated with the other ClearlyDefined repos, you can share the env.json file. Just merge the templates. Any colliding property names are meant to be shared.
cd to the repo dir and run npm install
Run npm start
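As a rough illustration of "just merge the templates", the sketch below combines two env templates into one shared object and lists any colliding properties whose values differ, so they can be reconciled by hand. Only FILE_STORE_LOCATION comes from this README; the other property names and all values are hypothetical placeholders, and merging by hand in an editor works just as well.

```python
import json


def merge_env_templates(*templates):
    """Merge env templates into one dict; later templates win on collisions.

    Returns the merged dict plus a sorted list of keys that appeared in more
    than one template with different values (these need a manual decision).
    """
    merged = {}
    conflicts = set()
    for template in templates:
        for key, value in template.items():
            if key in merged and merged[key] != value:
                conflicts.add(key)
            merged[key] = value
    return merged, sorted(conflicts)


# Hypothetical template contents, for illustration only.
service_env = {
    "FILE_STORE_LOCATION": "/data/harvested-data",
    "EXAMPLE_SERVICE_ONLY": "a",
}
crawler_env = {
    "FILE_STORE_LOCATION": "/data/harvested-data",
    "EXAMPLE_CRAWLER_ONLY": "b",
}

merged, conflicts = merge_env_templates(service_env, crawler_env)
print(json.dumps(merged, indent=2))
print("needs review:", conflicts)
```

Colliding names are meant to be shared, so a non-empty conflict list usually means one of the templates carries a stale value.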
That starts the ClearlyDefined service and has it listening for RESTful interaction at http://localhost:4000. See the Configuration section for info on how to change the port. The REST APIs are (partially) described in the Swagger at http://localhost:4000/api-docs.
You may want to get the sample data. Clone the harvested-data repo and adjust the FILE_STORE_LOCATION setting in your env.json to point to the data repo's location.
TBD
Configuration properties can be found at:
This project welcomes contributions and suggestions, and we've documented the details in the contribution policy.
The Code of Conduct for this project details how the community interacts in an inclusive and respectful manner. Please keep it in mind as you engage here.
package:
  type: string
  name: string
  provider: string
  revision: string
source_location:
  provider: string
  url: string
  revision: string
  path: string
copyright:
  statements: string[]
  holders: string[]
  authors: string[]
license:
  expression: string
TODO
{
  "source_location": {
    "provider": "",
    "url": "",
    "revision": "",
    "path": ""
  },
  "copyright": {
    "statements": [],
    "holders": [],
    "authors": []
  },
  "license": {
    "expression": ""
  }
}
Because this is a PATCH, you only need to provide the attributes you want to add or update. Any attributes not included are left unchanged. To explicitly remove an attribute, set its value to null.
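The PATCH semantics described above can be sketched as follows. This is a minimal illustration in the spirit of JSON Merge Patch, not the service's actual merge code; the function name apply_curation_patch and the sample values are assumptions made up for the example.

```python
import copy


def apply_curation_patch(current, patch):
    """Return a new dict with `patch` applied on top of `current`.

    - Attributes missing from the patch are left untouched.
    - Attributes set to None (JSON null) are removed.
    - Nested objects are patched recursively; other values are replaced.
    """
    result = copy.deepcopy(current)
    for key, value in patch.items():
        if value is None:
            # Explicit null: remove the attribute if it is present.
            result.pop(key, None)
        elif isinstance(value, dict):
            # Patch into the existing object, or start from an empty one.
            base = result.get(key) if isinstance(result.get(key), dict) else {}
            result[key] = apply_curation_patch(base, value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical curated data and patch, for illustration only.
current = {
    "source_location": {
        "provider": "github",
        "url": "https://github.com/example/example",
        "revision": "abc123",
        "path": "",
    },
    "license": {"expression": "MIT"},
}
patch = {
    "license": {"expression": "Apache-2.0"},  # update one attribute
    "source_location": {"path": None},        # remove one attribute
}

updated = apply_curation_patch(current, patch)
```

Here the license expression is replaced, path is dropped from source_location, and every attribute the patch does not mention is carried over unchanged.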
TODO: Make sure the attribute names are consistent with AboutCode/ScanCode
TODO: Include a section where the author's identity and reasoning is provided
TODO
Curation patches will be stored in: https://github.com/clearlydefined/curated-data
type (npm)
provider (npmjs.org)
name.yaml (redie)
Note that the package name may contain a namespace portion. If it does, the namespace becomes a directory under provider, and packageName.yaml is stored in the namespace directory. For example, a scoped NPM package would have a directory for the scope under provider, with packageName.yaml in the scope directory. Similarly, for Maven, the groupId would be the namespace and the artifactId would be the packageName.
type (git)
provider (github.com)
namespace (Microsoft)
name.yaml (redie)
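The two layouts above can be expressed as a small path builder. This is a sketch under the assumption that the path segments are simply joined with "/" relative to the root of the curated-data repo; the function name curation_path is hypothetical and not part of the service.

```python
from pathlib import PurePosixPath


def curation_path(type_, provider, name, namespace=None):
    """Build the repo-relative path of a curation file.

    Layout: type/provider/[namespace/]name.yaml
    The namespace directory is only present when the package has one.
    """
    parts = [type_, provider]
    if namespace:
        parts.append(namespace)
    parts.append(f"{name}.yaml")
    return str(PurePosixPath(*parts))


# The two examples from the listings above.
print(curation_path("npm", "npmjs.org", "redie"))
print(curation_path("git", "github.com", "redie", namespace="Microsoft"))
```

The first call yields npm/npmjs.org/redie.yaml and the second yields git/github.com/Microsoft/redie.yaml.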
TODO
Harvested data will be stored in: https://github.com/clearlydefined/harvested-data
This location is temporary. As harvested data grows, we will likely need to move it out of GitHub to scale.
type
provider
namespace -- if none then set to '-'
name
revision
tool
toolVersion -- this is the native output file. If more than one file then they should be archived together
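The harvested-data layout above can be sketched the same way. The function name harvested_path, the tool name, and the version numbers are assumptions for illustration. The sketch leaves the final segment without a file extension because the native output format depends on the tool.

```python
from pathlib import PurePosixPath


def harvested_path(type_, provider, name, revision, tool, tool_version,
                   namespace=None):
    """Build the repo-relative path of a tool's native output file.

    Layout: type/provider/namespace/name/revision/tool/toolVersion
    A missing namespace is written as '-' so the depth stays constant.
    """
    return str(PurePosixPath(
        type_,
        provider,
        namespace or "-",
        name,
        revision,
        tool,
        tool_version,
    ))


# Hypothetical revision, tool, and tool version, for illustration only.
print(harvested_path("npm", "npmjs.org", "redie", "0.3.0", "scancode", "2.2.1"))
```

Using '-' as a placeholder keeps every path the same number of segments deep, which makes it possible to parse a path back into its parts by position.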
How to handle different versions of scanners?
Do we merge results from different versions of ScanCode? How does this impact curation?
When scanning a package where it's actually the source you need to scan, what should be stored where? For example, Maven supports scanning the sources JAR or scanning the GitHub repository.
How to handle tags?
Need to define "origin" and/or pick another term
How do we handle case sensitivity?
Define how to do the linking
The format of harvested data is tool-specific. Tool output is stored in the tool's native output format. If there is a choice between multiple output formats then the priorities are:
Build and run the container.
docker build -t ort .
docker run --mount type=bind,source="<path to repo>",target=/app ort scanner -d /app/output/package-json-dependencies.yml -o /app/output-scanner