EmbarkStudios / cargo-about

📜 Cargo plugin to generate list of all licenses for a crate 🦀
http://embark.rs
Apache License 2.0

Replace clearlydefined #269

Open Jake-Shadle opened 5 days ago

Jake-Shadle commented 5 days ago

The issues with clearlydefined

The goal of the original integration of clearlydefined was to:

  1. get the benefit of better (deeper/more nuanced) automatic license gathering
  2. take advantage of the ability for users to contribute curations on the machine-read data, so that individual projects using cargo-about (and, in the future, cargo-deny) could get the most accurate license information without needing to specify the same clarification over and over again.

This was, in retrospect, doomed to failure.

While clearlydefined does give slightly better license discovery in some cases, it's not significantly better. It is also too aggressive, i.e. it includes licenses found in test data/code that is not actually used by downstream crates, and it is simply broken in some cases, e.g. generating Apache-2.0 AND LLVM-exception SPDX expressions instead of Apache-2.0 WITH LLVM-exception.
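To make the AND vs WITH problem concrete, here is a minimal sketch of a check that flags an exception identifier that is not attached to a license via WITH. The function name and the hard-coded exception list are hypothetical and exist only for illustration; this is not how cargo-about or clearlydefined parse SPDX expressions, and it is not a full SPDX parser.

```rust
/// A tiny, deliberately incomplete list of SPDX exception identifiers,
/// hard-coded for illustration only.
const KNOWN_EXCEPTIONS: &[&str] = &["LLVM-exception", "Classpath-exception-2.0"];

/// Returns every known exception identifier in `expr` that is not directly
/// preceded by the `WITH` operator (hypothetical helper).
fn misplaced_exceptions(expr: &str) -> Vec<&str> {
    // Naive tokenization on whitespace and parentheses.
    let tokens: Vec<&str> = expr
        .split(|c: char| c.is_whitespace() || c == '(' || c == ')')
        .filter(|t| !t.is_empty())
        .collect();

    let mut found = Vec::new();
    for (i, tok) in tokens.iter().enumerate() {
        let is_exception = KNOWN_EXCEPTIONS.contains(tok);
        let follows_with = i > 0 && tokens[i - 1] == "WITH";
        if is_exception && !follows_with {
            found.push(*tok);
        }
    }
    found
}
```

With this sketch, "Apache-2.0 AND LLVM-exception" is flagged because the exception is used as if it were a standalone license, while "Apache-2.0 WITH LLVM-exception" passes.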

The idea of curations was what really sold me on CD, as it meant that users could "fix" a crate once, rather than every project needing to apply the same fix for the crate. The reality is that almost no one in the Rust community is even aware that clearlydefined exists, much less makes curations to improve the gathered data. And (unless this has changed) clearlydefined treats every crate version as a completely separate, unique entity, which means that even if users do contribute curations, they need to be applied to every future version of the crate (unless the crate is "fixed" in the actual source, which is the best option, though not always an available one). That is a frankly ridiculous requirement when 99% of versions for most crates are going to have the exact same license terms every time.

In addition, clearlydefined doesn't actually harvest (license scan) a crate until it is explicitly told to or, thankfully, the crate is simply requested. However, clearlydefined is...egregiously slow, and it can take literally days for it to actually harvest a crate since it supports many other ecosystems like NPM etc., so in practice the hit rate can be quite low depending on what crates and/or versions you are requesting. This could be mitigated by falling back to older versions of crates that are more likely to have been harvested (since, as stated, 99% of the time the license is going to be exactly the same), but it's just another fallback to cover up for clearlydefined's deficiencies.
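The version fallback mentioned above could look roughly like the following. This is only a sketch: versions are modeled as plain (major, minor, patch) tuples rather than full semver (pre-release and build metadata are ignored), and the function name is hypothetical.

```rust
/// Simplified version triple: (major, minor, patch).
/// Assumption: pre-release and build metadata are ignored.
type Version = (u64, u64, u64);

/// Picks the newest already-harvested version that is not newer than the
/// requested one, or `None` if no such version has been harvested.
fn nearest_harvested(requested: Version, harvested: &[Version]) -> Option<Version> {
    harvested
        .iter()
        .copied()
        // Only consider the requested version or older ones.
        .filter(|v| *v <= requested)
        // Tuples compare lexicographically, so `max` is the newest candidate.
        .max()
}
```

Note that this quietly assumes the license did not change between the harvested version and the requested one, which is exactly the kind of guess a workaround like this has to make.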

Those shortcomings are coupled with the fact that clearlydefined is one of the worst performing sites I have seen in my 3 decades on the internet, and the API is not much better: it is basically a coin flip for every request whether it will return a 200 or a 502 (or even just time out sending the request!). This means that in many cases the clearlydefined data isn't even being used because it can't be retrieved in a timely manner. Sure, maybe cargo-about could meter requests better (see #218), but again, it's another workaround for something that should be fast and reliable.

General idea

I haven't completely thought through everything, but essentially I would want a setup similar to how rustsec/advisory-db works: a simple GitHub repo with a simple file structure that can accept contributions, similar to how cargo-about clarifications work today.

This ~solves~ alleviates several issues:

  1. A vast majority of crates don't need clarifications, as most are simply Apache-2.0 OR MIT and include the MIT and Apache-2.0 licenses in the package as separate files. The repo of clarifications would only need to contain fixes for the outliers that need them, unlike the clearlydefined approach of keeping entire copies of every crate (that gets harvested).
  2. github.com is not super fast, but it is highly available and has consistently "ok" speeds.
  3. While cargo-about could cache data from clearlydefined, I haven't bothered implementing that as cargo-about is something generally run in CI without caching, or run infrequently enough that caching has lower value, especially if curations are made. But caching is trivial with a git repo if it's desired since it's just a directory that can then be updated to the remote HEAD to pick up any additions that have happened since the "cache" was updated. The source tarballs could also be used instead if desired.
  4. Crates published to crates.io could be automatically scanned and flagged if the license data is unclear, since this can be done in seconds rather than the hours/days clearlydefined takes.
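As a rough sketch of what a lookup against such a repo could look like once its entries are loaded into memory: the key difference from clearlydefined is that one entry can cover a range of versions instead of exactly one. Everything here (the `Clarification` struct, its fields, the function name, the tuple version type) is hypothetical; no format for the repo has been decided.

```rust
/// Simplified version triple: (major, minor, patch).
type Version = (u64, u64, u64);

/// Hypothetical clarification entry covering a range of crate versions.
struct Clarification {
    krate: &'static str,
    /// First version the clarification applies to (inclusive).
    first: Version,
    /// Last version it applies to (inclusive); `None` means "all later versions".
    last: Option<Version>,
    /// The corrected SPDX license expression.
    expression: &'static str,
}

/// Finds the first clarification matching the crate name and version.
fn find_clarification<'a>(
    db: &'a [Clarification],
    krate: &str,
    version: Version,
) -> Option<&'a Clarification> {
    db.iter().find(|c| {
        c.krate == krate
            && version >= c.first
            && c.last.map_or(true, |last| version <= last)
    })
}
```

An open-ended `last` means a single contribution keeps applying to new releases until someone records that the license situation changed.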

But let's back up a bit, why do we even need these clarifications (or curations in clearlydefined lingo)?

The problems

While making cargo-about (and to a lesser extent cargo-deny) and using it in real-world projects with many external dependencies, I started to realize that a minority (but still a lot in absolute terms) of Rust projects have issues around their licenses.

One issue is that projects with multiple licenses will concatenate all of the licenses together into a single file, which breaks how cargo-about determines the license expression of a file via askalono, the "canonical" example being ring.

Another issue is that crates may not actually fill out the license field (or even the license-file field, though crates.io requires at least one). In that case cargo-about has to assume that every license it discovers is part of an SPDX expression where each license is required via an AND operator, since that is the maximally restrictive expression, even when that doesn't make sense because the two licenses can't be required simultaneously.
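The maximally restrictive fallback amounts to joining everything that was discovered with AND. A minimal sketch of that idea follows; the function name is hypothetical and this is not cargo-about's actual implementation.

```rust
/// Builds the most restrictive SPDX expression from the license identifiers
/// discovered in a crate: every license is required, so all are joined by AND.
/// Returns `None` if nothing was discovered.
fn conservative_expression(discovered: &[&str]) -> Option<String> {
    let mut ids: Vec<&str> = discovered.to_vec();
    // Sort and deduplicate so the same set of files always yields the same expression.
    ids.sort_unstable();
    ids.dedup();

    if ids.is_empty() {
        None
    } else {
        Some(ids.join(" AND "))
    }
}
```

For a typical dual-licensed crate that forgot the license field, this yields Apache-2.0 AND MIT rather than the intended Apache-2.0 OR MIT, which is why a clarification is needed.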

Another issue is that many crates have published versions where the crate package doesn't contain the license text(s) (a requirement for basically every license). This is usually due to the licenses being in the repo root while multiple crates live in subdirectories and rely on those licenses; because the license texts weren't in each crate's root directory beside its manifest file, they weren't packaged during crate publishing.

And, while all of these issues should be fixed in the original crate so that there is no need for a clarification...that doesn't fix any previous versions of a crate that were incorrect, since packages published to crates.io (and hopefully every other registry) are immutable. There are also cases where a crate might be "done" and archived, or the maintainer is otherwise incommunicado or uninterested in taking contributions, meaning the only way to "fix" it is to fork and republish, which can be more work than one wants to do at the time. Having a (hopefully) easier way to contribute the fix to a separate repo mitigates those issues.

End

In closing, this is something that I have been thinking about for cargo-about (and cargo-deny) for a while. I've just been kind of avoiding it because the idea of creating a repo specifically for these kinds of license clarifications is a bit daunting, since I'm just one person who has a ton of other things to do, and while creating tooling to have nice workflows for this would be fun, over time I can see such an endeavor being exhausting.