clearlydefined / curated-data

Contains curations submitted by the community
Creative Commons Zero v1.0 Universal

How to deal with derived/binary files? #10303

Closed mickaelistria closed 2 years ago

mickaelistria commented 3 years ago

I'm looking at several entries for libraries written in Java or TypeScript. They have license headers and so on, and are very clean from a source perspective. However, the license score is very low because the derived files (.class or .js files) are binary-ish and don't contain the license information, which was lost at compilation. Is there some configuration possible on the project to get a better license score in the case where the binary is... binary?

mickaelistria commented 3 years ago

This has been ignored for a long time. Is there a better place where I should raise this question?

jeffmcaffer commented 3 years ago

The idea is that folks fall back to the associated source for information about a binary package. As you have seen, the definitions in ClearlyDefined are very specific to a particular provider and version, so a Maven package is separate from the corresponding source package. They have a sourceLocation connection that is itself curatable.

Do you know if the Eclipse provisions allow for falling back to the source definition in ClearlyDefined to get the score? If they do, then we just need to work on ensuring that the binary packages have good source location data. If not, perhaps we should chat with @waynebeaton et al and see about changing that in the Eclipse Process.

mickaelistria commented 3 years ago

What I fail to understand is that the "License score" doesn't seem to react to the sourceLocation curation: the score is still computed only from the initial (binary) artifact. Is my understanding correct, or did I hit some corner case? I would expect the score computation itself to fall back to the source when it is set, or that there would be two different sub-scores: a "binary license score" and a "source license score". I have the impression that this would be good enough for Eclipse Foundation usage, but I'll let @waynebeaton confirm or refute that.

jeffmcaffer commented 3 years ago

Right. The scores are independent, and we leave it to the consumers to decide if and how they want to combine them, fall back from one to the other, and so on. Several approaches were suggested and discussed in the past, and in the end there was no clear winner. In the interest of full transparency, the general case here is not quite as easy as just taking the source component's score. In some cases the source is in a monorepo or spread over multiple repos. Of course, in some ecosystems there is a strong 1:1 correlation, so this problem is moot. I mention it to illustrate why we don't have a general solution.

It would be great to understand how this might play out in the Eclipse process.
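Since combining the two scores is left to consumers, a consumer-side fallback could be sketched roughly as follows. This is a minimal illustration, not ClearlyDefined's behavior: it assumes definitions are available as plain dictionaries shaped like `described.sourceLocation` and `licensed.score.total`, and the helper names (`license_score`, `source_coordinates`, `pick_score`) and the `source_defs` lookup table are hypothetical.

```python
from typing import Optional, Tuple


def license_score(definition: dict) -> Optional[int]:
    """Return the total license score of a definition-shaped dict, if present.

    Assumption: the score lives under licensed.score.total.
    """
    licensed = definition.get("licensed") or {}
    score = licensed.get("score") or {}
    return score.get("total")


def source_coordinates(definition: dict) -> Optional[str]:
    """Build a coordinates string from described.sourceLocation, if it is set."""
    described = definition.get("described") or {}
    location = described.get("sourceLocation")
    if not location:
        return None
    keys = ("type", "provider", "namespace", "name", "revision")
    # A missing segment (for example no namespace) is written as "-".
    return "/".join(location.get(key) or "-" for key in keys)


def pick_score(binary_def: dict, source_defs: dict) -> Tuple[str, Optional[int]]:
    """Prefer the source definition's score; otherwise use the binary's own.

    source_defs is a hypothetical lookup from coordinates to definitions
    that the consumer has already fetched.
    """
    coordinates = source_coordinates(binary_def)
    source_def = source_defs.get(coordinates) if coordinates else None
    if source_def is not None:
        score = license_score(source_def)
        if score is not None:
            return ("source", score)
    return ("binary", license_score(binary_def))


# Made-up example data: a binary definition pointing at its source archive.
binary = {
    "described": {
        "sourceLocation": {
            "type": "sourcearchive",
            "provider": "mavencentral",
            "namespace": "org.example",
            "name": "lib",
            "revision": "1.0.0",
        }
    },
    "licensed": {"score": {"total": 60}},
}
sources = {
    "sourcearchive/mavencentral/org.example/lib/1.0.0": {
        "licensed": {"score": {"total": 80}}
    }
}

print(pick_score(binary, sources))
print(pick_score(binary, {}))
```

This only covers the 1:1 case. When the source sits in a monorepo or is spread over several repos, a single lookup like this is not enough, which is the difficulty described above.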

mickaelistria commented 3 years ago

> In the interest of full transparency, the general case here is not quite as easy as just taking the source component's score. In some cases the source is in a monorepo or spread over multiple repos. Of course, in some ecosystems there is a strong 1:1 correlation so this problem is moot. I mention it to illustrate why we don't have a general solution.

I agree that all the strategies require somehow trusting that the source and the artifact match, and without a rebuild and a reproducible build that is almost never certain. However, I don't get how there can be "no clear winner": currently ClearlyDefined uses the binary to get license/IP info, and it's pretty clear that in many cases this license/IP info won't be there, so the ClearlyDefined data won't be really useful and the score for the binary will always be disappointing. In most cases I see of modern components (written in Java and compiled to .class, written in JS or TS and bundled with webpack, or written in languages that compile to native code), the license/IP analysis of the binary artifact is not going to give interesting data. So I strongly believe that, to be more useful and relevant, ClearlyDefined should build a "Binary License Score" and a "Source License Score", assuming (as it already does) that the declared source matches the binary closely enough.

jeffmcaffer commented 3 years ago

Fair points. Will have to think it through a bit more and get others (@nellshamrell and @fossygirl in particular) to chime in.

nellshamrell commented 3 years ago

@mickaelistria Thank you for bringing this to our attention.

I could see us doing a "Binary License Score" and a "Source License Score" in situations where we have a clear mapping to a source repo (which is the case for many but not all types of ClearlyDefined components).

What I'm interested in understanding (from your org's POV) is how you would use both those scores to make decisions. Would you combine them somehow? Would you go with the one that is higher? I'd love to hear more about your use case.

mickaelistria commented 3 years ago

@nellshamrell Thanks for your interest. I cannot officially answer your question, but I summon @waynebeaton to do so. (My impression is that the Eclipse IP rules only care about sources, as they only require the "consuming developer" to share source during the analysis, and they trust that this consuming developer has ensured the sources they provide for analysis match the binary they use. So the Eclipse Foundation would only use the "Source License Score". But this comment isn't worth much unless @waynebeaton labels it as accurate.)

waynebeaton commented 3 years ago

My understanding of how to use the ClearlyDefined data to support the implementation of the Eclipse Foundation's IP Policy continues to evolve. Currently, I'm not taking a lot of interest in the scores at all, but instead am focusing on the declared and discovered licenses. I haven't removed consideration of the score from our tools, but am strongly considering it (or at very least lowering the threshold from 75).

Given that there is not a reliable 1:1 mapping between binary and source artifacts on ClearlyDefined (and I completely understand why this is hard), we've mostly been using (and curating) license data from the binary artifacts. There are clear deficiencies with this that I haven't sorted out how (or whether) to address.

When we run our own scans, we try to identify the most obvious source, e.g., the source JAR from Maven Central or the source ZIP from npmjs.

mickaelistria commented 3 years ago

@waynebeaton At the moment, most of the cases that require further action at Eclipse.org because ClearlyDefined scores are too low are Maven Java libraries or npm modules written in TypeScript, with most file legal headers lost in translation. The workaround for Eclipse committers, according to Eclipse Foundation rules, is then to open a ticket and attach the source (for Maven, usually just the attached source artifact). So am I right to think that if a ClearlyDefined score were built directly for the source (in addition to the binary), then it would be good enough to look at that "source" score (which would usually be around 80% rather than 60%) without needing to change the acceptance threshold? I have the impression that, independently of the Eclipse Foundation, license checks are only reliable on source anyway, and that trusting that binary and source match is a separate, later step. Basically, the story of "trusting an OSS component" is made of two points: trusting the source, and then trusting that some package matches the source. Trying to trust the binary directly (from a legal declaration perspective at least) seems to go too fast and misses many cases.

waynebeaton commented 3 years ago

TL;DR: probably.

FWIW, you can feed the license tool "source archive" ClearlyDefined coordinates instead of Maven GAVs and it will do exactly what you're describing.
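As a rough illustration of the difference between the two kinds of identifiers, the sketch below turns a three-part Maven GAV into a ClearlyDefined-style coordinates string. It assumes the `type/provider/namespace/name/revision` layout, with `maven` for the binary package and `sourcearchive` for the source archive, both under the `mavencentral` provider. The function name `gav_to_coordinates` is hypothetical, and the exact input syntax the license tool accepts is not shown in this thread.

```python
def gav_to_coordinates(gav: str, source: bool = False) -> str:
    """Convert 'groupId:artifactId:version' to a coordinates string.

    With source=True the 'sourcearchive' type is used instead of 'maven'.
    Only the simple three-part GAV form is handled in this sketch.
    """
    parts = gav.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {gav!r}")
    group_id, artifact_id, version = parts
    component_type = "sourcearchive" if source else "maven"
    return f"{component_type}/mavencentral/{group_id}/{artifact_id}/{version}"


# Made-up GAV used purely for illustration.
print(gav_to_coordinates("org.example:lib:1.0.0"))
print(gav_to_coordinates("org.example:lib:1.0.0", source=True))
```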

I have been thinking three things: (1) that setting the default type/source to "maven/mavencentral" is not correct; (2) the default type/source should be something that can be overridden; and (3) I'm still pretty convinced that we only actually care about the licenses and not the ClearlyDefined score.

> open a ticket and attach the source

I understand that this is beside the point, but the functionality to do this automatically is getting better.

This discussion is, I believe, more about EF policy than ClearlyDefined, so we should move it to the Dash License Tool.

ariel11 commented 2 years ago

@nellshamrell - is this Issue ready to be closed?

nellshamrell commented 2 years ago

@ariel11 yes, closing. If anyone would like to discuss further, feel free to re-open.