fsfe / reuse-docs

REUSE recommendations, tutorials, FAQ and specification
https://reuse.software

Define precedence of information with REUSE.yaml #70

Open mxmehl opened 4 years ago

mxmehl commented 4 years ago

We should decide on a precedence chain by which the tool (and our spec) can determine which licensing/copyright information is authoritative when there are conflicts, e.g. between the source code file and the respective .license file.

All of this is based on the assumption that we will introduce a REUSE.yaml file.

There are three options that make sense in their own ways. Which one do you prefer?

Option 1: most intuitive

Rationale: The closer to the file, the more weight the information should have

  1. file content
  2. file.license
  3. adjacent REUSE.yaml (no matter if explicit or glob)
  4. faraway REUSE.yaml

Option 2: override compromise

Rationale: Allow overriding file contents, e.g. if there are misleading strings which confuse the REUSE tool. Otherwise the same as Option 1.

  1. file.license
  2. file content
  3. adjacent REUSE.yaml (no matter if explicit or glob)
  4. faraway REUSE.yaml

Option 3: most tooling robust

Rationale: Similar to Option 2, but make a difference between explicit and glob file definitions

  1. file.license
  2. explicit in adjacent yaml (so file directly covered)
  3. explicit in faraway yaml
  4. file content
  5. glob in adjacent yaml (file covered e.g. as part of *.png)
  6. glob in faraway yaml
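
As a rough illustration of the Option 3 chain above, here is a minimal Python sketch. It is not how the reuse tool works today; the `Source` enum and the `resolve` function are hypothetical names, and the sketch assumes that each source has already been parsed into a single license string.

```python
from enum import IntEnum
from typing import Dict, Optional


class Source(IntEnum):
    """Where licensing info was found, ranked as in Option 3 (1 wins)."""
    LICENSE_FILE = 1       # file.license
    EXPLICIT_ADJACENT = 2  # explicit path in the adjacent YAML
    EXPLICIT_FARAWAY = 3   # explicit path in a faraway YAML
    FILE_CONTENT = 4       # header inside the file itself
    GLOB_ADJACENT = 5      # glob match in the adjacent YAML
    GLOB_FARAWAY = 6       # glob match in a faraway YAML


def resolve(found: Dict[Source, str]) -> Optional[str]:
    """Return the license from the highest-ranked source, or None."""
    if not found:
        return None
    return found[min(found)]
```

With this ranking, an explicit YAML entry beats the file content, while a glob never does. Options 1 and 2 would only need a different ordering of the enum values.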

oddhack commented 3 years ago

I like #3. I've been doing REUSE compliance for some repositories which import multiple other repos which are not REUSE-compliant and aren't likely to be, so globbing is important.

mxmehl commented 3 years ago

> I like #3. I've been doing REUSE compliance for some repositories which import multiple other repos which are not REUSE-compliant and aren't likely to be, so globbing is important.

Ah, I think there's a misunderstanding. Globbing would be possible in each scenario. Option 3 just distinguishes whether a file is defined explicitly or via a glob, and how this changes depending on the distance of the YAML file. I tried to make that a bit clearer above.

oddhack commented 3 years ago

I wasn't entirely clear either. A confusing thing I kept running into was how .reuse/dep5 is parsed such that the last match is the controlling one - someone in another issue pointed to dep5 documentation on how that works, but I didn't see it in REUSE documentation.

So having the processing stages clearly laid out such that explicit matches happen first and in a separate stage from glob matching would be more sensible for me, and better documented (even if just at the level of the comment above). However you do it, I encourage being really clear about how multiple matches to the same file are handled.
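
To make the "last match is the controlling one" behaviour concrete, here is a small Python sketch. It only approximates dep5 matching with the standard library's `fnmatch` (which, like dep5, lets `*` match across `/`, but also accepts `[...]` patterns); the `last_match` function and the sample paragraphs are made up for illustration.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple


def last_match(path: str, paragraphs: List[Tuple[str, str]]) -> Optional[str]:
    """Return the license of the LAST paragraph whose pattern matches."""
    result = None
    for pattern, license_id in paragraphs:
        if fnmatchcase(path, pattern):
            result = license_id  # later paragraphs overwrite earlier ones
    return result


# Hypothetical dep5-like paragraphs, in file order.
paragraphs = [
    ("*", "MIT"),
    ("vendor/*", "Apache-2.0"),
    ("vendor/lib/*.c", "LGPL-3.0-only"),
]
```

The order of the paragraphs matters: if the catch-all `*` were moved to the end, it would win for every file, which is the kind of surprise described above.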

robinkrahl commented 3 years ago

For the use case I described in this comment, I’d prefer option 3:

> I’m including the source code of a third-party library in a project I maintain. The library uses the deprecated LGPL-3.0 identifier which reuse does not accept. I would like to overwrite these annotations in the dep5 file, but reuse still parses the library files and reports the incorrect license identifier.

Alternatively, would it be possible to add an override field to the entries in REUSE.yaml to disable checking the matched files?

mxmehl commented 3 years ago

> For the use case I described in this comment, I’d prefer option 3:

Thank you. May I ask why option 2 would not work for you?

> Alternatively, would it be possible to add an override field to the entries in REUSE.yaml to disable checking the matched files?

If possible, my preference would be for the precedence chain to make such manual overrides obsolete, so that we do not run into the trap of option overload.

robinkrahl commented 3 years ago

> May I ask why option 2 would not work for you?

You are right, it would work too. I would just have to generate multiple .license files, which is rather tedious, especially when I have to merge in new versions from the upstream project. Specifying the information in a single location would be easier.

silverhook commented 3 years ago

For practical reasons, I would prefer №3. It is more complex, but that complexity does bring with it flexibility to make REUSE actually useful in more complex (dare I say real life) scenarios where 3rd party code mixes with 1st party code and a tonne of non-editable files.

Still, in the spec and FAQ (perhaps even in the tool’s output) we should continue to emphasise the importance of having the licensing/copyright info in the files themselves, if at all possible, as this info then does not get lost if a file is taken out of its home context.

mxmehl commented 3 years ago

> You are right, it would work too. I just would have to generate multiple .license files which is rather tedious, especially when I have to merge in new versions from the upstream project. Specifying the information in a single location would be easier.

Ah, but in options 1 and 2, you would also have the glob available. Compared to option 2, option 3 only distinguishes the type of coverage (explicit path vs. glob) when information overlaps.

Excuse me for asking you so much about your rationale ;)

mxmehl commented 3 years ago

> For practical reasons, I would prefer №3. It is more complex, but that complexity does bring with it flexibility to make REUSE actually useful in more complex (dare I say real life) scenarios where 3rd party code mixes with 1st party code and a tonne of non-editable files.

I am starting to get the same feeling. I am just not so happy with having the file content ranked that low, given our actual priority and the usefulness for human readers.

oddhack commented 3 years ago

While there are a couple of different ways of presenting it, for me #3 boils down to

That may make it a little more clear why I'm in favor? Having 3 ways to specify the explicit override does seem a bit overkill-ish. Though in our usage thus far we have completely avoided .license files as they clutter the repository - a lot, if you have e.g. hundreds of images.

robinkrahl commented 3 years ago

> Ah, but in option 1 and 2, you would also have the glob available. Option 3 compared to option 2 just makes a difference on the type of coverage (explicit path vs. glob) when information overlaps.

Yes, but in options 1 and 2, the information in the file has precedence over the glob, right? My problem was that the annotations in the file use an outdated license specifier, so I don’t want the files to be parsed.

mxmehl commented 3 years ago

> Yes, but in options 1 and 2, the information in the file has precedence over the glob, right? My problem was that the annotations in the file use an outdated license specifier, so I don’t want the files to be parsed.

Yes, but the same applies to option 3, where the glob has a lower priority than the file content. To override a file's content, options 2 and 3 would require an explicit override (while in option 1 this is not possible, and therefore not practical): either a .license file (options 2 and 3) or an explicit mention in the YAML file (option 3).

Say that you want to override the content in src/code.py for any reason:

Option 1

Not possible as file content ("as close to the file as possible") is authoritative.

Option 2

  1. Create src/code.py.license

Option 3

  1. Create src/code.py.license
  2. Write in your YAML file (again, this format is not specified yet, but on the roadmap):
- src/code.py:
    license:
      SPDX-License-Identifier: MIT
    copyright: |
      SPDX-FileCopyrightText: 2020 Me

In this option, a glob would still not override the file, so the following things will not work:

- *:
    license:
      SPDX-License-Identifier: MIT
    copyright: |
      SPDX-FileCopyrightText: 2020 Me

- src/*:
    license:
      SPDX-License-Identifier: MIT
    copyright: |
      SPDX-FileCopyrightText: 2020 Me

- src/*.py:
    license:
      SPDX-License-Identifier: MIT
    copyright: |
      SPDX-FileCopyrightText: 2020 Me

The reason for that: REUSE is designed to allow devs to re-use code by others, who ideally also applied REUSE to their code. So if I just copy someone else's file into my own repo, e.g. into src/, overriding the existing information should be a deliberate step. With a glob, this can happen accidentally, which is why globs have the lowest priority in option 3.
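
A sketch of that rule in Python, under a few assumptions: the YAML format is not specified yet, so entries are modelled here as plain (pattern, license) pairs; .license files and the adjacent/faraway distinction are left out; and when several globs match, the first one is taken arbitrarily. The names `is_glob` and `effective_license` are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

GLOB_CHARS = set("*?[")


def is_glob(pattern: str) -> bool:
    """Treat a pattern as a glob if it contains any wildcard character."""
    return any(ch in GLOB_CHARS for ch in pattern)


def effective_license(
    path: str,
    in_file: Optional[str],
    entries: List[Tuple[str, str]],
) -> Optional[str]:
    """Explicit entry first, then file content, then glob (Option 3)."""
    for pattern, license_id in entries:
        if not is_glob(pattern) and pattern == path:
            return license_id  # deliberate, explicit override
    if in_file is not None:
        return in_file  # the header in the file beats any glob
    for pattern, license_id in entries:
        if is_glob(pattern) and fnmatchcase(path, pattern):
            return license_id  # fallback for files without a header
    return None
```

So a copied-in src/code.py that carries its own header keeps it unless src/code.py is named explicitly, while a header-less file such as an image is still covered by a glob.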

As a reminder: the main idea of REUSE is that devs write the copyright/license information in the file headers as this best preserves this kind of information, also for non-REUSE tooling.

robinkrahl commented 3 years ago

> Yes, but the same applies to option 3 where the glob has a lower priority than the file content.

I see, thanks for the explanation! In this case, just ignore my previous comments.

buxtonpaul commented 1 year ago

Option 3 for me! I think this, in combination with multiple REUSE.yaml files, will cover almost all of my current issues.

In addition, I think that if there are multiple explicit definitions (either in YAML or through a .license file), a warning should be generated.
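
Something like the following Python sketch is what I have in mind. It is only an illustration: `explicit_conflict_warning` is a made-up function, and it assumes the tool already knows whether a .license file exists and which YAML files name the path explicitly.

```python
from typing import List, Optional


def explicit_conflict_warning(
    path: str,
    has_license_file: bool,
    explicit_yaml_files: List[str],
) -> Optional[str]:
    """Return a warning text if more than one explicit source covers path."""
    sources = list(explicit_yaml_files)
    if has_license_file:
        sources.insert(0, path + ".license")
    if len(sources) > 1:
        return (
            "warning: " + path + " has multiple explicit definitions: "
            + ", ".join(sources)
        )
    return None
```

Glob matches are deliberately not counted here, since overlapping globs are expected and already ranked below explicit definitions in option 3.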