fsfe / reuse-docs

REUSE recommendations, tutorials, FAQ and specification
https://reuse.software

Define syntax and format of REUSE.yaml #81

Open mxmehl opened 3 years ago

mxmehl commented 3 years ago

As discussed in spdx/spdx-spec#502, the SPDX project plans to support a "metadata, pre-document file" that contains specific information about files relative to its position. This follows a request to implement something called REUSE.yaml, first discussed here. This issue is to discuss the exact format and syntax of the file.

Proposed YAML options

In the original discussion, we proposed four different syntaxes. One of them (also disliked by the REUSE team) has been turned down in an SPDX call. I removed two others as they are rather unintuitive and clumsy. Also, I changed the format a bit to comply with the YAML syntax (using * as key name is invalid), and added another option.

Option 1: list

Each list item is an SPDX tag as used in file headers. Easy to read thanks to the -, but all items must be wrapped in " to escape the : which would separate a key from a value – we cannot have multiple keys!

- files: "src/*"
  info:
    - "SPDX-FileCopyrightText: 2020 Me"
    - "SPDX-FileCopyrightText: © 2017 You"
    - "SPDX-License-Identifier: MIT"

Option 2: multi-line string

SPDX tags are just separated by new lines. No - or escaping of : are required. However, indentation must be preserved for all lines!

- files: "src/*"
  info: |
    SPDX-FileCopyrightText: 2020 Me
    SPDX-FileCopyrightText: © 2017 You
    SPDX-License-Identifier: MIT
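Once a YAML parser has loaded the file, options 1 and 2 carry the same data, either as a list of strings or as one multi-line string. A minimal sketch of how a tool might turn both into the same structure; `parse_info` is a hypothetical helper for illustration, not part of the REUSE tool:

```python
from collections import defaultdict


def parse_info(info):
    """Split 'Tag: value' lines into a dict mapping each tag to a list of values."""
    # Option 2 arrives as one multi-line string, option 1 as a list of strings.
    lines = info.splitlines() if isinstance(info, str) else list(info)
    tags = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values may contain ':' themselves.
        tag, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'Tag: value' line: {line!r}")
        tags[tag.strip()].append(value.strip())
    return dict(tags)


option_1 = [
    "SPDX-FileCopyrightText: 2020 Me",
    "SPDX-FileCopyrightText: © 2017 You",
    "SPDX-License-Identifier: MIT",
]
option_2 = (
    "SPDX-FileCopyrightText: 2020 Me\n"
    "SPDX-FileCopyrightText: © 2017 You\n"
    "SPDX-License-Identifier: MIT\n"
)

print(parse_info(option_1) == parse_info(option_2))
```

So the choice between the two options is mostly about how the file reads and how easy it is to write by hand, not about what a tool can extract from it.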

Option 3: license and copyright as separate keys

We could also separate the two information items. Downside: the keys must be wrapped in " to escape the - in the key name.

- files: "src/*"
  "SPDX-FileCopyrightText":
    - "2020 Me"
    - "© 2017 You"
  "SPDX-License-Identifier": MIT

Background on the YAML keys

Unlike the SPDX YAML format, we would like to avoid copyrightText and licenseDeclared as key names. In REUSE, the SPDX-License-Identifier and SPDX-FileCopyrightText (or alternatively traditional, varying copyright statements) are common and understood by the users.

This was also accepted in the SPDX call.

Possible targets

REUSE.yaml is intended to target files that are relative to its position, and only those that are "below".

Statements like files: "../../src/*" should not be possible.
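A rough sketch of how such a restriction could be checked. The function name `is_below` and the strict rule of rejecting every `..` component are assumptions for illustration; nothing here is decided in the proposal:

```python
from pathlib import PurePosixPath


def is_below(pattern):
    """Return True if a pattern cannot point outside the directory of REUSE.yaml."""
    path = PurePosixPath(pattern)
    if path.is_absolute():
        return False
    # Any '..' component could climb out of the directory, so reject it outright.
    # This is stricter than necessary: 'src/../doc/*' stays below but is refused too.
    return ".." not in path.parts


for candidate in ("src/*", "./images/*.jpg", "../../src/*", "/etc/*"):
    print(candidate, is_below(candidate))
```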

Supporting traditional copyright statements?

A related question is whether we should only support SPDX-FileCopyrightText as indicator for files' copyright, or also "traditional" statements like "Copyright © 2021 Jane Doe".

REUSE recommends the SPDX tag, but also supports the traditional statements. My suggestion would be to do the same in REUSE.yaml to reduce friction, but in SPDX this could lead to conflicts. Happy to collect opinions here!

Globbing

DEP-5 uses a simple glob syntax. In this, */Makefile would include any Makefile in all paths below. I am not sure whether this globbing is represented in any native Python module. The benefit of sticking with the DEP-5 glob is that we could more easily convert existing DEP-5 files to REUSE.yaml.

Another possibility would be using the Python-native glob. */Makefile would only match a Makefile in one level below, while **/Makefile would match all Makefiles.

We could also use pathspec, supporting the same globbing as gitignore.
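To make the difference between the first two candidates concrete, here is a small sketch. It uses `fnmatch` as a stand-in for the DEP-5 behaviour, on the assumption that what matters is `*` also matching `/`, and it runs Python's `glob` against a throwaway directory tree. The file names are made up for the example:

```python
import fnmatch
import glob
import os
import tempfile

paths = ["Makefile", "src/Makefile", "src/lib/Makefile"]

# DEP-5-like matching: '*' also matches '/', which is how fnmatch behaves.
dep5_like = [p for p in paths if fnmatch.fnmatchcase(p, "*/Makefile")]


def python_glob(pattern, files):
    """Create the given files in a temporary tree and return what glob matches."""
    with tempfile.TemporaryDirectory() as root:
        for rel in files:
            full = os.path.join(root, *rel.split("/"))
            os.makedirs(os.path.dirname(full), exist_ok=True)
            open(full, "w").close()
        hits = glob.glob(os.path.join(root, *pattern.split("/")), recursive=True)
        return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)


print("DEP-5-like */Makefile: ", dep5_like)
print("glob       */Makefile: ", python_glob("*/Makefile", paths))
print("glob      **/Makefile: ", python_glob("**/Makefile", paths))
```

Note that in this sketch `**/Makefile` also picks up the top-level Makefile, while the DEP-5-like `*/Makefile` does not, so converting existing DEP-5 files would not be a purely mechanical rewrite of `*` to `**`.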

Conflict resolution

As in DEP-5, I would suggest that the last match of a file wins. So if the file foo.txt is first matched by * and then *.txt, the last statement would count.
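A minimal sketch of the "last match wins" rule, assuming the annotations are kept in file order as (pattern, info) pairs. `resolve` and the licenses in the example are purely illustrative:

```python
import fnmatch


def resolve(path, annotations):
    """Return the info of the last annotation whose pattern matches path, or None."""
    result = None
    for pattern, info in annotations:
        if fnmatch.fnmatchcase(path, pattern):
            result = info  # later matches overwrite earlier ones
    return result


annotations = [
    ("*", {"SPDX-License-Identifier": "MIT"}),
    ("*.txt", {"SPDX-License-Identifier": "CC0-1.0"}),
]

print(resolve("foo.txt", annotations))  # matched by both, the second entry wins
print(resolve("foo.py", annotations))   # matched only by '*'
```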

The dependency resolution within REUSE and its different options – including REUSE.yaml – is discussed in #70.

silverhook commented 3 years ago

When it comes to YAML flavours I think all should be OK – I guess we would use an external parser and linter anyway, right?

For files that reuse.yaml should target, I agree it should only affect its siblings and children. Parents etc. should be out of scope.

Regarding traditional copyright statements, I think it is reasonable to expect an SPDX tag, but after it, it should be free text form. Non-SPDX-tag statements were accepted before for legacy reasons. The YAML file is going to be new, so no legacy exists for it. Even if someone has a preferred format, they can just prepend it with SPDX tag.

Globbing – no preference, as long as it’s something that is in common practice and coherent.

Conflict resolution – I agree with your proposal.

Jayman2000 commented 2 years ago

I think that the syntax should avoid the strings “SPDX-License-Identifier:” and “SPDX-\<tagname>:”. Those strings are likely to cause false positives. Tools that aren’t REUSE.yml aware will mistakenly assume that the data applies to REUSE.yml. Here’s my proposal:

Option 1: list

- files: "src/*"
  info:
    - "FileCopyrightText: 2020 Me"
    - "FileCopyrightText: © 2017 You"
    - "License-Identifier: MIT"

Option 2: multi-line string

- files: "src/*"
  info: |
    FileCopyrightText: 2020 Me
    FileCopyrightText: © 2017 You
    License-Identifier: MIT

Option 3: license and copyright as separate keys

- files: "src/*"
  "FileCopyrightText":
    - "2020 Me"
    - "© 2017 You"
  "License-Identifier": MIT

If we do decide to drop the “SPDX-”, then I would recommend option 3. That way, if someone makes a mistake and includes the “SPDX-”, they have to do less to fix it.

I would also recommend making the REUSE Tool give a helpful error when this mistake happens. For example, it could say “Found ‘SPDX-License-Identifier’ in REUSE.yml. In REUSE.yml, use ‘License-Identifier’ instead (no ‘SPDX-’).”
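Such a check could be sketched roughly as follows. `check_keys` is a hypothetical function, not something the REUSE Tool provides, and it assumes the entry has already been parsed into a mapping as in option 3:

```python
def check_keys(entry):
    """Return one message per key that wrongly carries the 'SPDX-' prefix."""
    messages = []
    for key in entry:
        if key.startswith("SPDX-"):
            bare = key[len("SPDX-"):]
            messages.append(
                f"Found '{key}' in REUSE.yml. "
                f"In REUSE.yml, use '{bare}' instead (no 'SPDX-')."
            )
    return messages


entry = {"files": "src/*", "SPDX-License-Identifier": "MIT"}
for message in check_keys(entry):
    print(message)
```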

silverhook commented 2 years ago

Great catch, @Jayman2000! What you write makes sense to me. It does provide some extra complication, but seems worth it to me in order to avoid future issues.

andrewshadura commented 2 years ago

Why not rename FileCopyrightText to copyrights and License-Identifier to license? A similar format is already used by scan-copyrights.

mxmehl commented 2 years ago

Why not rename FileCopyrightText to copyrights and License-Identifier to license? A similar format is already used by scan-copyrights.

We would like to have the SPDX project make this part of their spec, too, in order to not create conflicts with other compliance tools and practices (see: spdx/spdx-spec#502).

In SPDX, there are multiple "license" fields, e.g. the concluded or the declared license. I am afraid that this unclear terminology would not pass SPDX. However, the main goal is to avoid confusion: so either we stick with the tags that are already used in REUSE (except in DEP5) or we make them really simple (as you suggested).

andrewshadura commented 2 years ago

We would like to have the SPDX project make this part of their spec, too, in order to not create conflicts with other compliance tools and practices (see: spdx/spdx-spec#502 https://github.com/spdx/spdx-spec/issues/502).

While I can see why you might want that, I'm not sure that's a goal worth pursuing. One of the reasons I keep my usage of SPDX to the minimum is its verbosity. I fear if and when your proposal is merged into SPDX, it's going to become yet another verbose way of specifying licensing information people will avoid.

I'm also unsure why you want to deprecate DEP-5, which in my view is superior to many other similar formats. If something isn't quite right in it, I'd personally try to evolve it into a machine-readable copyright format 2.0 rather than abandon it completely.

floriansnow commented 2 years ago

Most of this looks good to me. I would like to add my two cents in regards to two things:

andrewshadura commented 2 years ago

I don’t think YAML vs JSON is an issue with Python: there are multiple YAML libraries for Python (pyyaml, ruamel, strict-yaml), so YAML is quite well-supported. JSON is much less readable even when pretty-printed, it requires commas between list elements but not after, and I wouldn’t count on anything that generates it to actually pretty-print it. In my experience most generated JSON was dumped onto a single endless line, and most generated YAML was formatted and human-readable.

floriansnow commented 2 years ago

JSON is in the standard library and json.dump() supports decent printing with the indent parameter. Perhaps strict-yaml could serve a similar purpose, but most of the time, JSON is the stricter, more well defined version of YAML IMHO.
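For illustration, this is what the `indent` parameter does with one of the entries from the first post. `json.dumps` is used instead of `json.dump` only so that the result can be printed:

```python
import json

annotation = {
    "files": "src/*",
    "SPDX-FileCopyrightText": ["2020 Me", "© 2017 You"],
    "SPDX-License-Identifier": "MIT",
}

# Default output: everything on a single line, non-ASCII characters escaped.
compact = json.dumps(annotation)

# indent=2 puts every key and list element on its own line;
# ensure_ascii=False keeps the copyright sign readable.
pretty = json.dumps(annotation, indent=2, ensure_ascii=False)

print(compact)
print(pretty)
```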

mxmehl commented 2 years ago

While I can see why you might want that, I'm not sure that's a goal worth pursuing. One of the reasons I keep my usage of SPDX to the minimum is its verbosity. I fear if and when your proposal is merged into SPDX, it's going to become yet another verbose way of specifying licensing information people will avoid.

The files we intend to use do not have much in common with a full SPDX SBOM, for which I agree that they are impossible to parse for humans. However, making REUSE's labelling compatible with an ISO standard has the great advantage that the likelihood of being compatible with other tools and best practices is much higher.

I see the advantage of creating one's own specs, but following the practice of "not invented here" even when there are reasonably good alternatives has only seldom advanced technology.

I'm also unsure why you want to deprecate DEP-5, which in my view is superior to many other similar formats. If something isn't quite right in it, I'd personally try to evolve it into a machine-readable copyright format 2.0 rather than abandon it completely.

Please read the full discussion and proposal that I've linked in the first post. There are good reasons why DEP-5 is not ideal for our purpose: https://lists.fsfe.org/pipermail/reuse/2020q3/000085.html

nicorikken commented 2 years ago

Reading the discussion on https://github.com/spdx/spdx-spec/issues/502 one thing stands out to me: the desire to align with the SPDX YAML. I think the current thoughts best align with the files section. The packages section apparently is of interest to the community as listed in the same thread, but that might be out of scope for now. So I think we need to look closer at https://github.com/spdx/spdx-spec/blob/e25d183ade64c123770412297b9bf5086a7ed0bf/examples/SPDXYAMLExample-2.2.spdx.yaml#L241

Based on that I would consider a file like:

---
spdxVersion: "SPDX-2.3" # mandatory to allow future spec changes
creationInfo: # optional
  comment: "Easily add metadata to image files."
  created: "2022-05-25"
  # and other metadata if desired
# FIXME: perhaps needs information that this is to be considered input, not output
files:
# In line with SPDX YAML output
- copyrightText: "Copyright Photographer X"
  fileContributors: ["Photographer X"] # optional
  licenseConcluded: "CC-BY-4.0"
  fileName: "./images/other-author.jpg"
# My main proposal for simplicity
- fileGlob: "./images/*.jpg" #or another term, but to differentiate from 'fileName'
  copyrightText: |
    Copyright 2022 Photographer X
    Copyright © 2022 Image editor Y
  fileContributors:
    - "Photographer X"
    - "Image editor Y"
  licenseConcluded: "CC-BY-4.0" # I don't see a reason to change the key, or is there?

I know the format is quite different from earlier proposals:

I step into this discussion quite late, so feel free to point out my false reasoning.

Tachi107 commented 2 years ago

Please read the full discussion and proposal that I've linked in the first post. There are good reasons why DEP-5 is not ideal for our purpose: https://lists.fsfe.org/pipermail/reuse/2020q3/000085.html

Apart from having to put the file in .reuse/, what's the issue with dep5? I might be biased as I'm involved in Debian stuff, but it seems that so far that format has served users well (well defined, widely used, easy to write, concise).

Instead of creating a new YAML format, have you considered extending dep5 support so that it is possible to put files at any directory level? Like what you are proposing with REUSE.yaml, users would be able to create different dep5 files named REUSE.dep5 at any point in their directory hierarchy. This would fix one major limitation of the current dep5 integration, while avoiding annoying users that would have to migrate their (possibly large) .reuse/dep5 files to a new incompatible format.

Also, from the linked email:

The first downside of DEP5 is that the tags are different from the normal SPDX/REUSE tags

Using License instead of SPDX-License-Identifier isn't that big of a deal IMO, as the extra verbosity of the file tag is needed so that it can be easily extracted from general files; an ad-hoc file doesn't need extra qualifiers. As for Copyright, it is a REUSE tag. Also, judging from the proposals above, it seems that keys would also differ in this new format (copyrightText vs SPDX-FileCopyrightText and licenseConcluded vs SPDX-License-Identifier).

[dep5] requires some other meta information out of REUSE's scope

The only required information that's not directly related to REUSE is the Format key, that would be needed in a custom YAML format anyway to allow format changes.

On the other hand if this YAML format gets standardized as an official SPDX format and it is not too verbose it would be nice to adopt it instead :)

Edit: forgot to mention, but implementation details such as Python's standard library support for YAML, JSON, etc should not be a high priority (I wouldn't consider them at all... one of the points of standardizing a format is the possibility of having different interoperable implementations, regardless of the programming language used)

pietroalbini commented 1 year ago

@mxmehl to follow up on the issues I identified in https://github.com/rust-lang/rust/pull/99415#issuecomment-1219355327, I'm wondering whether Tachi's proposal of a REUSE.dep5 file rather than (or in addition to) REUSE.yaml would be accepted.

The discussion to define the YAML format seems to have stalled on the SPDX side, and implementing REUSE.dep5 seems to require way less design work and consensus gathering, at least from my outside perspective.

silverhook commented 1 year ago

Quite the opposite, I’m afraid, @pietroalbini.

There are several points where DEP5 (mostly, but not only, due to historical reasons) differs from SPDX and REUSE.

To use DEP5 in REUSE was a good hack early on, but as it (and SPDX) becomes more wide-spread, the problems, exceptions, workarounds etc. that REUSE would need to do to make DEP5(-ish) usable make it quite an obstacle.

And bending DEP5 to suit REUSE seems to break much more than creating our own SPDX(-derived) YAML format.

andrewshadura commented 1 year ago

I don't get it. The machine-readable copyright format, which you not quite correctly refer to as DEP5, has been in use in Debian for quite a long time, more than a decade if I remember correctly. So far, as far as I'm aware, we haven't received requests for improvement from REUSE, but if we did, I'm certain they could eventually result in a version 1.1 or even 2.0. After all, the goal of the format was to provide a human- and machine-readable way of documenting license and copyright information, so if it didn't fulfill that goal, improving it was never off the table.

The only real downside of it as opposed to a YAML-based format is a need for a parser, but that's been solved ages ago (and also the format is a composition of well-known standards such as RFC 822, so it's not exactly something odd).

-- Cheers, Andrej

pietroalbini commented 1 year ago

@silverhook I understand your desire for a format compatible with the wider SPDX ecosystem! I don't have a preference for either choice myself, but there are currently issues that I'd like to help fix that are blocked on this.

The point I was making was that to adopt REUSE.dep5 there is only a need for consensus within the REUSE project (as the format is already standardized and implemented within REUSE), while defining a YAML format requires resolving the open questions, designing the format, and gathering consensus within SPDX (with a lot more stakeholders in the room).

Of course I'm an outsider to the project, and I don't have many insights on how hard gathering the consensus within the REUSE project would be :slightly_smiling_face:


As I hinted before, I'm working to adopt REUSE in the Rust compiler, and we're facing some blocker issues:

I'm willing to help with some implementation work to solve the two issues I mentioned above, but designing and gathering consensus in SPDX for a suitable format is going to take more time than I can commit.

To be clear, I don't want to pressure you into making a choice you don't like just because we want to adopt REUSE in the Rust project. If we can't find a solution in the near term to those issues, we'll just have to create our own bespoke tooling and wait for those issues to be addressed before reconsidering REUSE.

Tachi107 commented 1 year ago

Citing @silverhook:

To use DEP5 in REUSE was a good hack early on, but as it (and SPDX) becomes more wide-spread, the problems, exceptions, workarounds etc. that REUSE would need to do to make DEP5(-ish) usable make it quite an obstacle.

As I asked in https://github.com/fsfe/reuse-docs/issues/81#issuecomment-1146888221, could you please explain why DEP5 doesn't currently suit REUSE's needs? Yes, it doesn't support all SPDX's features, but neither does REUSE. As far as I understand, SPDX's scope is far broader than just handling licensing information, while REUSE's goal is to "Make licensing easy for everyone", and DEP5's simple and limited format perfectly aligns with this goal, as I've been able to observe in different open source projects.

I don't know your plans for the future of REUSE, so I'm of course missing something. Hence, would you please help us better understand your point? Thanks :)

carmenbianca commented 1 year ago

  1. REUSE and Debian use DEP5 for very different purposes. In Debian, DEP5 is a comprehensive way to declare the copyright and licensing of a project. In REUSE, its design intent is a fallback to declare copyright and licensing for scenarios where headers or .license files are impossible or unwanted. You're not really supposed to copy a debian/copyright from Debian into the .reuse/dep5 of an upstream project. I outlined the reasons for this here. Using a non-DEP5 format helps underscore the difference in purpose.

  2. The python-debian dependency is not satisfactory:

  3. This issue doesn't reflect it, but we're thinking of extending the proposed syntax/format in this issue to define precedence (#70 adjacent) and overriding. I'm not entirely sure how DEP5 does precedence at the moment, but the results from DEP5 and the file headers are aggregated with no toggle to change this behaviour. We could put this toggle next to the glob in REUSE.yaml. Furthermore, and this issue also doesn't reflect this, we could extend the syntax to enable a glob scenario such as 'all files in docs/* except those with a certain file extension'. We get a lot more wiggle room for granularity when using a different format.

  4. This is subjective, but I think there's value in putting the configuration in a file format that developers are already familiar with. Right now, developers kind of have to divine how to write valid DEP5 from example, but they already know how to write valid YAML.

Tachi107 commented 1 year ago

Thanks for your nice and complete reply!

  1. REUSE and Debian use DEP5 for very different purposes. In Debian, DEP5 is a comprehensive way to declare the copyright and licensing of a project. In REUSE, its design intent is a fallback to declare copyright and licensing for scenarios where headers or .license files are impossible or unwanted. You're not really supposed to copy a debian/copyright from Debian into the .reuse/dep5 of an upstream project.

I completely agree with this point. In fact, I find it a bit odd that Rust decided not to add license headers to their files.

  2. The python-debian dependency is not satisfactory:

Yeah, that's true. If I were a Python guy I would've put some effort into moving the DEP5 parser in a separate, less Debian-specific package. But I'm not :/

  3. This issue doesn't reflect it, but we're thinking of extending the proposed syntax/format in this issue to define precedence (#70 adjacent) and overriding. I'm not entirely sure how DEP5 does precedence at the moment, but the results from DEP5 and the file headers are aggregated with no toggle to change this behaviour. We could put this toggle next to the glob in REUSE.yaml. Furthermore, and this issue also doesn't reflect this, we could extend the syntax to enable a glob scenario such as 'all files in docs/* except those with a certain file extension'.

Isn't option one in the linked issue independent of the file format? Also, I think that adding support in DEP5 for a glob like the one you mentioned ("all files in docs/* except those with a certain file extension") is something that could be useful to Debian too. Anyway, yes, DEP5 doesn't support, and likely never will, any overriding mechanism, but please keep in mind that adding such a feature could be a double-edged sword: ideally, REUSE.yaml (or REUSE.dep5) should be easily understandable without having to look too much at the documentation.

  4. This is subjective, but I think there's value in putting the configuration in a file format that developers are already familiar with. Right now, developers kind of have to divine how to write valid DEP5 from example, but they already know how to write valid YAML.

I'd argue that DEP5 is way more user friendly than YAML, especially if you've never used either of those before (and if you're not used to the concept that indentation really matters) - but as you say, this is subjective.

In any case, please keep in mind that Debian really cares about license compliance and copyright attributions (the copyright format was not created by accident!), and I'm sure some Debian folks (including me) would be more than glad to help with REUSE (with regards to evolving DEP5, making the python parser more portable and reliable, etc.) :)

pietroalbini commented 1 year ago

Thanks @carmenbianca for explaining the concerns you all have about using DEP5 for the new file format. Having more clarity on that rationale helps.

I'm wondering then, what are the next steps for this issue? Both of the issues preventing Rust from adopting REUSE are blocked on this issue, and while I have some time to spend on improvements to REUSE, gathering consensus for a format inside SPDX is something I unfortunately can't commit to.

You're not really supposed to copy a debian/copyright from Debian into the .reuse/dep5 of an upstream project. I outlined the reasons for this https://github.com/fsfe/reuse-tool/issues/605#issuecomment-1276560576. Using a non-DEP5 format helps underscore the difference in purpose.

I completely agree with this point. In fact, I find it a bit odd that Rust decided not to add license headers to their files.

Heh, I agree that in an ideal world adding per-file headers would be better, but there is opposition in the Rust project to adding those headers, and 5 years ago the project decided to remove the existing headers from the codebase. Having the licensing definitions in a centralized file is the compromise I managed to reach.

mxmehl commented 1 year ago

Thanks for the constructive exchange of opinions and arguments!

Heh, I agree that in an ideal world adding per-file headers would be better, but there is opposition in the Rust project to adding those headers, and 5 years ago the project decided to remove the existing headers from the codebase. Having the licensing definitions in a centralized file is the compromise I managed to reach.

I understand. Thanks for what you tried and accomplished!

I'm wondering then, what are the next steps for this issue? Both of the issues preventing Rust from adopting REUSE (https://github.com/fsfe/reuse-docs/issues/81#issuecomment-1274267827) are blocked on this issue, and while I have some time to spend on improvements to REUSE, gathering consensus for a format inside SPDX is something I unfortunately can't commit to.

Understandable. The REUSE team is working on creating a concrete proposal for including this in the next SPDX spec (whenever this will be released...) and will include some stakeholders later in the process to implement feedback early on and reduce friction. No concrete timeline yet and certainly nothing that's done in the next few weeks unfortunately.

silverhook commented 1 year ago

I already did in the REUSE chat, but I hereby publicly volunteer to take on the SPDX side of this. (This is not to contradict @mxmehl , but to support him and perhaps make the public message more clear that people are working on this.)

REUSE snippets support just about got into the last SPDX spec version on time, so there’s ample time until the next revision.

From what I can tell, the way we set up REUSE so far, it shouldn’t have a huge impact on SPDX anyway. So as long as someone keeps an eye on whether we’re using the right SPDX tags and not misusing them (again, I volunteer for that part), we should be able to draft a full reuse.yaml spec and then, if anything at all needs to be included in the SPDX spec, sync up with SPDX.

silverhook commented 1 year ago

I’m not happy with this discovery, esp. this late in the development of REUSE.yaml, but it does shed some light why some (apparently rightly so) look negatively on YAML.

https://ruudvanasseldonk.com/2023/01/11/the-yaml-document-from-hell

Perhaps TOML would be a better choice? (which itself is not free of criticism either, of course) :fearful:

Ultimately, there’s – surprise! ;) – no perfect format:

pietroalbini commented 1 year ago

In the Rust community we use TOML extensively and... it's fine.

In my experience TOML is fairly nice and concise if the schema is designed around the TOML structure and limitations, and painful if you just uplift the schema you used in YAML into TOML. The suggestion I can make if you want to go with TOML is to start designing the REUSE schema from scratch with it rather than just port the YAML work and serialize it in TOML.

silverhook commented 1 year ago

I’ve been toying with TOML (in a different and very limited use case) a bit and so far my biggest issues were in practice just two:

I think REUSE could definitely be done simply in TOML, if we decide for that instead. Neither of the two issues I ran into should come up in REUSE really.

A very good point, @pietroalbini, thanks for the tip!

mxmehl commented 1 year ago

Yeah, I recall that we talked about the issues of YAML already when we talked about whether it should rather be JSON. We didn't make a decision as both have problems - spec-wise or user-friendliness-wise. We also had a short look at StrictYAML, but as this post suggests it's far from perfect.

I waver between YAML and TOML.

For reference, here's the current format we came up with in internal exchanges:

version: 1
annotations:
- path: src/*
  SPDX-FileCopyrightText:
    - 2020 Me
    - © 2017 You
  SPDX-License-Identifier: MIT
- path: test.md
  SPDX-FileCopyrightText:
    - "(c) containing a : for some reason must be quoted"
  SPDX-License-Identifier: 0BSD

silverhook commented 1 year ago

Just as an exercise, I think a TOML version could look as such:

version = 1

[[annotations]]
path = "src/*"
SPDX-FileCopyrightText = [
  "2020 Me",
  "© 2017 You",
  "(c) whitespace/identing is optional gGmbH"
]
SPDX-License-Identifier = "MIT"

[[annotations]]
path = [ "test.md", "README.md" ]
SPDX-FileCopyrightText = "(c) a string must always be quoted"
SPDX-License-Identifier = "0BSD OR Unlicense"

I’m sure @pietroalbini can come up with a more elegant way than I.

pietroalbini commented 1 year ago

That actually looks fairly good and idiomatic @silverhook! The only change I'd make is replacing the SPDX- names and just have copyright and license. Those names are more concise and easier to type, but that'd also apply to the YAML version.

andrewshadura commented 1 year ago

The only change I'd make is replacing the SPDX- names and just have copyright and license. Those names are more concise and easier to type, but that'd also apply to the YAML version.

That’s what I suggested some time ago, and it was rejected 🙂

mxmehl commented 1 year ago

The only change I'd make is replacing the SPDX- names and just have copyright and license. Those names are more concise and easier to type, but that'd also apply to the YAML version.

We discussed that but decided to stick with the known tags to make it easy for users and scanners.

For instance, some people also use other SPDX tags in comment headers, e.g. SPDX-FileContributor. The REUSE.yaml could also be a place for this kind of information. So sticking with one standard makes things much easier.

Regarding scanners, it was mentioned that SPDX tags would trigger false-positives. This would happen anyway with all the IDs and copyright statements.

mxmehl commented 1 year ago

Just as an exercise, I think a TOML version could look as such:

LGTM, except one line:

path = [ "test.md", "README.md" ]

Do we want path to be either a string or a list of strings? My gut feeling says no as I'd rather prefer a longer file with one path description per item.

Generally, I feel that the lists using [...] is less user-friendly than just bullet points (via dashes), but on the other hand I fully appreciate that indentation doesn't play such a decisive role.

silverhook commented 1 year ago

I don’t have strong feelings either way on the “string vs list of strings” question. I leave that to people who use that more often than I do. (I’ll only add that it feels a bit odd that SPDX-FileCopyrightText can be a list, but path and/or SPDX-License-Identifier can’t)

If it turns out it’s more preferable to keep it simple, while more verbose, we could just say that path can only be a string for one file/folder/glob. (needs better wording, of course).

In that case my example would be then:

version = 1

[[annotations]]
path = "src/*"
SPDX-FileCopyrightText = [
  "2020 Me",
  "© 2017 You",
  "(c) whitespace/identing is optional gGmbH"
]
SPDX-License-Identifier = "MIT"

[[annotations]]
path = "test.md"
SPDX-FileCopyrightText = "(c) a string must always be quoted"
SPDX-License-Identifier = "0BSD OR Unlicense"

[[annotations]]
path = "README.md" 
SPDX-FileCopyrightText = "(c) a string must always be quoted"
SPDX-License-Identifier = "0BSD OR Unlicense"

Tachi107 commented 1 year ago

On Mon, 16 Jan 2023 at 12:14:41 -08:00:00, Matija Šuklje @.***> wrote:

(I’ll only add that it feels a bit odd that SPDX-FileCopyrightText can be a list, but path and/or SPDX-License-Identifier can’t)

Letting SPDX-License-Identifier be an array can be ambiguous. The Meson build system allows this in their license field, but then you cannot tell if [ "GPL-3.0-or-later", "ISC" ] means "GPL-3.0-or-later AND ISC" or "GPL-3.0-or-later OR ISC". Yes, you could say that "both means AND", but why introduce yet another idiom when SPDX license expressions work fine?

Here's the Meson PR for completeness: https://github.com/mesonbuild/meson/pull/9940

eli-schwartz commented 1 year ago

Have you considered https://nestedtext.org/ in the list of potential file formats?

silverhook commented 1 year ago

Letting SPDX-License-Identifier be an array can be ambiguous. The Meson build system allows this in their license field, but then you cannot tell if [ "GPL-3.0-or-later", "ISC" ] means "GPL-3.0-or-later AND ISC" or "GPL-3.0-or-later OR ISC". Yes, you could say that "both means AND", but why introduce yet another idiom when SPDX license expressions work fine?

@Tachi107, I absolutely agree! I am not saying we should let SPDX-License-Identifier be an array – quite the opposite! – just that it seems inconsistent to let one field be an array, if two fields are not allowed to be (one of which absolutely rightfully so).

@eli-schwartz, could you provide an example – perhaps translate the one from @mxmehl or me to NestedText? And how widely is it supported/implemented? At a quick glance it looks pretty simple and easy to grasp.

pietroalbini commented 1 year ago

I would allow both SPDX-FileCopyrightText and path to be either arrays and simple strings. The rationale for paths is, there can be some files that are logically licensed the same (even with the same rationale) but just happen not to be matched by a glob pattern. The description per item kinda breaks down with glob patterns already.
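If both forms were allowed, a tool could normalize them right after parsing so that the rest of the code only ever sees lists. A minimal sketch; `as_list` and `normalize` are hypothetical names, and the key names follow the examples above:

```python
def as_list(value):
    """Wrap a single string in a list; pass a list of strings through unchanged."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return value
    raise TypeError(f"expected a string or a list of strings, got {value!r}")


def normalize(entry):
    """Return a copy of entry where 'path' and 'SPDX-FileCopyrightText' are lists."""
    normalized = dict(entry)
    for key in ("path", "SPDX-FileCopyrightText"):
        if key in normalized:
            normalized[key] = as_list(normalized[key])
    # SPDX-License-Identifier is left alone: it stays a single license expression.
    return normalized


entry = {
    "path": "test.md",
    "SPDX-FileCopyrightText": "(c) a string must always be quoted",
    "SPDX-License-Identifier": "0BSD OR Unlicense",
}
print(normalize(entry))
```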

eli-schwartz commented 1 year ago

@eli-schwartz, could you provide an example – perhaps translate the one from @mxmehl or me to NestedText? And how widely is it supported/implemented? At a quick glance it looks pretty simple and easy to grasp.

An example might look like this:

version: 1
annotations:
    -
        path: src/*
        SPDX-FileCopyrightText:
            - 2020 Me
            - © 2017 You
        SPDX-License-Identifier: MIT
    -
        path: test.md
        SPDX-FileCopyrightText:
            - (c) containing a : for some reason must be quoted
        SPDX-License-Identifier: 0BSD

The official implementation is python, https://nestedtext.org/en/stable/related_projects.html lists e.g. golang and ruby implementations.

silverhook commented 1 year ago

How is it with this line then? Does the : not trigger a key:value scenario?

            - (c) containing a : for some reason must be quoted

silverhook commented 1 year ago

To answer my own question, it seems it avoids that pitfall (and quoting is not needed).

To cite the documentation:

Line-type tags:

Most remaining lines are identified by the presence of tags, where a tag is:

the first dash (-), colon (:), or greater-than symbol (>) on a line when followed immediately by an ASCII space or line break;

or a hash (#), left bracket ([), or left brace ({) as the first non-ASCII-space character on a line.

These symbols only introduce tags when they are the first non-ASCII-space character on a line, except for the colon (:) which introduces a dictionary item with an inline key midway through a line.

The first (left-most) tag on a line determines the line type. Once the first tag has been found on the line, any subsequent occurrences of any of the line-type tags are treated as simple text. For example:

 - And the winner is: {winner}

In this case the leading -␣ determines the type of the line and the :␣ is simply treated as part of the remaining text on the line.
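To check my reading of the cited rules, here is a simplified line classifier. It follows only the excerpt quoted above, it is not the NestedText reference implementation, and the line-type labels it returns are my own:

```python
def classify(line):
    """Return (line_type, text) for one line, following only the rules quoted above."""
    stripped = line.strip(" ")
    if stripped == "":
        return ("blank", "")
    first = stripped[0]
    # '#', '[' and '{' count as tags when they are the first non-space character.
    if first == "#":
        return ("comment", stripped[1:].strip())
    if first == "[":
        return ("inline list", stripped)
    if first == "{":
        return ("inline dict", stripped)
    # '-', ':' and '>' count only when followed by a space or the end of the line.
    names = {"-": "list item", ":": "key item", ">": "string item"}
    if first in names and (len(stripped) == 1 or stripped[1] == " "):
        # The first tag decides the line type; the rest is plain text.
        return (names[first], stripped[2:])
    # Otherwise the first ': ' (or a trailing ':') separates an inline key from its value.
    key, sep, value = stripped.partition(": ")
    if sep:
        return ("dict item", (key, value))
    if stripped.endswith(":"):
        return ("dict item", (stripped[:-1], ""))
    return ("unrecognized", stripped)


print(classify("            - (c) containing a : for some reason must be quoted"))
print(classify("        SPDX-License-Identifier: MIT"))
```

Under this reading the copyright line is a list item whose text still contains the colon, and the hyphens inside the SPDX key do not count as tags because they are neither first on the line nor followed by a space.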

silverhook commented 1 year ago

IMHO both TOML and NestedText would work. At this stage, perhaps the best would be to test all these formats with a larger and more complex example to see how they fare in real life examples.