Open mxmehl opened 3 years ago
When it comes to YAML flavours I think all should be OK – I guess we would use an external parser and linter anyway, right?
For files that `reuse.yaml` should target, I agree it should only affect its siblings and children. Parents etc. should be out of scope.
Regarding traditional copyright statements, I think it is reasonable to expect an SPDX tag, with free-form text after it. Non-SPDX-tag statements were accepted before for legacy reasons. The YAML file is going to be new, so no legacy exists for it. Even if someone has a preferred format, they can just prepend the SPDX tag to it.
Globbing – no preference, as long as it’s something that is in common practice and coherent.
Conflict resolution – I agree with your proposal.
I think that the syntax should avoid the strings “SPDX-License-Identifier:” and “SPDX-<tagname>:”. Those strings are likely to cause false positives. Tools that aren’t REUSE.yml aware will mistakenly assume that the data applies to REUSE.yml. Here’s my proposal:

    - files: "src/*"
      info:
        - "FileCopyrightText: 2020 Me"
        - "FileCopyrightText: © 2017 You"
        - "License-Identifier: MIT"

    - files: "src/*"
      info: |
        FileCopyrightText: 2020 Me
        FileCopyrightText: © 2017 You
        License-Identifier: MIT

    - files: "src/*"
      "FileCopyrightText":
        - "2020 Me"
        - "© 2017 You"
      "License-Identifier": MIT

If we do decide to drop the “SPDX-”, then I would recommend option 3. That way, if someone makes a mistake and includes the “SPDX-”, they have to do less to fix it.
I would also recommend making the REUSE Tool give a helpful error when this mistake happens. For example, it could say “Found ‘SPDX-License-Identifier’ in REUSE.yml. In REUSE.yml, use ‘License-Identifier’ instead (no ‘SPDX-’).”
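A minimal sketch of how such a check could look, assuming an entry has already been parsed into a Python dict. The function name `prefix_errors` and the key mapping are hypothetical and are not taken from the REUSE Tool; only the message wording follows the suggestion above.

```python
# Hypothetical mapping from the file-header tags to the prefix-free keys
# proposed above; names and structure are illustrative assumptions.
PREFIXED_TO_PLAIN = {
    "SPDX-License-Identifier": "License-Identifier",
    "SPDX-FileCopyrightText": "FileCopyrightText",
}


def prefix_errors(entry: dict) -> list[str]:
    """Return one message per key in `entry` that still carries "SPDX-"."""
    errors = []
    for key in entry:
        plain = PREFIXED_TO_PLAIN.get(key)
        if plain is not None:
            errors.append(
                f"Found '{key}' in REUSE.yml. "
                f"In REUSE.yml, use '{plain}' instead (no 'SPDX-')."
            )
    return errors


entry = {"files": "src/*", "SPDX-License-Identifier": "MIT"}
for message in prefix_errors(entry):
    print(message)
```

An entry that already uses the prefix-free keys produces an empty list, so the check can run unconditionally on every entry.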
Great catch, @Jayman2000! What you write makes sense to me. It does provide some extra complication, but seems worth it to me in order to avoid future issues.
Why not rename `FileCopyrightText` to `copyrights` and `License-Identifier` to `license`? A similar format is already used by scan-copyrights.
> Why not rename `FileCopyrightText` to `copyrights` and `License-Identifier` to `license`? A similar format is already used by scan-copyrights.
We would like to have the SPDX project make this part of their spec, too, in order to not create conflicts with other compliance tools and practices (see: spdx/spdx-spec#502).
In SPDX, there are multiple "license" fields, e.g. the concluded or declared license. I am afraid that this unclear terminology would not pass SPDX. However, the main goal is to avoid confusion: so either we stick with the tags that are already used in REUSE (except in DEP5) or we make them really simple (as you suggested).
> We would like to have the SPDX project make this part of their spec, too, in order to not create conflicts with other compliance tools and practices (see: spdx/spdx-spec#502, https://github.com/spdx/spdx-spec/issues/502).
While I can see why you might want that, I'm not sure that's a goal worth pursuing. One of the reasons I keep my usage of SPDX to the minimum is its verbosity. I fear if and when your proposal is merged into SPDX, it's going to become yet another verbose way of specifying licensing information people will avoid.
I'm also unsure why you want to deprecate DEP-5, which in my view is superior to many other similar formats. If something isn't quite right in it, I'd personally try to evolve it into a machine-readable copyright format 2.0 rather than abandon it completely.
Most of this looks good to me. I would like to add my two cents in regards to two things:
I don’t think YAML vs JSON is an issue with Python: there are multiple YAML libraries for Python (pyyaml, ruamel, strict-yaml), so YAML is quite well-supported. JSON is much less readable even when pretty-printed, it requires commas between list elements but not after, and I wouldn’t count on anything that generates it to actually pretty-print it. In my experience most generated JSON was dumped onto a single endless line, and most generated YAML was formatted and human-readable.
JSON is in the standard library and `json.dump()` supports decent printing with the `indent` parameter. Perhaps strict-yaml could serve a similar purpose, but most of the time, JSON is the stricter, more well defined version of YAML IMHO.
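As a quick illustration of the `indent` point, here is a minimal sketch using only the standard library (`json.dumps()` takes the same parameter as `json.dump()`). The `version`/`annotations` layout is placeholder data for the example, not a proposed schema.

```python
import json

# Placeholder data, only loosely modelled on the formats discussed in this thread.
data = {
    "version": 1,
    "annotations": [
        {"path": "src/*", "SPDX-License-Identifier": "MIT"},
    ],
}

# Default output: everything on a single line.
compact = json.dumps(data)

# With indent=2 the same data is spread over several indented lines.
pretty = json.dumps(data, indent=2, ensure_ascii=False)

print(compact)
print(pretty)
```

Both strings parse back to the same data; only the whitespace differs.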
> While I can see why you might want that, I'm not sure that's a goal worth pursuing. One of the reasons I keep my usage of SPDX to the minimum is its verbosity. I fear if and when your proposal is merged into SPDX, it's going to become yet another verbose way of specifying licensing information people will avoid.
The files we intend to use do not have much in common with a full SPDX SBOM, which I agree is impossible for humans to parse. However, making REUSE's labelling compatible with an ISO standard has the great advantage that the likelihood of being compatible with other tools and best practices is much higher.
I see the advantage of creating our own specs, but following the practice of "not invented here" even if there are somewhat good alternatives has seldom advanced technology.
> I'm also unsure why you want to deprecate DEP-5, which in my view is superior to many other similar formats. If something isn't quite right in it, I'd personally try to evolve it into a machine-readable copyright format 2.0 rather than abandon it completely.
Please read the full discussion and proposal that I've linked in the first post. There are good reasons why DEP-5 is not ideal for our purpose: https://lists.fsfe.org/pipermail/reuse/2020q3/000085.html
Reading the discussion on https://github.com/spdx/spdx-spec/issues/502, one thing stands out to me: the desire to align with the SPDX YAML. I think the current thoughts best align with the `files` section. The `packages` section is apparently of interest to the community, as noted in the same thread, but that might be out of scope for now. So I think we need to look closer at https://github.com/spdx/spdx-spec/blob/e25d183ade64c123770412297b9bf5086a7ed0bf/examples/SPDXYAMLExample-2.2.spdx.yaml#L241
Based on that I would consider a file like:

    ---
    spdxVersion: "SPDX-2.3" # mandatory to allow future spec changes
    creationInfo: # optional
      comment: "Easily add metadata to image files."
      created: "2022-05-25"
      # and other metadata if desired
    # FIXME: perhaps needs information that this is to be considered input, not output
    files:
      # In line with SPDX YAML output
      - copyrightText: "Copyright Photographer X"
        fileContributors: ["Photographer X"] # optional
        licenseConcluded: "CC-BY-4.0"
        fileName: "./images/other-author.jpg"
      # My main proposal for simplicity
      - fileGlob: "./images/*.jpg" # or another term, but to differentiate from 'fileName'
        copyrightText: |
          Copyright 2022 Photographer X
          Copyright © 2022 Image editor Y
        fileContributors:
          - "Photographer X"
          - "Image editor Y"
        licenseConcluded: "CC-BY-4.0" # I don't see a reason to change the key, or is there?

I know the format is quite different from earlier proposals. Regarding `fileGlob`, another idea I have is the term `filePath`.
I am stepping into this discussion quite late, so feel free to point out my false reasoning.
> Please read the full discussion and proposal that I've linked in the first post. There are good reasons why DEP-5 is not ideal for our purpose: https://lists.fsfe.org/pipermail/reuse/2020q3/000085.html
Apart from having to put the file in `.reuse/`, what's the issue with dep5? I might be biased as I'm involved in Debian stuff, but it seems that so far that format has served users well (well defined, widely used, easy to write, concise).
Instead of creating a new YAML format, have you considered extending dep5 support so that it is possible to put files at any directory level? As with what you are proposing for `REUSE.yaml`, users would be able to create different dep5 files named `REUSE.dep5` at any point in their directory hierarchy. This would fix one major limitation of the current dep5 integration, while avoiding annoying users that would have to migrate their (possibly large) `.reuse/dep5` files to a new incompatible format.
Also, from the linked email:
> The first downside of DEP5 is that the tags are different from the normal SPDX/REUSE tags

Using `License` instead of `SPDX-License-Identifier` isn't that big of a deal IMO, as the extra verbosity of the file tag is needed so that it can be easily extracted from general files; an ad-hoc file doesn't need extra qualifiers. As for `Copyright`, it is a REUSE tag. Also, judging from the proposals above, it seems that keys would also differ in this new format (`copyrightText` vs `SPDX-FileCopyrightText` and `licenseConcluded` vs `SPDX-License-Identifier`).
> [dep5] requires some other meta information out of REUSE's scope

The only required information that's not directly related to REUSE is the `Format` key, which would be needed in a custom YAML format anyway to allow format changes.
On the other hand if this YAML format gets standardized as an official SPDX format and it is not too verbose it would be nice to adopt it instead :)
Edit: forgot to mention, but implementation details such as Python's standard library support for YAML, JSON, etc should not be a high priority (I wouldn't consider them at all... one of the points of standardizing a format is the possibility of having different interoperable implementations, regardless of the programming language used)
@mxmehl to follow up on the issues I identified in https://github.com/rust-lang/rust/pull/99415#issuecomment-1219355327, I'm wondering whether Tachi's proposal of a `REUSE.dep5` file rather than (or in addition to) `REUSE.yaml` would be accepted.
The discussion to define the YAML format seems to have stalled on the SPDX side, and implementing `REUSE.dep5` seems to require way less design work and consensus gathering, at least from my outside perspective.
Quite the opposite, I’m afraid, @pietroalbini.
There are several points where DEP5 (mostly, but not only, due to historical reasons) differs from SPDX and REUSE.
To use DEP5 in REUSE was a good hack early on, but as it (and SPDX) becomes more wide-spread, the problems, exceptions, workarounds etc. that REUSE would need to do to make DEP5(-ish) usable make it quite an obstacle.
And bending DEP5 to suit REUSE seems to break much more than creating our own SPDX(-derived) YAML format.
I don't get it. The machine-readable copyright format, which you not quite correctly refer to as DEP5, has been in use in Debian for quite a long time, more than a decade if I remember correctly. So far, as far as I'm aware, we haven't received requests for improvement from REUSE, but if we did, I'm certain they could eventually result in a version 1.1 or even 2.0. After all, the goal of the format was to provide a human- and machine-readable way of documenting license and copyright information, so if it didn't fulfill that goal, improving it was never off the table.
The only real downside of it as opposed to a YAML-based format is a need for a parser, but that's been solved ages ago (and also the format is a composition of well-known standards such as RFC 822, so it's not exactly something odd).
-- Cheers, Andrej
@silverhook I understand your desire for a format compatible with the wider SPDX ecosystem! I don't have a preference for either choice myself, but there are currently issues that I'd like to help fix that are blocked on this.
The point I was making was that to adopt `REUSE.dep5` there is only a need for consensus within the REUSE project (as the format is already standardized and implemented within REUSE), while defining a YAML format requires resolving the open questions, designing the format, and gathering consensus within SPDX (with a lot more stakeholders in the room).
Of course I'm an outsider to the project, and I don't have many insights on how hard gathering the consensus within the REUSE project would be :slightly_smiling_face:
As I hinted before, I'm working to adopt REUSE in the Rust compiler, and we're facing some blocking issues:
- `.reuse/dep5` combined with per-file license annotations produces incorrect results most of the time (at least for Rust), as REUSE considers both the licenses in the dep5 and the files at the same time. Work to define a more consistent precedence in https://github.com/fsfe/reuse-docs/issues/70 is blocked on having a `REUSE.yaml`.
- `.reuse/dep5` breaks our monorepo approach (we're using git subtrees to merge other repositories into the monorepo, so `--include-submodules` doesn't work). The solution to that would be `REUSE.yaml`, as you can have multiple of them, but as https://github.com/fsfe/reuse-docs/issues/90 correctly points out, that is blocked on this issue.

I'm willing to help with some implementation work to solve the two issues I mentioned above, but designing and gathering consensus in SPDX for a suitable format is going to take more time than I can commit.
To be clear, I don't want to pressure you into making a choice you don't like just because we want to adopt REUSE in the Rust project. If we can't find a solution in the near term to those issues, we'll just have to create our own bespoke tooling and wait for those issues to be addressed before reconsidering REUSE.
Citing @silverhook:
> To use DEP5 in REUSE was a good hack early on, but as it (and SPDX) becomes more wide-spread, the problems, exceptions, workarounds etc. that REUSE would need to do to make DEP5(-ish) usable make it quite an obstacle.
As I asked in https://github.com/fsfe/reuse-docs/issues/81#issuecomment-1146888221, could you please explain why DEP5 doesn't currently suit REUSE's needs? Yes, it doesn't support all SPDX's features, but neither does REUSE. As far as I understand, SPDX's scope is far broader than just handling licensing information, while REUSE's goal is to "Make licensing easy for everyone", and DEP5's simple and limited format perfectly aligns with this goal, as I've been able to observe in different open source projects.
I don't know your plans for the future of REUSE, so I'm of course missing something. Hence, would you please help us better understand your point? Thanks :)
- REUSE and Debian use DEP5 for very different purposes. In Debian, DEP5 is a comprehensive way to declare the copyright and licensing of a project. In REUSE, its design intent is a fallback to declare copyright and licensing for scenarios where headers or .license files are impossible or unwanted. You're not really supposed to copy a `debian/copyright` from Debian into the `.reuse/dep5` of an upstream project. I outlined the reasons for this here (https://github.com/fsfe/reuse-tool/issues/605#issuecomment-1276560576). Using a non-DEP5 format helps underscore the difference in purpose.
- The `python-debian` dependency is not satisfactory:
- This issue doesn't reflect it, but we're thinking of extending the proposed syntax/format in this issue to define precedence (#70 adjacent) and overriding. I'm not entirely sure how DEP5 does precedence at the moment, but the results from DEP5 and the file headers are aggregated with no toggle to change this behaviour. We could put this toggle next to the glob in REUSE.yaml. Furthermore (and this issue also doesn't reflect this), we could further extend the syntax to enable a glob scenario such as 'all files in docs/* except those with a certain file extension'. We get a lot more wiggle room for granularity when using a different format.
- This is subjective, but I think there's value in putting the configuration in a file format that developers are already familiar with. Right now, developers kind of have to divine how to write valid DEP5 from example, but they already know how to write valid YAML.
Thanks for your nice and complete reply!
> - REUSE and Debian use DEP5 for very different purposes. In Debian, DEP5 is a comprehensive way to declare the copyright and licensing of a project. In REUSE, its design intent is a fallback to declare copyright and licensing for scenarios where headers or .license files are impossible or unwanted. You're not really supposed to copy a `debian/copyright` from Debian into the `.reuse/dep5` of an upstream project.

I completely agree with this point. In fact, I find it a bit odd that Rust decided not to add license headers to their files.
> - The `python-debian` dependency is not satisfactory:

Yeah, that's true. If I were a Python guy I would've put some effort into moving the DEP5 parser in a separate, less Debian-specific package. But I'm not :/
> - This issue doesn't reflect it, but we're thinking of extending the proposed syntax/format in this issue to define precedence (#70 adjacent) and overriding. I'm not entirely sure how DEP5 does precedence at the moment, but the results from DEP5 and the file headers are aggregated with no toggle to change this behaviour. We could put this toggle next to the glob in REUSE.yaml. Furthermore (and this issue also doesn't reflect this), we could further extend the syntax to enable a glob scenario such as 'all files in docs/* except those with a certain file extension'.

Isn't option one in the linked issue independent of the file format? Also, I think that adding support in DEP5 for a glob like the one you mentioned ("all files in docs/* except those with a certain file extension") is something that could be useful to Debian too. Anyway, yes, DEP5 doesn't support, and likely never will, any overriding mechanism, but please keep in mind that adding such a feature could be a double-edged sword - ideally, REUSE.yaml (or REUSE.dep5) should be easily understandable without having to look too much at the documentation.
> - This is subjective, but I think there's value in putting the configuration in a file format that developers are already familiar with. Right now, developers kind of have to divine how to write valid DEP5 from example, but they already know how to write valid YAML.

I'd argue that DEP5 is way more user friendly than YAML, especially if you've never used either of those before (and if you're not used to the concept that indentation really matters) - but as you say, this is subjective.
In any case, please keep in mind that Debian really cares about license compliance and copyright attributions (the copyright format was not created by accident!), and I'm sure some Debian folks (including me) would be more than glad to help with REUSE (with regards to evolving DEP5, making the python parser more portable and reliable, etc.) :)
Thanks @carmenbianca for explaining the concerns you all have about using DEP5 for the new file format. Having more clarity on that rationale helps.
I'm wondering then, what are the next steps for this issue? Both of the issues preventing Rust from adopting REUSE are blocked on this issue, and while I have some time to spend on improvements to REUSE, gathering consensus for a format inside SPDX is something I unfortunately can't commit to.
> > You're not really supposed to copy a debian/copyright from Debian into the .reuse/dep5 of an upstream project. I outlined the reasons for this https://github.com/fsfe/reuse-tool/issues/605#issuecomment-1276560576. Using a non-DEP5 format helps underscore the difference in purpose.
>
> I completely agree with this point. In fact, I find it a bit odd that Rust decided not to add license headers to their files.

Heh, I agree that in an ideal world adding per-file headers would be better, but there is opposition in the Rust project to adding those headers, and 5 years ago the project decided to remove the existing headers from the codebase. Having the licensing definitions in a centralized file is the compromise I managed to reach.
Thanks for the constructive exchange of opinions and arguments!
> Heh, I agree that in an ideal world adding per-file headers would be better, but there is opposition in the Rust project to adding those headers, and 5 years ago the project decided to remove the existing headers from the codebase. Having the licensing definitions in a centralized file is the compromise I managed to reach.

I understand. Thanks for what you tried and accomplished!
> I'm wondering then, what are the next steps for this issue? Both of the issues preventing Rust from adopting REUSE (https://github.com/fsfe/reuse-docs/issues/81#issuecomment-1274267827) are blocked on this issue, and while I have some time to spend on improvements to REUSE, gathering consensus for a format inside SPDX is something I unfortunately can't commit to.

Understandable. The REUSE team is working on creating a concrete proposal for including this in the next SPDX spec (whenever this will be released...) and will include some stakeholders later in the process to implement feedback early on and reduce friction. No concrete timeline yet and certainly nothing that's done in the next few weeks unfortunately.
I already did in the REUSE chat, but I hereby publicly volunteer to take on the SPDX side of this. (This is not to contradict @mxmehl , but to support him and perhaps make the public message more clear that people are working on this.)
REUSE snippets support just about got into the last SPDX spec version on time, so there’s ample time until the next revision.
From what I can tell, the way we set up REUSE so far, it shouldn’t be a huge impact on SPDX anyway. So as long as someone keeps an eye that we’re using the right SPDX tags and not misusing them (again, I volunteer for that part), we should be able to draft a full `reuse.yaml` spec and then, if anything at all needs to be included in the SPDX spec, sync up with SPDX.
I’m not happy with this discovery, esp. this late in the development of `REUSE.yaml`, but it does shed some light on why some (apparently rightly so) look negatively on YAML.
https://ruudvanasseldonk.com/2023/01/11/the-yaml-document-from-hell
Perhaps TOML would be a better choice? (which itself is not free of criticism either, of course) :fearful:
Ultimately, there’s – surprise! ;) – no perfect format:
In the Rust community we use TOML extensively and... it's fine.
In my experience TOML is fairly nice and concise if the schema is designed around the TOML structure and limitations, and painful if you just uplift the schema you used in YAML into TOML. The suggestion I can make if you want to go with TOML is to start designing the REUSE schema from scratch with it rather than just port the YAML work and serialize it in TOML.
I’ve been toying with TOML (in a different and very limited use case) a bit and so far my biggest issues were in practice just two:
- `"` in keys are fine (you need them if you want spaces in keys), and `"` in values force it to be a string. So a value of `"20"` is not the same as `20` – which is a bit confusing, but not terribly so

I think REUSE could definitely be done simply in TOML, if we decide for that instead. Neither of the two issues I ran into should come up in REUSE really.
A very good point, @pietroalbini, thanks for the tip!
Yeah, I recall that we talked about the issues of YAML already when we talked about whether it should rather be JSON. We didn't make a decision as both have problems - spec-wise or user-friendliness-wise. We also had a short look at StrictYAML, but as this post suggests it's far from perfect.
I waver between YAML and TOML.
For reference, here's the current format we came up with in internal exchanges:

    version: 1
    annotations:
      - path: src/*
        SPDX-FileCopyrightText:
          - 2020 Me
          - © 2017 You
        SPDX-License-Identifier: MIT
      - path: test.md
        SPDX-FileCopyrightText:
          - "(c) containing a : for some reason must be quoted"
        SPDX-License-Identifier: 0BSD

Just as an exercise, I think a TOML version could look as such:

    version = 1

    [[annotations]]
    path = "src/*"
    SPDX-FileCopyrightText = [
        "2020 Me",
        "© 2017 You",
        "(c) whitespace/indenting is optional gGmbH"
    ]
    SPDX-License-Identifier = "MIT"

    [[annotations]]
    path = [ "test.md", "README.md" ]
    SPDX-FileCopyrightText = "(c) a string must always be quoted"
    SPDX-License-Identifier = "0BSD OR Unlicense"

I’m sure @pietroalbini can come up with a more elegant way than I.
That actually looks fairly good and idiomatic @silverhook! The only change I'd make is replacing the `SPDX-` names and just have `copyright` and `license`. Those names are more concise and easier to type, but that'd also apply to the YAML version.
> The only change I'd make is replacing the `SPDX-` names and just have `copyright` and `license`. Those names are more concise and easier to type, but that'd also apply to the YAML version.

That’s what I suggested some time ago, and it was rejected 🙂
> The only change I'd make is replacing the `SPDX-` names and just have `copyright` and `license`. Those names are more concise and easier to type, but that'd also apply to the YAML version.

We discussed that but decided to stick with the known tags to make it easy for users and scanners.
For instance, some people also use other SPDX tags in comment headers, e.g. `SPDX-FileContributor`. The REUSE.yaml could also be a place for this kind of information. So sticking with one standard makes things much easier.
Regarding scanners, it was mentioned that SPDX tags would trigger false-positives. This would happen anyway with all the IDs and copyright statements.
> Just as an exercise, I think a TOML version could look as such:

LGTM, except one line:
> `path = [ "test.md", "README.md" ]`

Do we want `path` to be either a string or a list of strings? My gut feeling says no, as I'd prefer a longer file with one path description per item.
Generally, I feel that lists using `[...]` are less user-friendly than bullet points (via dashes), but on the other hand I fully appreciate that indentation doesn't play such a decisive role.
I don’t have strong feelings either way on the “string vs list of strings” question. I leave that to people who use that more often than I do. (I’ll only add that it feels a bit odd that `SPDX-FileCopyrightText` can be a list, but `path` and/or `SPDX-License-Identifier` can’t.)
If it turns out to be preferable to keep it simple, even if more verbose, we could just say that `path` can only be a string for one file/folder/glob (needs better wording, of course).
In that case my example would then be:

    version = 1

    [[annotations]]
    path = "src/*"
    SPDX-FileCopyrightText = [
        "2020 Me",
        "© 2017 You",
        "(c) whitespace/indenting is optional gGmbH"
    ]
    SPDX-License-Identifier = "MIT"

    [[annotations]]
    path = "test.md"
    SPDX-FileCopyrightText = "(c) a string must always be quoted"
    SPDX-License-Identifier = "0BSD OR Unlicense"

    [[annotations]]
    path = "README.md"
    SPDX-FileCopyrightText = "(c) a string must always be quoted"
    SPDX-License-Identifier = "0BSD OR Unlicense"

On Mon, 16 Jan 2023 at 12:14:41 -08:00:00, Matija Šuklje @.***> wrote:
> (I’ll only add that it feels a bit odd that SPDX-FileCopyrightText can be a list, but path and/or SPDX-License-Identifier can’t)

Letting SPDX-License-Identifier be an array can be ambiguous. The Meson build system allows this in their `license` field, but then you cannot tell if [ "GPL-3.0-or-later", "ISC" ] means "GPL-3.0-or-later AND ISC" or "GPL-3.0-or-later OR ISC". Yes, you could say that "both means AND", but why introduce yet another idiom when SPDX license expressions work fine?
Here's the Meson PR for completeness: https://github.com/mesonbuild/meson/pull/9940
Have you considered https://nestedtext.org/ in the list of potential file formats?
> Letting SPDX-License-Identifier be an array can be ambiguous. The Meson build system allows this in their `license` field, but then you cannot tell if [ "GPL-3.0-or-later", "ISC" ] means "GPL-3.0-or-later AND ISC" or "GPL-3.0-or-later OR ISC". Yes, you could say that "both means AND", but why introduce yet another idiom when SPDX license expressions work fine?

@Tachi107, I absolutely agree! I am not saying we should let `SPDX-License-Identifier` be an array – quite the opposite! – just that it seems inconsistent to let one field be an array, if two fields are not allowed to be (one of which absolutely rightfully so).
@eli-schwartz, could you provide an example – perhaps translate the one from @mxmehl or me to NestedText? And how widely is it supported/implemented? At a quick glance it looks pretty simple and easy to grasp.
I would allow both `SPDX-FileCopyrightText` and `path` to be either arrays or simple strings. The rationale for paths is, there can be some files that are logically licensed the same (even with the same rationale) but just happen not to be matched by a glob pattern. The description per item kinda breaks down with glob patterns already.
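If a field may be either a string or an array, a reader of the file would typically normalise it first. A minimal sketch of that idea follows; the helper name `as_list` is hypothetical and not part of any existing tool.

```python
def as_list(value):
    """Normalise a string-or-list field to a list of strings."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return list(value)
    raise TypeError(f"expected a string or a list of strings, got {value!r}")


# Both spellings of `path` end up in the same shape.
print(as_list("src/*"))
print(as_list(["test.md", "README.md"]))
```

Anything else (a number, a nested list) is rejected rather than silently coerced.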
> @eli-schwartz, could you provide an example – perhaps translate the one from @mxmehl or me to NestedText? And how widely is it supported/implemented? At a quick glance it looks pretty simple and easy to grasp.

An example might look like this:

    version: 1
    annotations:
        -
            path: src/*
            SPDX-FileCopyrightText:
                - 2020 Me
                - © 2017 You
            SPDX-License-Identifier: MIT
        -
            path: test.md
            SPDX-FileCopyrightText:
                - (c) containing a : for some reason must be quoted
            SPDX-License-Identifier: 0BSD

The official implementation is in Python; https://nestedtext.org/en/stable/related_projects.html lists others, e.g. Go and Ruby implementations.
How is it with this line then? Does the `:` not trigger a key:value scenario?

    - (c) containing a : for some reason must be quoted

To answer my own question, it seems it avoids that pitfall (and quoting is not needed).
To cite the documentation:
> Line-type tags:
>
> Most remaining lines are identified by the presence of tags, where a tag is:
> - the first dash (`-`), colon (`:`), or greater-than symbol (`>`) on a line when followed immediately by an ASCII space or line break;
> - or a hash (`#`), left bracket (`[`), or left brace (`{`) as the first non-ASCII-space character on a line.
>
> These symbols only introduce tags when they are the first non-ASCII-space character on a line, except for the colon (`:`) which introduces a dictionary item with an inline key midway through a line.
>
> The first (left-most) tag on a line determines the line type. Once the first tag has been found on the line, any subsequent occurrences of any of the line-type tags are treated as simple text. For example:
>
>     - And the winner is: {winner}
>
> In this case the leading `-␣` determines the type of the line and the `:␣` is simply treated as part of the remaining text on the line.

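The "first tag wins" rule quoted above can be approximated in a few lines. This is a deliberately simplified sketch for illustration only: it is not the NestedText parser, it ignores multiline keys and several other cases from the specification, and the function name and return labels are made up.

```python
def line_type(line: str) -> str:
    """Classify a line by its first (left-most) tag, in a simplified way."""
    text = line.lstrip(" ")
    if not text:
        return "blank"
    if text[0] == "#":
        return "comment"
    if text[0] in "[{":
        return "inline list or dict"
    # A dash or '>' only counts when it is the first character on the line
    # and is followed by a space or the end of the line.
    for tag, kind in (("-", "list item"), (">", "string item")):
        if text == tag or text.startswith(tag + " "):
            return kind
    # A colon may appear midway through a line (inline key), again only
    # when followed by a space or the end of the line.
    if ": " in text or text.endswith(":"):
        return "dict item"
    return "unrecognised"


print(line_type("- (c) containing a : for some reason must be quoted"))
print(line_type("path: src/*"))
```

Because the leading `- ` is found first, the later ` : ` in the copyright line never gets a chance to start a dictionary item, which matches the behaviour described in the citation.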
IMHO both TOML and NestedText would work. At this stage, perhaps the best would be to test all these formats with a larger and more complex example to see how they fare in real life examples.
As discussed in spdx/spdx-spec#502, the SPDX project plans to support a "metadata, pre-document file" that contains specific information about files relative to its position. This follows a request to implement something called REUSE.yaml, first discussed here. This issue is to discuss the exact format and syntax of the file.
Proposed YAML options
In the original discussion, we proposed four different syntaxes. One of them (also disliked by the REUSE team) has been turned down in an SPDX call. I removed two others as they are rather unintuitive and clumsy. Also, I changed the format a bit to comply with the YAML syntax (using `*` as key name is invalid), and added another option.

Option 1: list
Each list item is an SPDX tag as used in file headers. Easy to read thanks to the `-`, but all items must be wrapped in `"` to escape the `:` which would separate a key from a value – we cannot have multiple keys!

Option 2: multi-line string
SPDX tags are just separated by new lines. No `-` or escaping of `:` is required. However, indentation must be preserved for all lines!

Option 3: license and copyright as separate keys
We could also separate the two information items. Downside: the keys must be wrapped in `"` to escape the `-` in the key name.

Background on the YAML keys
Unlike the SPDX YAML format, we would like to avoid `copyrightText` and `licenseDeclared` as key names. In REUSE, the `SPDX-License-Identifier` and `SPDX-FileCopyrightText` (or alternatively traditional, varying copyright statements) are common and understood by the users.
This was also accepted in the SPDX call.

Possible targets
REUSE.yaml is intended to target files that are relative to its position, and only those that are "below".
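One possible way to enforce the "only below" rule is to reject any pattern that is absolute or contains a `..` component. The sketch below is an assumption about how this could be done, not a description of the REUSE tool; the function name `targets_below` is hypothetical.

```python
from pathlib import PurePosixPath


def targets_below(pattern: str) -> bool:
    """Return True if `pattern` cannot point outside the file's own directory."""
    path = PurePosixPath(pattern)
    # Conservative check: any ".." component is rejected, even if the
    # pattern would end up back inside the directory.
    return not path.is_absolute() and ".." not in path.parts


print(targets_below("src/*"))
print(targets_below("../../src/*"))
```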
Statements like `files: "../../src/*"` should not be possible.

Supporting traditional copyright statements?
A related question is whether we should only support `SPDX-FileCopyrightText` as indicator for files' copyright, or also "traditional" statements like "Copyright © 2021 Jane Doe".
REUSE recommends the SPDX tag, but also supports the traditional statements. My suggestion would be to do the same in REUSE.yaml to reduce friction, but in SPDX this could lead to conflicts. Happy to collect opinions here!

Globbing
DEP-5 uses a simple glob syntax. In this, `*/Makefile` would include any Makefile in all paths below. I am not sure whether this globbing is represented in any native Python module. The benefit of sticking with the DEP-5 glob is that we could more easily convert existing DEP-5 files to REUSE.yaml.
Another possibility would be using the Python-native glob. `*/Makefile` would only match a Makefile in one level below, while `**/Makefile` would match all Makefiles.
We could also use pathspec, supporting the same globbing as `gitignore`.

Conflict resolution
As in DEP-5, I would suggest that the last match of a file wins. So if the file `foo.txt` is first matched by `*` and then `*.txt`, the last statement would count.
The dependency resolution within REUSE and its different options – including REUSE.yaml – is discussed in #70.
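
The "last match wins" rule can be sketched as a simple loop over the annotations in file order. This is an illustration under assumptions: the `resolve` helper and the `(glob, info)` pairs are hypothetical, and `fnmatch` is used only as a stand-in matcher (its `*` also matches `/`), not as a decision on the globbing question above.

```python
from fnmatch import fnmatchcase


def resolve(path, annotations):
    """Return the info of the last annotation whose glob matches `path`."""
    result = None
    for glob, info in annotations:
        if fnmatchcase(path, glob):
            # A later match simply replaces an earlier one.
            result = info
    return result


# Hypothetical annotations, listed in the order they appear in the file.
annotations = [
    ("*", {"SPDX-License-Identifier": "MIT"}),
    ("*.txt", {"SPDX-License-Identifier": "CC0-1.0"}),
]

print(resolve("foo.txt", annotations))
print(resolve("foo.py", annotations))
```

Here `foo.txt` is matched by both globs and ends up with the info from `*.txt`, while `foo.py` only matches `*`.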