Open wtraylor opened 1 year ago
Interesting bug! Thank you for reporting this.
I ran into this issue as well using datalad. Fixing it would be greatly appreciated!
In the context of a git repository, my proposal for a fix would be to differentiate between two types of symlinks:
`.git/annex/objects`, or ignored) -> consider the link to be a placeholder for a file that needs license and copyright information, therefore do not ignore the link
I am not sure how this would fit in with other VCSs though. Maybe if there is no VCS we should consider symlinks to files in the current directory and subdirectories of it to be of type 1 and symlinks pointing to something "external", i.e. outside of the current directory, to be of type 2. This assumes that the current directory can be considered a "project root".
I could take a stab at implementing this, but I would like some feedback on this proposal first.
@matrss
I'm not familiar enough with git-annex to have a good input on this. I think your proposal makes sense, even barring Git specifics, so:
But I'm a little uncertain. 'Outside the project' sounds a little out-of-scope for REUSE. I guess the question is: is data tracked by git-annex part of the project? If yes, let's lint the data. If no, let's not. If 'it depends', then let's pick one behaviour as default and add a flag to toggle the behaviour.
I err on the side of 'ignore all symlinks' as default behaviour, because it's heaps easier to document, and abides by the principle of least astonishment.
In a typical git-annex project you would have a number of "annex'ed" files. These files are simply symlinks tracked by git, which point to somewhere in `.git/annex/objects`. When freshly cloning such a repository these symlinks will be "broken", which just means that the actual data is not there yet. git-annex then provides a command, `git annex get <path>`, which can retrieve the specified file, meaning it will appear in `.git/annex/objects` and the link is no longer broken. This makes it pretty nice to manage data projects with many GBs of data.
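To illustrate, here is a minimal sketch of how one could recognise such a placeholder without touching the data behind it. The function name `is_annexed_symlink` and the marker constant are hypothetical and not part of reuse or git-annex; the only assumption is the one described above, that annexed files are symlinks whose target path goes through `.git/annex/objects`.

```python
import os

# Assumption: annexed files point somewhere below this directory.
ANNEX_MARKER = ".git/annex/objects/"


def is_annexed_symlink(path: str) -> bool:
    """Return True if `path` is a symlink whose target text points into the annex.

    Only the link itself is inspected (lstat + readlink), so this also works
    for "broken" links whose content has not been fetched yet.
    """
    if not os.path.islink(path):
        return False
    # Normalise separators so the check also works with Windows-style paths.
    target = os.readlink(path).replace(os.sep, "/")
    return ANNEX_MARKER in target
```

Because the check never opens the target, it behaves the same before and after `git annex get <path>`.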
Since the symlinks are simply placeholders for the actual data, the data should definitely be considered part of the project. But because the symlinks might be "broken", I don't think we should ~resolve them~ read their target file at all.
In the terminology of the REUSE spec I think we should consider a symlink to another file under the project root (which is not ignored by the VCS) to be the same "Covered File" as its target. Therefore we can ignore this symlink, since its target will be lint'ed. This would keep the behaviour expected in #202. But if the symlink points outside of the project (e.g. into .git or to an ignored file, or really outside the project root) we should consider the symlink itself to be a "Covered File". In that case we have to provide a *.license file next to the symlink or specify license information in .reuse/dep5.
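A rough sketch of that rule, to make the proposal concrete. This is not how reuse is implemented; `link_target`, `symlink_is_covered_file` and the `is_ignored` callback are hypothetical names, and the VCS ignore check is left to the caller (for git it could wrap `git check-ignore -q`). The target is computed lexically from the link text, so the target file is never read and broken links are fine.

```python
import os
from pathlib import Path
from typing import Callable


def link_target(link: Path) -> Path:
    """Compute where the link points, purely from the link text.

    Note: this is lexical normalisation. It does not follow intermediate
    symlinks, so ".." after a symlinked directory may be misjudged.
    """
    raw = os.readlink(link)
    return Path(os.path.normpath(os.path.join(link.parent, raw)))


def symlink_is_covered_file(
    link: Path,
    root: Path,
    is_ignored: Callable[[Path], bool] = lambda p: False,
) -> bool:
    """True  -> the symlink itself needs licensing information.
    False -> the symlink can be skipped because its target gets linted.

    `link` and `root` are expected to be absolute paths.
    """
    target = link_target(link)
    try:
        rel = target.relative_to(root)
    except ValueError:
        return True  # points outside the project root
    if rel.parts and rel.parts[0] == ".git":
        return True  # e.g. .git/annex/objects
    if is_ignored(target):
        return True  # target is ignored by the VCS, so it is never linted
    return False
```

With this rule an annexed file is treated as a Covered File in its own right, while a plain in-tree alias is still skipped, as expected in #202.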
Not resolving symlinks has two advantages:
From a high-level point of view, the linked-to files aren't really outside of the project. In the case of git-annex they are simply distributed in a more manageable/efficient way, but still are part of the repository. A symlink might also point to a shared location (maybe a network drive), and linking instead of copying is simply a storage optimization.
I don't think we should add a flag for this. Making the result of `reuse lint` depend on a flag implies different interpretations of the REUSE Specification on what is a Covered File. It would bring ambiguity to what is considered "reuse compliant" and what is not.
git-annex is a tool to manage big files in Git repositories. In science it is used by the Datalad community to manage datasets. git-annex works by managing symbolic links in the Git work tree which point to the actual file content in `.git/annex/objects`. Now, `reuse` seems to ignore such symbolic links.
Steps to reproduce
First issue is that `reuse lint` doesn’t complain about annexed files not having a license.
This becomes worse if I do assign a license, but then `reuse lint` fails because of an “unused” license. This leads to failed CI pipelines. So to continue the above:
Reuse version: 1.0.0
git-annex version: 10.20221103
Desired behavior
`reuse` follows symlinks, at least if they are annexed files, which means that the symlink points to something in `.git/annex/objects`.
I see that issue #202 discussed the topic of symlinks. I suggest revisiting the issue for git-annex.