fsfe / reuse-tool

reuse is a tool for compliance with the REUSE recommendations.
https://reuse.software

Symbolic links (git-annex files) are ignored #627

Open wtraylor opened 1 year ago

wtraylor commented 1 year ago

git-annex is a tool to manage big files in Git repositories. In science it is used by the Datalad community to manage datasets. Git-annex works by managing symbolic links in the Git work tree which point to the actual file content in .git/annex/objects. Currently, reuse seems to ignore such symbolic links.

Steps to reproduce

The first issue is that reuse lint doesn’t complain about annexed files that have no license.

git init test
cd test
git annex init
echo "hello" > my_file
git annex add my_file # The file becomes a symbolic link.
reuse lint # Good result despite missing license

It gets worse if I do assign a license: reuse lint then fails because of an “unused” license, which leads to failed CI pipelines. Continuing the example above:

reuse addheader -l'CC0-1.0' -c'author' --force-dot-license my_file
reuse download "CC0-1.0"
reuse lint # fails despite CC0-1.0 being used

Reuse version: 1.0.0
git-annex version: 10.20221103

Desired behavior

reuse follows symlinks, at least if they are annexed files, which means that the symlink points to something in .git/annex/objects.


I see that issue #202 discussed the topic of symlinks. I suggest revisiting that issue for git-annex.

carmenbianca commented 1 year ago

Interesting bug! Thank you for reporting this.

matrss commented 1 year ago

I ran into this issue as well using datalad. Fixing it would be greatly appreciated!

In the context of a git repository, my proposal for a fix would be to differentiate between two types of symlinks:

  1. symlink points to another file that is tracked by git -> consider them to be the same file, which means we can safely ignore the link
  2. symlink points to something not tracked by git (i.e. outside of the repo, like .git/annex/objects, or ignored) -> consider the link to be a placeholder for a file that needs license and copyright information, therefore do not ignore the link

I am not sure how this would fit in with other VCSs though. Maybe if there is no VCS we should consider symlinks to files in the current directory and its subdirectories to be of type 1, and symlinks pointing to something "external", i.e. outside of the current directory, to be of type 2. This assumes that the current directory can be considered a "project root".
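To make the two types concrete, here is a rough Python sketch of the classification. It is only an illustration under assumptions, not code from reuse: the function name `classify_symlink` and the `untracked_dirs` parameter are hypothetical, and "not tracked by the VCS" is approximated by "lives under .git" instead of asking Git. The target is resolved lexically, so the link is never followed.

```python
import os
from pathlib import Path


def classify_symlink(link, project_root, untracked_dirs=(".git",)):
    """Return "internal" (type 1) or "external" (type 2) for a symlink.

    Hypothetical helper. The link target is computed from the link text
    alone, so this also works when the target does not exist.
    """
    link = Path(link)
    root = os.path.abspath(project_root)
    # Read the link text without following the link.
    target = os.readlink(link)
    # A relative target is relative to the directory that holds the link.
    absolute = os.path.normpath(
        os.path.join(os.path.abspath(link.parent), target)
    )
    # Anything that lands outside the project root is type 2.
    if os.path.commonpath([root, absolute]) != root:
        return "external"
    # Assumption: directories such as .git stand in for "not tracked".
    parts = Path(os.path.relpath(absolute, root)).parts
    if parts and parts[0] in untracked_dirs:
        return "external"
    return "internal"
```

Because the path is normalised lexically, a `..` that crosses another symlink could be misjudged. A real implementation inside a Git repository would more likely ask the VCS which paths are tracked.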

I could take a stab at implementing this, but I would like some feedback on this proposal first.

carmenbianca commented 1 year ago

@matrss

I'm not familiar enough with git-annex to give good input on this. I think your proposal makes sense, even barring Git specifics, so:

  1. If symlink points to another file under project root -> ignore the symlink.
  2. If symlink points to a file outside the project -> resolve the symlink.

But I'm a little uncertain. 'Outside the project' sounds a little out-of-scope for REUSE. I guess the question is: is data tracked by git-annex part of the project? If yes, let's lint the data. If no, let's not. If 'it depends', then let's pick one behaviour as default and add a flag to toggle the behaviour.

I err on the side of 'ignore all symlinks' as default behaviour, because it's heaps easier to document, and abides by the principle of least astonishment.

matrss commented 1 year ago

In a typical git-annex project you would have a number of "annex'ed" files. These files are simply symlinks tracked by git, which point to somewhere in .git/annex/objects. When freshly cloning such a repository these symlinks will be "broken", which just means that the actual data is not there yet. git-annex then provides a command git annex get <path> which can retrieve the specified file, meaning it will appear in .git/annex/objects and the link is no longer broken. This makes it pretty nice to manage data projects with many GBs of data.
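As an illustration of the "broken" state, the following Python sketch tells an annexed placeholder apart from other paths and reports whether its content has been retrieved. The helper name `annex_status` is made up for this example, and the check is a heuristic based only on the layout described above (a link whose target text contains .git/annex/objects/). It is not how git-annex itself decides.

```python
import os


def annex_status(path):
    """Return None for non-annexed paths, else "present" or "missing".

    Hypothetical helper; relies on the symlink text only.
    """
    if not os.path.islink(path):
        return None
    # Read the link text; this works even when the link is broken.
    target = os.readlink(path).replace(os.sep, "/")
    if ".git/annex/objects/" not in target:
        return None
    # os.path.exists follows the link, so it is False for a broken link.
    return "present" if os.path.exists(path) else "missing"
```

On a fresh clone, such a check would report "missing" for every annexed file until its content has been fetched with git annex get.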

Since the symlinks are simply placeholders for the actual data, the data should definitely be considered part of the project. But because the symlinks might be "broken", I don't think we should ~resolve them~ read their target file at all.

In the terminology of the REUSE spec I think we should consider a symlink to another file under the project root (which is not ignored by the VCS) to be the same "Covered File" as its target. Therefore we can ignore this symlink, since its target will be lint'ed. This would keep the behaviour expected in #202. But if the symlink points outside of the project (e.g. into .git or to an ignored file, or really outside the project root) we should consider the symlink itself to be a "Covered File". In that case we have to provide a *.license file next to the symlink or specify license information in .reuse/dep5.
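A small Python sketch of what "license information next to the symlink" could look like in practice: it reads the `<name>.license` sidecar and never touches the link target. The function name `sidecar_license_ids` is hypothetical, only the SPDX-License-Identifier tag is parsed, and .reuse/dep5 is not consulted, so this is a simplification and not what reuse lint does.

```python
import re
from pathlib import Path


def sidecar_license_ids(path):
    """Return the SPDX license identifiers declared in `<path>.license`.

    Hypothetical helper. The path itself is never opened, so it may be
    a broken symlink.
    """
    sidecar = Path(str(path) + ".license")
    if not sidecar.is_file():
        return []
    text = sidecar.read_text(encoding="utf-8")
    # Collect the rest of each line after the SPDX-License-Identifier tag.
    found = re.findall(r"SPDX-License-Identifier:\s*(.+)", text)
    return [identifier.strip() for identifier in found]
```

An empty result for a type 2 symlink would then be the case that a linter should flag.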

Not resolving symlinks has two advantages:

  1. The above-mentioned case of "broken" symlinks will work without any issues, since we never actually follow them.
  2. The license and copyright information will actually be tracked inside the project. If we followed the symlink and read from a file header, someone else receiving the project directory would still have no idea about the licensing of the files that the symlink refers to, until they receive the linked files via some other means.

From a high-level point of view, the linked-to files aren't really outside of the project. In the case of git-annex they are simply distributed in a more manageable/efficient way, but still are part of the repository. A symlink might also point to a shared location (maybe a network drive), and linking instead of copying is simply a storage optimization.

I don't think we should add a flag for this. Making the result of reuse lint depend on a flag implies different interpretations of the REUSE Specification on what is a Covered File. It would bring ambiguity to what is considered "reuse compliant" and what is not.