Extract detailed copyright-holder info from git history

hoijui commented 1 year ago

This is meant to be a discussion.

In my work (and free time), I help a lot of Open Source (Hardware) projects arrive at a clean repo. One important part of that is making their projects REUSE compliant. In 99% of cases, they have no licensing or copyright info except a single LICENSE file in the repo. As you can imagine, the project owners will not add all this info by hand, and for me it is, frankly, too repetitive and boring at this point. I would like to have a tool that could get me 95% of the way there automatically, and I know this is possible in most of these projects purely from the info available in the git history. I am ready to write such a tool if none exists yet, but I would first like some feedback on the idea (whether it makes sense at all) and, if it does, on how to go about it. I personally prefer Rust, but if need be I could also give it a try in Bash or, if really need be, in Python. ;-)

Would it maybe make sense to integrate this into reuse init, or to have a new sub-command for it?
Could/should we abstract the SCM system, so we can also support SVN, Mercurial and co. in the future?

mxmehl commented 1 year ago

First of all, thanks for your contributions and help to make other projects REUSE compliant! I know the pain of trying to reconstruct a project's history.

We have discussed this a number of times already, but felt that including this kind of functionality bears too many risks. The Git history is a great indicator of possible copyright holders, but quite often it is misleading: not everyone who touched a file is necessarily a copyright holder. The copyright could belong to the company they work at, for example, or they could have just renamed/moved the file. We don't want to provide built-in functions that carry this kind of ambiguity, false positives, and the burden of tweaking and tuning heuristics. It's just not in the tool's scope.

Therefore, we documented short snippets on how to search the Git history for author information. This could be a first step for semi-automatic reuse annotate commands, or to create a file such as AUTHORS or CONTRIBUTORS, if that's the project's policy.

Of course there may also be specialised tools to help with this, support different SCM systems, and perhaps even offer a nice workflow to review proposed changes to the files.

hoijui commented 1 year ago

Thanks @mxmehl! I had already seen the snippets.. good thing, thanks! :-) I definitely want more though, and if you say including more than that in the tool is not wanted/too risky, I will make a separate tool. When you mentioned reuse annotate later on though, I got a bit confused about whether you might be willing to include it in the tool anyway. Can you please clarify that?

mxmehl commented 1 year ago

With this I meant that an individual could probably easily hack together a script that a) goes through all files, b) runs the Git author extraction script(s), and c) uses reuse annotate to fill in the information. But, as said, a lot can go wrong between step b) and c) which is why we don't want to include this functionality directly. Each value passed to reuse annotate should be checked by a human.
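The a)/b)/c) steps above could be sketched roughly as follows. This is only a minimal illustration, not part of reuse-tool: the function names are made up for this example, the `git log` format string (`%aN|%ad` with `--date=short`) is one possible choice, and only the first commit year per author is used. The sketch only prints proposed `reuse annotate` command lines and never runs them, so a human can review each value first.

```python
import shlex
import subprocess
from collections import defaultdict


def git_log_lines(path):
    """Return 'Author Name|YYYY-MM-DD' lines for every commit touching path."""
    result = subprocess.run(
        ["git", "log", "--format=%aN|%ad", "--date=short", "--", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def author_years(log_lines):
    """Map each author to the (first, last) year in which they committed."""
    years = defaultdict(list)
    for line in log_lines:
        name, _, date = line.rpartition("|")
        if name and len(date) >= 4:
            years[name].append(int(date[:4]))
    return {name: (min(ys), max(ys)) for name, ys in years.items()}


def proposed_commands(path, log_lines):
    """Build one `reuse annotate` command line per author, for human review."""
    commands = []
    for name, (first, _last) in sorted(author_years(log_lines).items()):
        args = ["reuse", "annotate", "--copyright", name,
                "--year", str(first), path]
        commands.append(shlex.join(args))
    return commands


def print_proposals():
    """Steps a) to c): walk all tracked files and print proposals only."""
    files = subprocess.run(
        ["git", "ls-files"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    for path in files:
        for command in proposed_commands(path, git_log_lines(path)):
            print(command)  # printed only; nothing is executed
```

Calling `print_proposals()` from the root of a Git checkout lists the candidate commands. Whether an author is really the copyright holder (rather than, say, their employer) is exactly the part that still needs a manual check.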

A separate tool can surely assist with this process and the manual checks, and I'd hope that it will warn and inform its users thoroughly :)

hoijui commented 1 year ago

Thank you Max! :-) I guess you meant reuse addheader.

I have already written a script for that by now. Where would it make sense to publish/attach it? Maybe in the sample scripts section?

I understand and wholeheartedly agree, of course, that a human has to review everything a script might generate, but not having a script that generates as much info as possible is not a solution. In practice, what you see is a project with 300 files without license info. Realistically speaking, nobody will add all authors and licensing info manually for each file. If you supply a script that extracts as much info as possible from git, people will use that, and then hopefully check manually as much as possible/needed. If such a script does not exist, they will simply add the same license and author(s) to all files at once, or at least to big chunks/sub-dirs of the repo, and then be done with it. This is what I see with the projects I am helping (must be a hundred or so by now), and also with myself.

-> Not having this script leads to less accurate info, not to more accurate info, checked manually by humans.

hoijui commented 1 year ago

I would also not know how I, as a human, would get more accurate author info for individual files than from the git history. The license info is another topic, of course. In my script, I ask the user for a list of regexes in combination with SPDX license IDs, very similar to .reuse/dep5.
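To illustrate the regex-plus-SPDX-ID idea, here is a minimal sketch. It is not the script mentioned above; the rule list, the function name `license_for` and the example paths are hypothetical, and "first matching rule wins" is an assumption about how such rules might be resolved.

```python
import re

# Hypothetical user-supplied rules: (path regex, SPDX license identifier).
# The first matching rule wins, so order them from specific to general.
LICENSE_RULES = [
    (r"^docs/.*\.md$", "CC-BY-SA-4.0"),
    (r"^hardware/", "CERN-OHL-S-2.0"),
    (r".*", "GPL-3.0-or-later"),
]


def license_for(path, rules=LICENSE_RULES):
    """Return the SPDX ID of the first rule whose regex matches path."""
    for pattern, spdx_id in rules:
        if re.search(pattern, path):
            return spdx_id
    return None  # no rule matched; leave the decision to a human
```

With these example rules, `license_for("hardware/board.kicad_pcb")` picks the hardware license, while any path not covered by an earlier rule falls through to the catch-all entry.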

mxmehl commented 1 year ago

reuse addheader has been renamed to reuse annotate, IIRC with v1.0. addheader is still an alias but will probably be deprecated in the future.

I think it would make sense to provide your script/tool in a completely separate repo. We might reference it in the scripts documentation or elsewhere.

It's absolutely helpful to scan the Git history for authors, and obviously also to ask project maintainers for their review if the changes are made by a non-maintainer like you. I think such a tool could do the following things:

Scan the Git history for authors
Provide options for merging multiple identities of an author (e.g. the GitHub email aliases). However, this should also be done carefully, e.g. for persons who contribute both in their private time and on time paid for by their employer.
Look for existing copyright/license notices (reuse-tool can help with that, but tools like scancode are more advanced)
Inform user about the findings
Offer to run reuse annotate on the files. Also notice that there are a lot of options to this function, e.g. --no-replace or --merge-copyrights.
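The identity-merging step in the list above could look something like the following sketch. It assumes a hand-maintained alias table (all names and addresses here are invented) and deliberately merges nothing automatically, so that private and employer-paid contributions stay separate unless someone decides otherwise.

```python
from collections import defaultdict

# Hypothetical, hand-maintained alias table: commit email -> copyright holder.
ALIASES = {
    "12345+jdoe@users.noreply.github.com": "Jane Doe <jane@example.org>",
    "jane@example.org": "Jane Doe <jane@example.org>",
    # Deliberately not merged with the private identity: work done on
    # employer time may belong to the employer.
    "jane.doe@example.com": "Example Corp",
}


def merge_identities(commits, aliases=ALIASES):
    """Group commit years by holder.

    commits: iterable of (name, email, year) tuples.
    Identities without an alias entry are kept as 'Name <email>'.
    """
    merged = defaultdict(set)
    for name, email, year in commits:
        holder = aliases.get(email.lower(), f"{name} <{email}>")
        merged[holder].add(year)
    return {holder: sorted(years) for holder, years in merged.items()}
```

Git's own .mailmap file covers part of this (the `%aN` and `%aE` log placeholders respect it), so a real tool might build on that rather than on a separate table.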

Some things that probably cannot be automated:

If you would like to check whether files/snippets are from third parties, e.g. other projects or even Stack Overflow, tools like ScanOSS may be helpful
If a file is renamed, its Git history is often lost. So git log of such a file does not show authors of edits that happened before the rename. Perhaps there are ways to detect the original file, and get the logs from this file?
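On the rename question, one possible approach is `git log --follow`, which continues listing a file's history beyond renames. The sketch below assumes that is good enough; the function names are made up for this example, `--follow` only works for a single path at a time, and Git's rename detection is a similarity heuristic, so heavily edited renames may still be missed.

```python
import subprocess


def follow_log_command(path):
    """Argument list for `git log` that keeps following path across renames."""
    return ["git", "log", "--follow", "--format=%aN <%aE>", "--", path]


def authors_across_renames(path, run=subprocess.run):
    """Unique authors of path, including commits made before a rename.

    `run` is injectable so the parsing can be exercised without a repository.
    """
    result = run(follow_log_command(path),
                 capture_output=True, text=True, check=True)
    seen = []
    for line in result.stdout.splitlines():
        if line and line not in seen:
            seen.append(line)  # keep first-seen order, newest commit first
    return seen
```

Comparing this output with a plain `git log -- <path>` for the same file shows which authors would otherwise be lost to the rename.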

fsfe / reuse-tool

Extract detailed copyright-holder info from git history #700