IP Clearance - Githubissues

justinmclean commented 4 years ago

All files developed at the ASF need to have an ASF header [1]; third-party headers, for the most part, need to be retained [2].

btashton commented 4 years ago

@justinmclean Can you help me understand our requirements here a little bit more with a couple examples:

1. https://github.com/apache/incubator-nuttx/blob/master/arch/risc-v/include/arch.h It would seem that this needs to keep the BSD header until Ken relicenses it under Apache, and we need to call this file out in the LICENSE file as BSD-3. It would not need to be called out in the NOTICE file.

2. https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/arm/arm.h We can put the Apache header on this one, and do not need to make any additions to the NOTICE or LICENSE files beyond the boilerplate Apache text. This is because Greg has agreed to relicense this code.

3. https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/imxrt/hardware/rt102x/imxrt102x_ccm.h We can put the Apache header on this one, and do not need to make any additions to the NOTICE or LICENSE files beyond the boilerplate Apache text. This is because Greg has agreed to relicense this code, and while there are other authors listed, he is the sole copyright holder listed.

4. https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/imxrt/imxrt_lcd.c It would seem that this needs to keep the BSD header unless NXP is willing to relicense it under Apache, even though portions are copyrighted by Greg, and we need to call this file out in the LICENSE file as BSD-3. It would not need to be called out in the NOTICE file.

General question: when do we need to add the "Based on source code originally developed by" text to the NOTICE file? In a couple of the files coming from FreeBSD I see entries like:

Portions of this software were developed by David Chisnall

under sponsorship from the FreeBSD Foundation.

I know we have other files with other license or cases to go through, but this should cover the vast majority and can get us moving in the right direction.

btashton commented 4 years ago

@justinmclean any thoughts on these examples? I'm trying to be 100% sure I understand what we need to do here to move this forward in a meaningful way.

justinmclean commented 4 years ago

1. Correct.
2. Correct.
3. It would depend on the history of the file and the changes made. In general, unless the changes are significant, the original license and header should be kept.
4. This would need to be discussed. In general, third-party headers should not be changed without permission. It looks like they have been changed here, and it would be best to revert to the original header.

justinmclean commented 4 years ago

Note that with a WIP disclaimer none of this actually blocks a release.

adamfeuer commented 4 years ago

License clearing wiki page (with draft process and tools): https://cwiki.apache.org/confluence/display/NUTTX/License+Clearing

This was used in release 9.0.0 and 9.1.0.

xiaoxiang781216 commented 4 years ago

@adamfeuer do you have enough free time to collect the statistics information? My team leader has reserved a dedicated resource to help you improve the tools and generate the report: @PeterBee97.

adamfeuer commented 4 years ago

Thanks @xiaoxiang781216 – I should have enough time to do a high-level analysis this week or next, and I could definitely use the help!

@PeterBee97 are you able to help me do this? If so, reply here or send me an email (it's on my profile), and we'll work out what to do. 🙂

PeterBee97 commented 4 years ago

@adamfeuer Hi Adam, sure I'm here to help. BTW I spent some time yesterday on a script that doesn't modify anything yet but only tries to extract information. Hope this helps :)

adamfeuer commented 4 years ago

@PeterBee97 Great work with the script and database! I'll update my tools branch and post it here– would you be willing to do a PR to that, so we can have a single branch that we're working on? I'm hoping we can merge these tools to master so that others can help us or continue our work.

Here's a few questions:

1. Are you subscribed to the dev@nuttx.apache.org email list? If not, would you be willing to subscribe?
2. What's your email address? Will you either post it here or send me an email at adam@adamfeuer.com, so we can correspond with the NuttX email list if necessary?
3. What time zone are you in? I am in Seattle WA USA, Pacific Time Zone, UTC-7.
4. Have you seen the NuttX license clearing wiki page? The process we need to follow and improve is there, as well as a few tools.
5. The authors in the file are good to have, but not enough to clear the licenses– we need to look at the git log and get authors from that. There's a script on the wiki page above that can do that.
6. Would you be willing to make the script you wrote also emit a plain text file, ideally tab-delimited CSV?

adamfeuer commented 4 years ago

@PeterBee97 I updated my license-clearing tools branch to upstream/master, here's where I've put my tools: https://github.com/starcat-io/incubator-nuttx/tree/feature/license-clearing-tools/tools/license-clearing

adamfeuer commented 4 years ago

@PeterBee97 Let's try running the process that we did on the sched/ module on either fs/ or mm/– only the estimation part, not the whole clearing process. They have 100-250 files each, so it's a smaller chunk. We need git authors as well as what is in the file headers. Once we have a way to get stats for that module and all files, then we can try to do it for the whole project.

You can see what we did on sched/ at this wiki subpage: https://cwiki.apache.org/confluence/display/NUTTX/Analysis+March+2020

PeterBee97 commented 4 years ago

@PeterBee97 Great work with the script and database! I'll update my tools branch and post it here– would you be willing to do a PR to that, so we can have a single branch that we're working on? I'm hoping we can merge these tools to master so that others can help us or continue our work.

Here's a few questions:

Are you subscribed to the dev@nuttx.apache.org email list? If not, would you be willing to subscribe?

What's your email address? Will you either post it here or send me an email at adam@adamfeuer.com, so we can correspond with the NuttX email list if necessary?

What time zone are you in? I am in Seattle WA USA, Pacific Time Zone, UTC-7.

Have you seen the NuttX license clearing wiki page? The process we need to follow and improve is there, as well as a few tools.

The authors in the file are good to have, but not enough to clear the licenses– we need to look at the git log and get authors from that. There's a script on the wiki page above that can do that.

Would you be willing to make the script you wrote also emit a plain text file, ideally tab delimited CSV?

1. Not yet, but sure, I'm willing to subscribe.
2. bijunda1@xiaomi.com
3. I'm in Beijing, UTC+8, so my work time will be about 7 pm to 7 am in your timezone :(
4. Yes, I browsed through the docs and mailing lists before making that tool.
5. Yeah, actually my tool is based on your script. The author0~author2 are from git log.
6. Sure, exporting to a CSV file is just one command in sqlite.
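
A minimal sketch of such an export, written in Python rather than as the sqlite one-liner mentioned above. The database path, the table name and the output path are all hypothetical parameters; the thread does not say how the tool's database is laid out.

```python
import csv
import sqlite3


def export_table_tsv(db_path, table, out_path):
    """Write every row of `table` to a tab-delimited text file with a header row.

    `db_path`, `table` and `out_path` are illustrative; the real schema of the
    license-clearing database is not described in this thread.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"{}"'.format(table.replace('"', '""'))
        cursor = conn.execute("SELECT * FROM " + quoted)
        header = [column[0] for column in cursor.description]
        with open(out_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerow(header)
            writer.writerows(cursor)
    finally:
        conn.close()
```

Tab-delimited output avoids most quoting problems with file paths and author names that contain commas.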

@PeterBee97 Let's try running the process that we did on the sched/ module on either fs/ or mm/– only the estimation part, not the whole clearing process. They have 100-250 files each, so it's a smaller chunk. We need git authors as well as what is in the file headers. Once we have a way to get stats for that module and all files, then we can try to do it for the whole project.

You can see what we did on sched/ at this wiki subpage: https://cwiki.apache.org/confluence/display/NUTTX/Analysis+March+2020

By typing sched/ in the DB Browser filter, I can see that these files either have the Apache license already or only list Greg or Xiaomi & Pinecone as copyright holders, who should have already approved the license change.

The csv files are uploaded && PR created. https://github.com/PeterBee97/incubator-nuttx/tree/feature/license-clearing-tools/tools/license-clearing

adamfeuer commented 4 years ago

@PeterBee97 Cool, thanks– I didn't realize the script already used git to find the authors, sorry for missing that. We will need all the authors, not just the top 3. I'll take a closer look tomorrow.

Re: Xiaomi and Pinecone already approving the license change, do you know if they have filed an Apache Software Grant Agreement (SGA)?

Would you be willing to run your tool on fs and mm directories, and see if you can extract a report of the authors for each section and file? That way we can see if we're dealing with 10 authors, 100 authors, etc.

I think another next step is to get you an account on the NuttX Fossology instance. At some point we'll need to get the data into there. I'll email Brennan and you on the list.

Thanks again for being willing to help with this!

PeterBee97 commented 4 years ago

@PeterBee97 Cool, thanks– I didn't realize the script already used git to find the authors, sorry for missing that. We will need all the authors, not just the top 3. I'll take a closer look tomorrow.

Re: Xiaomi and Pinecone already approving the license change, do you know if they have filed an Apache Software Grant Agreement (SGA)?

Would you be willing to run your tool on fs and mm directories, and see if you can extract a report of the authors for each section and file? That way we can see if we're dealing with 10 authors, 100 authors, etc.

I think another next step is to get you an account on the NuttX Fossology instance. At some point we'll need to get the data into there. I'll email Brennan and you on the list.

Thanks again for being willing to help with this!

Top 3 was my idea, given that some one-commit contributors can be ignored (can't they?). As for the license issue, I don't know the exact details; @xiaoxiang781216 knows better. I already ran the tool on the whole project, so those two directories can just be filtered. I'll try to get a report for particular files. You're welcome :)

patacongo commented 4 years ago

@PeterBee97 Cool, thanks– I didn't realize the script already used git to find the authors, sorry for missing that. We will need all the authors, not just the top 3. I'll take a closer look tomorrow.

I mentioned this before, but it bears repeating. The NuttX project was 13 years old in February of 2020. For the first 6 to 6 and a half years, the project used CVS and SVN. You will find no authorship or contact information for the first half of the project's life in the current GIT authors. The log will show me as the sole author during that time.

I did by far most of the changes in those days, but not all. Prior to GIT, contributors were noted only in commit comments. It should be possible to get the names, or in most cases just user handles, from the comments but with no contact information.

Github apparently does not even know how to parse that early activity. If you look at https://github.com/apache/incubator-nuttx/graphs/contributors you would conclude that the project has only existed since sometime in 2013. The project was actually created in February of 2007. This is clearer in the Bitbucket statistics[1]: https://bitbucket.org/nuttx/nuttx/addon/bitbucket-graphs/graphs-repo-page#!graph=contributors&uuid=4430abf9-a782-49ff-bd16-bc1df696048e&type=c&group=weeks which goes all the way back to the day the project was created.

I think that is because prior to GIT, authors were NOT referenced by email address, but rather with some UUID.

[1]Note you have to be logged into Bitbucket to see the statistics there.

adamfeuer commented 4 years ago

@patacongo Are the original CVS and SVN archives saved anywhere?

patacongo commented 4 years ago

@patacongo Are the original CVS and SVN archives saved anywhere?

No

adamfeuer commented 4 years ago

@patacongo Ok. I'll see if I can look through the commit message to see if I can see what's going on there.

I'm logged in to Bitbucket, but for some reason I can't view the graph link you posted. Maybe it's a permissions issue or I don't have access to the graphs addon?

xiaoxiang781216 commented 4 years ago

@PeterBee97 can we add a column in the database to indicate whether the source code existed before git was used? @patacongo, we need to gather the statistics first, convert the unambiguous part of the code base automatically (of course we need to review the PR carefully), and then work on the rest case by case; otherwise NuttX can never become a top-level project.

adamfeuer commented 4 years ago

@xiaoxiang781216 @patacongo @PeterBee97 I cloned the Bitbucket repo last night (https://bitbucket.org/nuttx/nuttx/src/master/), looked through the commit logs, and I can see what @patacongo is talking about. I didn't compare to the github log, but we should probably also do that. Then we can see if we can do anything with the information there.

It seems like we should be able to come up with a strategy for dealing with this:

If we can get names and contact info from the commit messages, then we can run the license clearing process we already have, maybe with some additional steps about that process.
At the very least, we can collect statistics about how many contributors we are talking about.
If we can't get names and contact info from the commit messages, then we need to get help to address what @xiaoxiang781216 is talking about, so NuttX can graduate from podling status. Surely other Apache projects have faced this same issue.

Let me know if you have other thoughts about this.

@PeterBee97 Will you clone the Bitbucket repo and look at the logs to see if you have some insight about it?

patacongo commented 4 years ago

This is also informative:

git log | grep "^Author:"

This will produce over 30 thousand lines, but you can clearly see that the last several thousand commits have author:

patacongo <patacongo@42af7a65-404d-4744-a932-0658087f49c3>

That, I think, is a bogus email that was created when the SVN repository was converted to GIT.

Then there are several thousand with author:

Gregory Nutt <gnutt@nuttx.org>

That is GIT, but when I was still using GIT as though it were SVN with no authors.

The first author that is not me appears at:

commit b0507038494cd1ae9d14807db758d4e3ae98a1ef
Author: jeditekunum <jeditekunum@gmail.com>
Date:   Sat Jan 24 14:31:35 2015 -0600

First step at porting to MoteinoMEGA.  LED shows assert failure at boot.  Appears to be short double blink, short off (~1sec), followed by 250ms toggle cycles.  Most of it derived from amber board.

So it appears that there is no authorship information for the first 8 years, only for the last 5 years.
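
A minimal sketch of how that split could be measured, assuming the log is first dumped with `git log --format='%an|%ae'` (one commit per line). The function name and the UUID-based test for SVN-converted identities are illustrative assumptions, based only on the bogus address shown above.

```python
import re
from collections import Counter

# Assumption: identities created by the SVN conversion use a repository UUID
# in place of a mail domain, like the patacongo address quoted above.
UUID_DOMAIN = re.compile(
    r"@[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def summarize_authors(log_text):
    """Count commits per (name, email) and flag SVN-converted identities.

    `log_text` is assumed to be the output of `git log --format='%an|%ae'`.
    """
    counts = Counter()
    for line in log_text.splitlines():
        if "|" not in line:
            continue
        name, email = line.rsplit("|", 1)
        counts[(name.strip(), email.strip())] += 1
    return [
        {
            "name": name,
            "email": email,
            "commits": commits,
            "svn_import": bool(UUID_DOMAIN.search(email)),
        }
        for (name, email), commits in counts.most_common()
    ]
```

Commits flagged `svn_import` carry no usable authorship, so any attribution for them has to come from the commit message or the file header.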

adamfeuer commented 4 years ago

@patacongo @PeterBee97 If I do git log --reverse and search for ' by ', I find commits like this:

commit f03cb0ff3ababdcc84245d75d795ab956d110e09
Author: patacongo <patacongo@42af7a65-404d-4744-a932-0658087f49c3>
Date:   Tue Mar 16 00:53:32 2010 +0000

    Bugfixes submitted by David Hewson

    git-svn-id: svn://svn.code.sf.net/p/nuttx/code/trunk@2543 42af7a65-404d-4744-a932-0658087f49c3

There are others. They seem to indicate patches or other code from contributors, committed by Greg.
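
A rough sketch of pulling such credits out of commit messages. The regular expression is an assumption about the wording, not a rule taken from the repository: it looks for "by" or "from" followed by capitalized words, so lowercase handles are missed and company or board names will slip through and need manual filtering.

```python
import re

# Assumption: credits look like "... by David Hewson" or "... from Some Name".
CREDIT = re.compile(r"\b(?:by|from)\s+([A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*)*)")


def extract_credits(message):
    """Return the capitalized names that follow 'by' or 'from' in a message."""
    names = []
    for match in CREDIT.finditer(message):
        # Drop sentence punctuation that the pattern may have swallowed.
        name = match.group(1).rstrip(".,")
        if name:
            names.append(name)
    return names
```

For the commit shown above, `extract_credits("Bugfixes submitted by David Hewson")` yields `["David Hewson"]`.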

adamfeuer commented 4 years ago

@patacongo Thanks for pointing this out again, I am sorry I didn't remember this.

patacongo commented 4 years ago

Bugfixes submitted by David Hewson

David Hewson I know. We are connected on LinkedIn. He just started working for HPE. He did some of the LPC31 port in the 2010 timeframe but has not been involved significantly since.

patacongo commented 4 years ago

If I do git log --reverse and search for ' by ', I find commits like this

"by" or "from" would both be good search keys. I also recorded the authors in the old ChangeLog files that were recently removed from the repositories because they are not used in the current workflow. That should be a complete list of authors except for a few trivial things like typo fixes that weren't normally included in the ChangeLog.

PeterBee97 commented 4 years ago

@PeterBee97 Will you clone the Bitbucket repo and look at the logs to see if you have some insight about it?

I cloned the Bitbucket repo today but the git log seems to be the same as the one on GitHub...

So I found the latest ChangeLog from NuttX 9.0.0 RC0 and tried to filter out the names with keywords from|by and the help of some NLP library and put the results in names-changelog.txt. Also processed the git log in the same way and the result is names-gitlog.txt. Still the commit messages of earlier SVN commits are incomplete and many commits are authorless.

This may help cover some corner cases. Maybe we can open an issue and mention these users? But before that let's filter out the "safe" files first as @xiaoxiang781216 suggests.

adamfeuer commented 4 years ago

@PeterBee97 That's great! Less than 450 names in each file. The next steps are probably:

remove all the non-human names (Atmel, CONFIG_SDIO_PREFLIGHT, etc.)
remove all the names of committers (they have ICLAs) - I manually made a list of committers
remove duplicates (may need to be done manually since there are typos in the names)
merge the lists

Once this is done, it will give us a sense of how many people there are. Ideally we'd have a list of commits for each name, and only for the top N contributors... not sure what N should be, but looking at the data should tell us. Do you have an idea how to get a list of commits per name?
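
The cleanup steps listed above could be sketched roughly as follows. The function names and the similarity cutoff are illustrative assumptions; near-duplicates are only flagged for manual review, not merged automatically.

```python
import difflib


def normalize(name):
    """Collapse whitespace and case so trivially different spellings compare equal."""
    return " ".join(name.split()).casefold()


def merge_name_lists(name_lists, committers, non_human):
    """Merge several name lists, dropping committers, non-human names and exact duplicates."""
    excluded = {normalize(n) for n in committers} | {normalize(n) for n in non_human}
    merged = {}
    for names in name_lists:
        for name in names:
            key = normalize(name)
            if key and key not in excluded:
                # Keep the first spelling seen for each normalized key.
                merged.setdefault(key, " ".join(name.split()))
    return sorted(merged.values(), key=str.casefold)


def likely_typos(names, cutoff=0.85):
    """Pair up names that are similar enough to deserve a manual look."""
    pairs = []
    for index, name in enumerate(names):
        candidates = names[index + 1:]
        for other in difflib.get_close_matches(name, candidates, n=3, cutoff=cutoff):
            pairs.append((name, other))
    return pairs
```

The cutoff of 0.85 is a guess; it would need tuning against the real name files before trusting the flagged pairs.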

patacongo commented 4 years ago

I cloned the Bitbucket repo today but the git log seems to be the same as the one on GitHub...

Yes, the Bitbucket repositories are read-only mirrors of the incubator repositories.

patacongo commented 4 years ago

* I manually made a [list of committers](https://github.com/starcat-io/incubator-nuttx/blob/feature/license-clearing-tools/tools/license-clearing/committers.txt)

A large number of people do not use their names on PRs or commits, but rather some username/handle. A few of these I know. For example, v01d is Matias Nitshe, raidenpl is Mateusz Szafoni. Both Matias and Mateusz are Committers. But there are many more that I don't know.

adamfeuer commented 4 years ago

@patacongo Yes– we should find a way to update the committer list and the contributor list with handles... I'll think of some ways to do that...

PeterBee97 commented 4 years ago

@PeterBee97 That's great! Less than 450 names in each file. The next steps are probably:

remove all the non-human-names (Atmel, CONFIG_SDIO_PREFLIGHT, etc.)

remove all the names of committers (they have ICLAs) - I manually made a list of committers

remove duplicates (may need to be done manually since there are typos in the names)

merge the lists

Once this is done, it will give us a scope of how many people there are. Ideally we'd have a list of commits for each name, and only to the top N contributors... not sure what N should be, but looking at the data should tell us. Do you have an idea how to get a list of commits per name?

I used this script to get the list from git log and earlier commits by @patacongo :

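# Commits recorded under the patacongo author name (this includes the SVN-era history)
git log --no-merges --author=patacongo --pretty=format:"%h %s" > gp.txt
# For each name in ng2.txt, collect the commits whose subject line mentions it
xargs -I pp grep "pp" gp.txt < ng2.txt > commits-patacongo.txt
./name-commits.sh ng2.txt name-commits.txt commits-patacongo.txt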

Result (I didn't exclude the listed committers yet): https://github.com/PeterBee97/authors-tool/blob/master/name-commits-full.txt The names with no commits may be issue reporters' names, or names of committers who only contributed to the apps repo (I only ran the above commands in the nuttx repo). Also, some names are mentioned in the ChangeLog, but sadly there's no commit authored by or mentioning them.

Apache9 commented 4 years ago

Any updates here? I think this is the only blocker issue preventing us from graduating, so let's try to make progress.

Thanks.

adamfeuer commented 4 years ago

@Apache9 No progress since the last update, I've been busy with other things. I'll merge @PeterBee97's code today. Next we should generate a list of people and the total lines of code for each person. Then we could sort in reverse order and decide how many people we need to try to contact.

@PeterBee97 Can you help with this? Can you find out how many lines of code were in each commit, tie them to a person in our list, create a list that combines all lines of code for each person, and create a CSV sorted in reverse order by total lines of code contributed?
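
One possible shape for that report, sketched under the assumption that the input comes from `git log --numstat --format='@%an'`, where each commit starts with an `@`-prefixed author line followed by `added<TAB>deleted<TAB>path` lines. It only credits the git author; contributors named only in commit messages would still have to be mapped by hand or by a separate step.

```python
import csv
import io
from collections import Counter


def lines_added_per_author(numstat_text):
    """Sum added lines per git author, largest total first."""
    totals = Counter()
    author = None
    for line in numstat_text.splitlines():
        if line.startswith("@"):
            author = line[1:].strip()
            continue
        parts = line.split("\t")
        if author is None or len(parts) != 3:
            continue
        added = parts[0]
        if added.isdigit():  # binary files are reported as "-"
            totals[author] += int(added)
    return totals.most_common()


def to_csv(rows):
    """Render (author, lines_added) rows as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["author", "lines_added"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Counting added lines overstates the contribution of people who copied existing files into new locations, so the totals are a rough ranking rather than a measure of authorship.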

xiaoxiang781216 commented 4 years ago

@adamfeuer how about we convert the source code that satisfies:

1. The first commit comes from git, not svn or cvs.
2. The copyright owner in the source code has already signed an SGA or ICLA.
3. All contributors in the git log have already signed an SGA or ICLA.

adamfeuer commented 4 years ago

@xiaoxiang781216 That would be a good first step for the conversion process. But as discussed on the mailing list, I thought we first wanted to do a rough total estimate of the entire project?

If we want to do both in parallel, then I think your idea will be a good start. We would need:

list of all contributors who have signed SGA or ICLA - right now we only have committers who I presume have signed ICLAs. I don't know how to get the complete list, do you?
list of all files for which
- only ICLA committers are in the git log
- first commit is not from svn or cvs
- file's author headers match git author or author listed in git commit message
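
The three file criteria above can be written down as a small predicate. This is only a sketch: the `FileRecord` fields are hypothetical and would have to be filled in from the git log, the commit messages and the file headers by other tooling.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Per-file facts gathered elsewhere; all field names are illustrative."""
    path: str
    git_authors: set
    header_authors: set
    credited_in_messages: set = field(default_factory=set)
    first_commit_from_svn: bool = False


def is_easy_case(record, cla_signers):
    """True when a file meets all three criteria from the list above."""
    # The first commit must not come from the svn/cvs era.
    if record.first_commit_from_svn:
        return False
    # Only people with a signed ICLA/SGA may appear in the git log.
    if not record.git_authors <= cla_signers:
        return False
    # Every header author must match a git author or a commit-message credit.
    known = record.git_authors | record.credited_in_messages
    return record.header_authors <= known
```

A header author who is matched only through a commit-message credit may still need a separate CLA check; the list above does not say either way.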

patacongo commented 4 years ago

My recollection is not 100% clear, but I am recalling that @justinmclean mentioned in very early phases of this project that there were some legacy changes that could be just grandfathered in without following the full IP clearance process. I understood that this was necessary for other large, established Incubator projects as well.

If my understanding is correct, then I propose that we get permission to "cut some corners" on the pre-GIT changes that have no author associated with the individual commits. In most cases, the author of those early changes will be noted as an author or copyright holder in the BSD license header. In fact, I think that is true of all significant early code contributions. I would propose that we only use the GIT author changes for any automated analysis.

pre-GIT means pre-2014 so we are referring to very old changes.

Resolution of any remaining issues in the license headers will have to be a largely manual process anyway. We will have to examine each BSD license header and resolve all authors and copyright claims anyway. This should include all of the significant, pre-GIT changes. So I think with my suggestion here, the job can be made doable and there will be no loss of authorship on any significant contributions.

patacongo commented 4 years ago

In most cases, the author of those early changes will be noted as an author or copyright holder in the BSD license header. In fact, I think that is true of all significant early code contributions.

I can think of one very frequent case where this is not true. In many cases, people clone files from one location to another. This is particularly true under arch/ and boards/. You will discover many files that I wrote, that have me as the copyright holder and author, but GIT will claim, incorrectly, that the person doing the PR/patch was the author. This will apply to several hundred files. There are cases where the info in the license header is more accurate than in the GIT log.

Third party code brought into the OS will have the same issue. The true author of the code is in the license header, not in the GIT log.

And there are places where people make mistakes in copying files without updating the license headers. For example, under net/ there are a few files that include some small bits of logic from Adam Dunkels. I see that those files with headers have been cloned numerous times and most are no longer correct. Adam Dunkels is not the author of any of the files under net/ (except perhaps some logic under net/sixlowpan and the TCP state machine and even those are very highly customized).

It is all very complex and we cannot expect to get it all 100% correct. I think we just have to keep a high level of integrity and do our best effort to discover and document all authorship.

I think the point is that GIT authors may not agree with the authorship in the license header and those will all need some clarification.

adamfeuer commented 4 years ago

@xiaoxiang781216 @patacongo I updated my comment above to include "file's author headers match git author or author listed in git commit message" – that handles the cases where things would match up easily.

Yes, there are a bunch of files that won't match up or are confusing... I think we just need to get a count of how many there are to see what it will take to track down the ones that matter.

patacongo commented 4 years ago

Any updates here? I think this is the only blocker issue preventing us from graduating, so let's try to make progress.

It seems to me that there are people who have interest and good ideas, but there is no significant progress being made. The job is really too large for a couple of people to accomplish working now and then.

protobits commented 4 years ago

Could we start with the easy cases? I feel that reducing the size of the problem also makes it less intimidating to approach. We are already manually changing headers from BSD to Apache for files whose authors are committers with ICLAs, so I think making an automated pass for this case should not be that hard: parse the header for authors, see if all are committers, replace with the Apache header. If that sounds right I can script that and give it a try.
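
A sketch of the first two steps of such a pass (parse the header, check the authors), leaving the actual header replacement out. It assumes header lines of the form ` *   Author: Name <email>`, which is a guess at the layout of the BSD headers; multi-line author lists and other variants are not handled.

```python
import re

# Assumption: authors appear on comment lines such as
#   " *   Author: Gregory Nutt <gnutt@nuttx.org>"
AUTHOR_LINE = re.compile(
    r"^\s*\*\s*Authors?:\s*(.+?)\s*(?:<[^>]*>)?\s*$", re.MULTILINE
)


def header_authors(header_text):
    """Return the author names found on 'Author:' lines of a file header."""
    return [match.group(1) for match in AUTHOR_LINE.finditer(header_text)]


def only_committers(header_text, committers):
    """True when the header names at least one author and all are committers."""
    authors = header_authors(header_text)
    return bool(authors) and all(author in committers for author in authors)
```

Files where `only_committers` returns False, including those with no recognizable author line, would fall through to manual review.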

What confuses me though is that we're worrying about git authors whereas I believe that if someone contributes a file without listing themselves as the authors in the header (for the BSD case), didn't the author concede rights over the code by doing so? At least that was my understanding at the time when I submitted patches to existing files and I did not include an extra line to add me as author to every affected file. In case this is not the correct assumption, I agree that a "best effort" approach (by comparing git author to authors on header) is the only remaining possibility.

justinmclean commented 4 years ago

Hi

What confuses me though is that we're worrying about git authors whereas I believe that if someone contributes a file without listing themselves as the authors in the header (for the BSD case), didn't the author concede rights over the code by doing so?

Without an ICLA (or an equivalent) this is not the case. Copyright automatically applies. They may not even own rights to the code they commit if their employment contract says otherwise.

Thanks, Justin

justinmclean commented 4 years ago

Hi,

BTW Apache doesn’t use author tags in any new code, doing so implies ownership by a person rather than the whole project.

Thanks, Justin

xiaoxiang781216 commented 4 years ago

So @justinmclean is it safe to do the batch conversion if the source code meets all of the following criteria?

1. The source code wasn't converted from SVN or CVS.
2. All committers (or their companies) in the git log have signed an ICLA or SGA.
3. The copyright holder in the source code has signed an ICLA or SGA.

I also have one question: do we need a contributor to sign an ICLA if he/she has modified only a small portion of code (e.g. ~10 lines)? Having a concrete threshold is also important for writing an automation tool.

justinmclean commented 4 years ago

Hi,

So @justinmclean is it safe we do the batch conversion if the source code meet all following criteria? 1. The source code isn't converted from SVN or CVS

I’m not sure what you mean by that.

2. All committers (or their companies) in git log sign ICLA or SGA

Small contributions don’t have to have a CLA, but the person who committed that contribution takes responsibility for ensuring the code’s IP. If possible it's best to have one.

3. The copyright holder in the source code sign ICLA or SGA

Take care with this. The copyright holder in source may or may not be the correct one.

Thanks, Justin

patacongo commented 4 years ago

Take care with this. The copyright holder in source may or may not be the correct one.

Similarly, the author in GIT may not be the author of the file. Often the copyright holder in the source file header is the correct one, even though that person may not appear in GIT history.

Many people copy files that I wrote into different locations (very often for new architectures and for new boards which are very similar to older architectures and boards). Very often, I am the author of the file in these cases.

Bottom line: There is no magic, automated way to correctly determine the author. It requires collecting data and then also applying human insight.

@justinmclean For many cases there are multiple contributors of changes to a file. There is an original author, the original committer (who might be a different person), and people who have made trivial changes (as trivial as a spelling fix) or who have made substantial enhancements or re-designs. Those who made trivial changes would not be treated as authors or copyright holders, but those who made substantial changes may be. Is there any rule of thumb for what constitutes a significant change warranting rights to the file? Or does this also require human insight?

There are thousands of files involved here. This is potentially multiple man years of effort. I don't see how we can ever accomplish this.

protobits commented 4 years ago

We can only operate on the information we have. If authorship information was lost from the CVS and SVN era (git author is Greg) and the header does not list anyone other than Greg, we can either "play safe" and leave the BSD header (we would be respecting the original author's license even if we don't know who it really was) or assume that, without further information, the original author cannot prove authorship either, in which case we are safe to change to Apache. For these "unknown" cases, I don't see any other way. We just need to decide and then act.

For other cases where there is indeed information I think we can script a header change based on various scenarios of git author/header author/author aliases where all have ICLAs. This change can be made to create one commit per file change and add the reason for the safety of the change to the commit message for traceability. Then, we can review each commit in a PR and decide if manual intervention is needed (throwing out unsafe changes, for example).

patacongo commented 4 years ago

We can only operate on the information we have. If authorship information was lost from the CVS and SVN era (git author is Greg) and the header does not list anyone other than Greg, we can either "play safe" and leave the BSD header (we would be respecting the original author's license even if we don't know who it really was) or assume that, without further information, the original author cannot prove authorship either, in which case we are safe to change to Apache. For these "unknown" cases, I don't see any other way. We just need to decide and then act.

In the SVN/CVS days, I did always give credit to the contributor in comments. However, the task of reading all comments in those 15 thousand or so commits is a very onerous task. The information is there, just not easily accessible.

AFAIK there are no un-credited changes in the repositories.

protobits commented 4 years ago

We can try to see what wording you used in general and use some regular expression to try to match the attribution.

What I'm thinking is that in any case we will always need to analyze a file by looking at its complete git history to extract git author + header author + commit message attribution, right? The "easy" cases would then be files only touched by current committers.

justinmclean commented 4 years ago

Hi,

@justinmclean For many cases there are multiple contributors of changes to a file. There is an original author, the original committer (who might be a different person) and people who have made trivial changes (as trivial as a spelling fix) or who have made substantial enhancements or re-designs.

Ideally we would have CLAs for those who have made significant changes or who owned the IP on the original contribution, whose owner may or may not be the author.

There are thousands of files involved here. This is potentially multiple man years of effort. I don't see how we can ever accomplish this.

I would try solving for the low-hanging fruit, e.g. files you know that only people who currently have a CLA have contributed to, change the licenses of those to ALv2, and work from there. I think this has already been suggested. Other code is under a compatible license, so that’s the fallback position.

Thanks, Justin

apache / nuttx

IP Clearance #128