dkpro / dkpro-c4corpus

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
https://dkpro.github.io/dkpro-c4corpus
Apache License 2.0

Clarify license for Java JusText implementation #32

Closed by tfmorris 8 years ago

tfmorris commented 8 years ago

The source file headers mention an original author, but make no mention of what license the "found code" was under. It would appear that the code was derived from https://github.com/duongphuhiep/justext/tree/master/JusText/src/main/java/dh/tool/justext but that repository doesn't include any license declaration, which effectively means that it's copyrighted and unusable unless a separate license or clearance was obtained.

Was a compatible license provided by the original author? If so, could a statement to that effect please be added to the relevant source files?

reckart commented 8 years ago

Are you referring to this class?: dkpro-c4corpus/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/JusTextBoilerplateRemoval.java

tfmorris commented 8 years ago

The classes in this package: https://github.com/dkpro/dkpro-c4corpus/tree/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl

ParagraphExplorer, Paragraph, & NodeHelper, at least, all look like direct copies.

reckart commented 8 years ago

Indeed. And there is no NOTICE.txt file, either in the module root https://github.com/dkpro/dkpro-c4corpus/tree/master/dkpro-c4corpus-boilerplate or in the project root https://github.com/dkpro/dkpro-c4corpus, that explains the origin of these classes.

The status of these files must be cleared before the release.

reckart commented 8 years ago

The original justext (in Python) appears to have a BSD-like license:

Copyright (c) 2011, Jan Pomikalek <jan.pomikalek@gmail.com>

All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ''AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

That should at least make it easy for the author of the Java version to choose a liberal license.

See: https://code.google.com/archive/p/justext/source/default/source (COPYING)

tfmorris commented 8 years ago

He said he coded it while watching the World Cup and warned of the resulting quality, so it doesn't sound like he's got a huge proprietary interest in it. Hopefully he's not philosophically wedded to an incompatible license, but, even if he were, it's only a few modules to reimplement.

habernal commented 8 years ago

I'm trying to clarify that with the author explicitly.

habernal commented 8 years ago

We got the green light: https://github.com/duongphuhiep/justext/issues/1

reckart commented 8 years ago

Great! Can you please add the exact file names to the NOTICE.txt?

tfmorris commented 8 years ago

Excellent! Thanks for getting this sorted out.