apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.58k stars 1.01k forks source link

Move dictionary for Ukrainian analyzer to external dependency [LUCENE-7785] #8836

Closed asfimport closed 7 years ago

asfimport commented 7 years ago

Currently the dictionary for Ukrainian analyzer is a blob in the source tree. We should move it out to external dependency, this allows:


Migrated from LUCENE-7785 by Andriy Rysin, resolved Apr 19 2017

asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

GitHub user arysin opened a pull request:

https://github.com/apache/lucene-solr/pull/187

LUCENE-7785: Move dictionary for Ukrainan analyzer to external dependency

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/arysin/lucene-solr master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/187.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes `#187`

commit f7291006b378e11cb67d9078efe52e9ace9e3f47 Author: Andriy Rysin <arysin@gmail.com> Date: 2017-03-25T22:15:56Z

LUCENE-7785: Move dictionary for Ukrainan analyzer to external dependency

asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user arysin commented on the issue:

https://github.com/apache/lucene-solr/pull/187

It would be also nice to merge this into 6x branch
asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user dweiss commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/187#discussion_r111553388

— Diff: lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/uk/UkrainianMorfologikAnalyzer.java —
`@@` -107,11 +107,18 `@@` public UkrainianMorfologikAnalyzer(CharArraySet stopwords, CharArraySet stemExcl
   `@Override`
   protected Reader initReader(String fieldName, Reader reader) {
     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
+    // different apostrophes
     builder.add("\u2019", "'");
+    builder.add("\u0218", "'");
     builder.add("\u02BC", "'");
+    builder.add("`", "'");
+    builder.add("´", "'");
+    // ignored characters
     builder.add("\u0301", "");
asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user dweiss commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/187#discussion_r111553610

— Diff: lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/uk/UkrainianMorfologikAnalyzer.java —
`@@` -145,7 +152,7 `@@` protected TokenStreamComponents createComponents(String fieldName) {

   private static Dictionary getDictionary() {
     try {
asfimport commented 7 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Hi Andriy. The patch looks good overall, but run ant precommit; most likely you're missing jar checksums and licensing information (see top-level ant tasks and regenerate).

A separate patch file for master and 6x would be handy.

asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user sarowe commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/187#discussion_r111574446

— Diff: lucene/ivy-versions.properties —
`@@` -203,6 +203,8 `@@` org.carrot2.morfologik.version = 2.1.1
 /org.carrot2/morfologik-polish = ${org.carrot2.morfologik.version}
 /org.carrot2/morfologik-stemming = ${org.carrot2.morfologik.version}

+ua.net.nlp.morfologik-ukrainian-search.version = 3.7.4
— End diff –

This is fine, but you're missing the `/org/name=version` thing that's required to be used in `ivy.xml` files.

If you look at the rest of this file, you can see that some `/org/name` keys share the same version, and so a variable is introduced to simplify maintenance.  In your case, when there is only one `/org/name` key using this version, you should IMO eliminate the `ua.net.nlp.morfologik-ukrainian-search.version` variable entirely and add `/ua.net.nlp/morfologik-ukrainian-search = 3.7.4`.
asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user sarowe commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/187#discussion_r111573637

— Diff: lucene/analysis/morfologik/ivy.xml —
`@@` -25,6 +25,7 `@@`
     &lt;dependency org="org.carrot2" name="morfologik-polish" rev="${/org.carrot2/morfologik-polish}" conf="compile"/&gt;
     &lt;dependency org="org.carrot2" name="morfologik-fsa" rev="${/org.carrot2/morfologik-fsa}" conf="compile"/&gt;
     &lt;dependency org="org.carrot2" name="morfologik-stemming" rev="${/org.carrot2/morfologik-stemming}" conf="compile"/&gt;
+    &lt;dependency org="ua.net.nlp" name="morfologik-ukrainian-search" rev="${ua.net.nlp.morfologik-ukrainian-search.version}" conf="compile"/&gt;
— End diff –

The `rev` attribute value has to be of the form `"${/org/name}"` - this will not pass precommit.  You have to make such an entry in `ivy-versions.properties` - look at all the other examples there.
asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

-1 to commit this change.

The licenses referenced in the POM for ua.net.nlp:morfologik-ukrainian-search:3.7.4 are:

 <licenses>
    <license>
      <name>GNU Lesser General Public License</name>
      <url>http://www.gnu.org/licenses/lgpl.txt</url>
    </license>
    <license>
      <name>Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)</name>
      <url>http://creativecommons.org/licenses/by-nc-sa/4.0</url>
    </license>
  </licenses>

Neither of those licenses are acceptable for inclusion in an Apache software distribution.

Specifically, the link http://creativecommons.org/licenses/by-nc-sa/4.0 is to the non-commercial CC-BY-SA variant, which is not among those listed as acceptable here: https://www.apache.org/legal/resolved.html#cc-sa.

asfimport commented 7 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Steven is right – we can't accept these licenses, Andriy.

asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user arysin commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/187#discussion_r111595706

— Diff: lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/uk/UkrainianMorfologikAnalyzer.java —
`@@` -107,11 +107,18 `@@` public UkrainianMorfologikAnalyzer(CharArraySet stopwords, CharArraySet stemExcl
   `@Override`
   protected Reader initReader(String fieldName, Reader reader) {
     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
+    // different apostrophes
     builder.add("\u2019", "'");
+    builder.add("\u0218", "'");
     builder.add("\u02BC", "'");
+    builder.add("`", "'");
+    builder.add("´", "'");
+    // ignored characters
     builder.add("\u0301", "");
asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user arysin commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/187#discussion_r111596732

— Diff: lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/uk/UkrainianMorfologikAnalyzer.java —
`@@` -145,7 +152,7 `@@` protected TokenStreamComponents createComponents(String fieldName) {

   private static Dictionary getDictionary() {
     try {
asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user arysin commented on the issue:

https://github.com/apache/lucene-solr/pull/187

Thanks for the prompt feedback guys, I think I've addressed all the issues with the latest commits but please let me know if something is still wrong.
Once we're good with master branch I'll prepare the 6x changes.
asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I see that the POM for ua.net.nlp:morfologik-ukrainian-search:3.7.5 adds the v2 Apache license. I hereby remove my -1.

Please let me know if something is still wrong.

I'm pretty sure the current patch won't pass ant precommit - have you tried running that yet Andriy? There are missing files in lucene/licenses/ for the new dependency: a checksum file (\*.sha1), a LICENSE file and a NOTICE file. You can generate the checksum file using ant jar-checksums, but you'll need to manually create the other two. You can create morfologik-ukrainian-search.LICENSE-ASL.txt by copying from another file in that directory that contains a clean ALv2 license, e.g. commons-codec-LICENSE-ASL.txt. I'm fuzzy on the requirements for the NOTICE file, but look at the other NOTICE files in that dir to get an idea of the range of things they contain. Some info here: http://www.apache.org/dev/licensing-howto.html#overview-of-files and here: http://www.apache.org/licenses/LICENSE-2.0.html#redistribution.

asfimport commented 7 years ago

Andriy Rysin (migrated from JIRA)

Ok, thanks for the suggestions, I was able to run ant precommit and I've added/adjusted the files to make it happier. It still fails for me but dues to some issue with /org.ccil.cowan.tagsoup/tagsoup, hopefully files related to this issue are good now.

asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

It still fails for me but dues to some issue with /org.ccil.cowan.tagsoup/tagsoup, hopefully files related to this issue are good now.

Looking at your changes to ivy-versions.properties, I think the new dependency is in the wrong order (as the comment at the top of the file says, the /org/name keys must be lexically ordered) - likely the error you're seeing is telling you that. If I'm right, perhaps the error message should be fixed. Please post it here.

asfimport commented 7 years ago

Andriy Rysin (migrated from JIRA)

Here's what I get: check-lib-versions: [echo] Lib versions check under: /home/master/work/ukr/spelling/lucene-workspace/lucene-solr/lucene/.. [libversions] :: loading settings :: file = /home/master/work/ukr/spelling/lucene-workspace/lucene-solr/lucene/top-level-ivy-settings.xml [libversions] OUT-OF-ORDER coordinate key '/org.ccil.cowan.tagsoup/tagsoup' in ivy-versions.properties [libversions] Checked that ivy-versions.properties and ivy-ignore-conflicts.properties have lexically sorted '/org/name' keys and no duplicates or orphans. [libversions] Scanned 46 ivy.xml files for rev="${/org/name}" format. [libversions] Found 0 indirect dependency version conflicts. [libversions] Completed in 1.24s., 1 error(s).

asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I think the error message is okay - it's hard to be precise about the cause of out-of-order problems. The out-of-order checker can only tell that there is a problem after seeing the key following your new one. Here's your patch in context:

 +ua.net.nlp.morfologik-ukrainian-search.version = 3.7.5
 +/ua.net.nlp/morfologik-ukrainian-search = ${ua.net.nlp.morfologik-ukrainian-search.version}
 +
  /org.ccil.cowan.tagsoup/tagsoup = 1.2.1

FYI, we work pretty hard to make sure that ant precommit works properly all the time, so if you see a failure, it's very likely due to your changes.

asfimport commented 7 years ago

Andriy Rysin (migrated from JIRA)

Ahh, I see what you mean, I'll push the fix for the order once my ant precommit succeeds.

asfimport commented 7 years ago

Andriy Rysin (migrated from JIRA)

ant precommit is happy now

asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

+1, LGTM

asfimport commented 7 years ago

Shawn Heisey (@elyograg) (migrated from JIRA)

Interesting. I could have sworn I had read that LGPL was OK for binary dependencies, but I can see it right there on the legal page that it isn't. Learn something new every day.

asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user dweiss commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/187#discussion_r111922455

— Diff: lucene/licenses/morfologik-ukrainian-search-NOTICE.txt —
`@@` -0,0 +1,6 `@@`
+morfologik-ukrainian-search is a POS tag dictionary in morfolgik format adjusted for searching.
— End diff –

typo in morfolgik
asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user dweiss commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/187#discussion_r111922745

— Diff: lucene/licenses/morfologik-ukrainian-search-NOTICE.txt —
`@@` -0,0 +1,6 `@@`
+morfologik-ukrainian-search is a POS tag dictionary in morfolgik format adjusted for searching.
+It's part of dict_uk project (https://github.com/brown-uk/dict_uk)
+
+Note: to better fit into full-text search model this dictionary has all word forms in lower case but keeps lemmas for proper nouns in upper case.
+
+Licensed under GPL/LGPL, CC BY-NC-SA 4.0, and Apache License 2.0.
— End diff –

Hmm; not a lawyer, but this should better state "Licensed under Apache License 2.0"... Those multiple licenses don't agree with each other – at least it should state **or**, not **and**.
asfimport commented 7 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Will merge this in.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 230b3e6e4d27e2b17764814e461c7d96e947f762 in lucene-solr's branch refs/heads/branch_6x from @dweiss https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=230b3e6

LUCENE-7785: Move dictionary for Ukrainian analyzer to external dependency. (Andriy Rysin via Dawid Weiss)

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit e62a3ff46c855f2b0fab7c1b78895118514e23f4 in lucene-solr's branch refs/heads/master from @dweiss https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e62a3ff

LUCENE-7785: Move dictionary for Ukrainian analyzer to external dependency. (Andriy Rysin via Dawid Weiss)

asfimport commented 7 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Thanks Andriy, Steve!

asfimport commented 7 years ago

Andriy Rysin (migrated from JIRA)

Thanks Dawid! Thanks everybody for your help and feedback!

asfimport commented 6 years ago

ASF GitHub Bot (migrated from JIRA)

Github user arysin closed the pull request at:

https://github.com/apache/lucene-solr/pull/187