apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add a knn classifier based on fuzzified term queries [LUCENE-7838] #8889

Closed asfimport closed 7 years ago

asfimport commented 7 years ago

FLT (FuzzyLikeThis) mixes fuzzy matching and MLT (MoreLikeThis). In the context of Lucene-based classification, it might be useful to add such fuzziness to a dedicated KNN classifier (based on FLT queries).
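To make the idea concrete, here is a minimal, self-contained sketch of a k-nearest-neighbour classifier whose similarity is based on fuzzy (edit-distance) term matches. It is only an illustration under stated assumptions: the class `FuzzyKnnSketch`, its methods, and the scoring rule (count of query terms with a close match) are hypothetical and do not reproduce Lucene's FuzzyLikeThisQuery or the classifier committed for this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not Lucene code and not the committed classifier. */
final class FuzzyKnnSketch {

  /** A labeled training document, already split into terms (hypothetical type). */
  static final class Doc {
    final String label;
    final List<String> terms;
    Doc(String label, List<String> terms) { this.label = label; this.terms = terms; }
  }

  /** Levenshtein edit distance computed with two rolling rows. */
  static int editDistance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
  }

  /** Counts query terms that have at least one document term within maxEdits edits. */
  static int fuzzyOverlap(List<String> query, List<String> docTerms, int maxEdits) {
    int hits = 0;
    for (String q : query) {
      for (String t : docTerms) {
        if (editDistance(q, t) <= maxEdits) { hits++; break; }
      }
    }
    return hits;
  }

  /** Majority label among the k best-scoring documents; null if no document matches. */
  static String classify(List<String> query, List<Doc> training, int k, int maxEdits) {
    List<Object[]> scored = new ArrayList<>(); // pairs of {Doc, Integer score}
    for (Doc d : training) {
      int s = fuzzyOverlap(query, d.terms, maxEdits);
      if (s > 0) scored.add(new Object[] {d, s});
    }
    // Highest score first; List.sort is stable, so ties keep training order.
    scored.sort((x, y) -> Integer.compare((Integer) y[1], (Integer) x[1]));
    Map<String, Integer> votes = new HashMap<>();
    String best = null;
    for (Object[] pair : scored.subList(0, Math.min(k, scored.size()))) {
      String label = ((Doc) pair[0]).label;
      int v = votes.merge(label, 1, Integer::sum);
      if (best == null || v > votes.get(best)) best = label;
    }
    return best;
  }

  public static void main(String[] args) {
    List<Doc> training = Arrays.asList(
        new Doc("sports", Arrays.asList("football", "match", "goal")),
        new Doc("sports", Arrays.asList("tennis", "match", "score")),
        new Doc("politics", Arrays.asList("election", "vote", "parliament")));
    // Misspelled query terms still reach the "sports" documents.
    System.out.println(classify(Arrays.asList("footbal", "mach"), training, 3, 1)); // prints "sports"
  }
}
```

With `maxEdits` set to 0 the same query matches nothing, which is the gap that fuzzy matching is meant to close. A Lucene-based version would presumably run the fuzzy matching as an index query rather than scanning documents as this sketch does.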


Migrated from LUCENE-7838 by Tommaso Teofili (@tteofili), resolved Jul 05 2017

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit bd9e32d358399af7c31e732314e1ef1dd89bcfa1 in lucene-solr's branch refs/heads/master from @tteofili https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd9e32d

LUCENE-7838 - added knn classifier based on flt

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit c53d19e7b2b15fe2d9d38be3a1137339336a7f23 in lucene-solr's branch refs/heads/master from @tteofili https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c53d19e

LUCENE-7838 - removed unused import

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit c9bdce937a52e80174ce22f4e82a02da736b56c4 in lucene-solr's branch refs/heads/master from @jpountz https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c9bdce9

LUCENE-7838: Remove unused imports.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit d30d012c7c2f9de46a32d7e9eda3b17c51a7fa04 in lucene-solr's branch refs/heads/master from Tomas Fernandez Lobbe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d30d012

SOLR-10042, LUCENE-7838: Fix precommit

asfimport commented 7 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Ah... you added a dependency on the sandbox module from another module. That's quite surprising to me... I don't think that's legit? New inter-module dependencies (of any kind) I think should also deserve communication on the JIRA issue and I don't see any mention here. I also don't see a CHANGES.txt entry. I don't see a patch file either but I admit I welcome that :-)

asfimport commented 7 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

> you added a dependency on the sandbox module from another module. That's quite surprising to me... I don't think that's legit?

Why? Since we publish releases of lucene-sandbox, I assume we expect people and other modules to use it.

> New inter-module dependencies (of any kind) I think should also deserve communication on the JIRA issue and I don't see any mention here.

Since this only impacts the master branch, I thought there was no need to mention it explicitly. Also, FuzzyLikeThisQuery lives in sandbox, so I assumed the dependency did not need to be spelled out in the issue.

> I also don't see a CHANGES.txt entry

Right, there's no such entry.

> I don't see a patch file either but I admit I welcome that

I'm not sure I get your point here. Would you have expected a patch?

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I agree with David that we should be careful about adding new dependencies. Otherwise things can quickly become hairy, e.g. because of circular dependencies, or because pulling in a single module would pull in most other modules through transitive dependencies, which defeats the purpose of having modules. Depending on sandbox makes it even worse: the barrier for adding or removing code there is supposed to be low, yet this new dependency means that special care needs to be taken if we want to remove FuzzyLikeThis.
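The two risks named here, transitive pull-in and circular dependencies, can be checked mechanically on a module graph. The following is a small sketch under assumptions: the class `ModuleDepsSketch` is hypothetical, and the edges in `main` are made up for illustration and are not Lucene's actual module dependencies.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; the module graph below is hypothetical. */
final class ModuleDepsSketch {

  /** Every module reachable from start by following dependency edges. */
  static Set<String> transitiveDeps(Map<String, List<String>> graph, String start) {
    Set<String> seen = new LinkedHashSet<>();
    Deque<String> pending =
        new ArrayDeque<>(graph.getOrDefault(start, Collections.emptyList()));
    while (!pending.isEmpty()) {
      String module = pending.pop();
      if (seen.add(module)) { // expand each module only once
        for (String dep : graph.getOrDefault(module, Collections.emptyList())) {
          pending.push(dep);
        }
      }
    }
    return seen;
  }

  /** A cycle exists if some module can reach itself through its dependencies. */
  static boolean hasCycle(Map<String, List<String>> graph) {
    for (String module : graph.keySet()) {
      if (transitiveDeps(graph, module).contains(module)) return true;
    }
    return false;
  }

  public static void main(String[] args) {
    // Hypothetical edges: one direct dependency on "sandbox" drags in "queries" too.
    Map<String, List<String>> graph = new HashMap<>();
    graph.put("classification", Arrays.asList("core", "sandbox"));
    graph.put("sandbox", Arrays.asList("core", "queries"));
    graph.put("queries", Arrays.asList("core"));
    graph.put("core", Collections.<String>emptyList());
    System.out.println(transitiveDeps(graph, "classification"));
    System.out.println(hasCycle(graph));
  }
}
```

In this made-up graph, declaring a single dependency on `sandbox` makes `classification` depend on `queries` as well, and adding an edge from `sandbox` back to `classification` would make `hasCycle` return true.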

asfimport commented 7 years ago

David Smiley (@dsmiley) (migrated from JIRA)

CHANGES.txt:

I guess I need to be clearer. Why isn't there a CHANGES.txt entry? Beyond mentioning what the title says, mentioning the new dependency would be appropriate (required IMO).

patch

Never mind; you were going the CTR (commit-then-review) path, which I welcome, instead of RTC (review-then-commit). CTR is outside our de facto norms of behavior here. Maybe I should follow suit and we will try to change that :-)

asfimport commented 7 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

As per the related thread on dev@, I'll drop the dependency on the sandbox module, which is indeed not appropriate. If possible, I'd like to keep the classifier, but I'd rather not just copy-paste the FLT code from sandbox into classification, so it will take a bit of time to adapt it as needed.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 92e460389dc9b0af83c445cb029e3a51799a37dc in lucene-solr's branch refs/heads/master from @tteofili https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=92e4603

LUCENE-7838 - removed dep from sandbox, created a minimal FLT version specific for knn classification

asfimport commented 7 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

I've removed the dependency on the sandbox module and created a dedicated version of FLT named NearestFuzzyQuery in the org.apache.lucene.classification.utils package. The goal now is to refine NearestFuzzyQuery to get better classification results and to remove some FLT-specific behavior.

asfimport commented 7 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

I'm marking this as resolved; improvements will come in subsequent issues.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 8ccb61c0af3c38dab6f1a62eafb836fb6415e55c in lucene-solr's branch refs/heads/master from @tteofili https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8ccb61c

8874, LUCENE-7838 - added missing entires in changes.txt

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 056501be8b1aed17ef2244c06c4a2c1367eba166 in lucene-solr's branch refs/heads/branch_7x from @tteofili https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=056501b

8874, LUCENE-7838 - added missing entires in changes.txt

(cherry picked from commit 8ccb61c)

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 25229f21ec7b7d79c9fd7408e88290de29065672 in lucene-solr's branch refs/heads/branch_7_0 from @tteofili https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=25229f2

8874, LUCENE-7838 - added missing entires in changes.txt

(cherry picked from commit 8ccb61c)