apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.72k stars 1.04k forks source link

More Like This Params Refactor [LUCENE-8326] #9373

Open asfimport opened 6 years ago

asfimport commented 6 years ago

More Like This ca be refactored to improve the code readability, test coverage and maintenance. Scope of this Jira issue is to start the More Like This refactor from the More Like This Params. This Jira will not improve the current More Like This but just keep the same functionality with a refactored code.

Other Jira issues will follow improving the overall code readability, test coverage and maintenance.


Migrated from LUCENE-8326 by Alessandro Benedetti (@alessandrobenedetti), updated May 17 2019 Attachments: LUCENE-8326.patch (versions: 4) Linked issues:

Pull requests: https://github.com/apache/lucene-solr/pull/369

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

It is better to start small, from the params. other refactors will follow and be part of the bigger Jira issue.

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

https://patch-diff.githubusercontent.com/raw/apache/lucene-solr/pull/369.patch

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

First draft patch attached.

Following tests have been successfully verified :

org.apache.lucene.classification.* org.apache.lucene.queries.mlt.* org.apache.solr.search.mlt.* org.apache.solr.handler.component.DistributedMLTComponentTest org.apache.solr.handler.component.MoreLikeThisComponentTest org.apache.solr.handler.MoreLikeThisHandlerTest

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

It is required to merge 12304 first. This patch is built on top of 12304.

asfimport commented 6 years ago

Lucene/Solr QA (migrated from JIRA)

-1 overall
Vote Subsystem Runtime Comment
-1 patch 0m 5s SOLR-12299 does not apply to master. Rebase required? Wrong Branch? See https://wiki.apache.org/solr/HowToContribute#Creating_the_patch_file for help.
Subsystem Report/Notes
JIRA Issue SOLR-12299
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12921792/SOLR-12299.patch
Console output https://builds.apache.org/job/PreCommit-SOLR-Build/77/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

I am attaching the patch again. Apparently the one generated automatically from the github request was partially corrupted. Fixed now.

asfimport commented 6 years ago

Lucene/Solr QA (migrated from JIRA)

-1 overall
Vote Subsystem Runtime Comment
Prechecks
+1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
master Compile Tests
+1 compile 3m 28s master passed
Patch Compile Tests
+1 compile 2m 52s the patch passed
+1 javac 2m 52s the patch passed
+1 Release audit (RAT) 0m 59s the patch passed
+1 Check forbidden APIs 0m 49s the patch passed
+1 Validate source patterns 0m 49s the patch passed
Other Tests
+1 unit 0m 53s classification in the patch passed.
+1 unit 0m 36s queries in the patch passed.
-1 unit 72m 0s core in the patch failed.
84m 8s
Reason Tests
Failed junit tests solr.cloud.ZkControllerTest
Subsystem Report/Notes
JIRA Issue SOLR-12299
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12921948/SOLR-12299.patch
Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / 89fc02a
ant version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014
Default Java 1.8.0_172
unit https://builds.apache.org/job/PreCommit-SOLR-Build/78/artifact/out/patch-unit-solr_core.txt
Test Results https://builds.apache.org/job/PreCommit-SOLR-Build/78/testReport/
modules C: lucene/classification lucene/queries solr/core U: .
Console output https://builds.apache.org/job/PreCommit-SOLR-Build/78/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

The jenkins output is a little bit suspicious ... I can not reproduce this test failure and it look completely unrelated to me... Any view on that guys?

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

I am quite keen in moving forward the More Like This refactor, can I help in any way to proceed with a review ? Anyone out there that could help ? I have already split up the big refactor to become more review-friendly, happy to do whatever it needs to push this forward ( I would like to proceed with further developments More Like This side, but first I want the refactor to be there)

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

https://github.com/apache/lucene-solr/pull/380

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

https://github.com/apache/lucene-solr/pull/380.patch

asfimport commented 6 years ago

Lucene/Solr QA (migrated from JIRA)

-1 overall
Vote Subsystem Runtime Comment
-1 patch 0m 5s LUCENE-8326 does not apply to master. Rebase required? Wrong Branch? See https://wiki.apache.org/lucene-java/HowToContribute#Contributing_your_work for help.
Subsystem Report/Notes
JIRA Issue LUCENE-8326
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12924536/LUCENE-8326.patch
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/13/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

It is very annoying that the patch automatically generated from the Github Pull Request ( which is green to merge with the master) actually ends up being malformed...

I attach a fixed patch now, just built straight from command line. The Pull Request is still valid from a code review perspective.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Sorry, it may have been my fault. I checked "patch attached" when moving the issue to let the automated checks run

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

First looking at the API change, it would be good to understand the goals.

This change wraps 8 or 9 existing setters such as setMinTermLen with a "configuration class". There is also another class related to boosts. But everything is still just as mutable as before, so from my perspective it only adds additional indirection/abstraction which is undesired.

If we want to make MLT immutable or something like that, we should first figure out if that's worth it. From my perspective, I'm not sold on this for MoreLikeThis itself, since its lightweight and stateless, and since I can't see a way for MoreLikeThisQuery to cache efficiently.

On the other hand MoreLikeThisQuery is kind of a mess, but that isn't addressed with the refactoring. Really all queries should be immutable for caching purposes, and should all have correct equals/hashcode: but seems like its a lost cause with MoreLikeThisQuery since it does strange stuff in rewrite: its not really a per-segment thing. Because of how the query works, its not obvious to me if/how we can improve it with immutability...

Also currently MoreLikeThisQuery doesn't accept MoreLikeThis as a parameter or anything, and only uses it internally. So as it stands (also with this patch) it still has a "duplicate" API of all the parameters, which isn't great.

So I think if we want to change the API for this stuff, we should figure out what the goals are? If its just to say, consolidate api between MoreLikeThis and MoreLikeThisQuery, I can buy into that (although I have never used the latter myself, only the former). However the other queries use builders for such purposes so that's probably something to consider.

For the solr changes, my only comment would be that instead of running actual queries, isn't it good enough to just test that XYZ configuration produces a correct MLT object? Otherwise the test seems fragile from my perspective.

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

Hi Robert,

first of all thanks for your feedback, much appreciated.

My initial goal was to make More Like This original 1000 lines class:

So I started identifying the different responsibilities in the More Like This class, to separate them, in the way that if I just need to change the scoring algorithm for the interesting terms I just touch the TermScorer ect ect : You see the overall idea in these slides : https://www.slideshare.net/AlessandroBenedetti/advanced-document-similarity-with-apache-lucene

I tried to modify as less as possible the logic and tests at this stage.

So let me wrap my considerations under different topics :

Parameters Abstraction I see your point for just additional indirection/abstraction ( it is exactly just that with readability in mind). My scope for this was : "We have 600 lines of code of default and parameters for the MLT. How to make them isolated, more readable and extendable ?" And I initially thought to just put them in a separate class to remove that responsibility from the original MLT . So the focus was exclusively better readability and easy maintenance at this stage. Can you elaborate why you think this is undesired ? I don't have any strong feeling regarding this bit, so I am open to suggestions with the forementioned objective in mind.

MLT Immutable I didn't consider it , but I am completely open to do that. In such case it could be worth to add a MoreLikeThis factory that manages the parameters and create the immutable MLT object ?

MoreLikeThisQuery It was not in the scope of this refactor but I am absolutely happy to tackle that immediately as a next step, it could give it a try to see if there is space for improvement there.

Solr Tests

I completely agree, indeed as part of my additional tests which I have ready for the sequent refactors I introducedmuch more tests Lucene side than Solr side. We can elaborate this further at the right moment

asfimport commented 6 years ago

Lucene/Solr QA (migrated from JIRA)

+1 overall
Vote Subsystem Runtime Comment
Prechecks
+1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
master Compile Tests
+1 compile 3m 1s master passed
Patch Compile Tests
+1 compile 2m 52s the patch passed
+1 javac 2m 52s the patch passed
+1 Release audit (RAT) 1m 0s the patch passed
+1 Check forbidden APIs 0m 50s the patch passed
+1 Validate source patterns 0m 50s the patch passed
Other Tests
+1 unit 0m 59s classification in the patch passed.
+1 unit 0m 36s queries in the patch passed.
+1 unit 71m 31s core in the patch passed.
83m 10s
Subsystem Report/Notes
JIRA Issue LUCENE-8326
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12924770/LUCENE-8326.patch
Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / 71ed5ba
ant version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014
Default Java 1.8.0_172
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/16/testReport/
modules C: lucene/classification lucene/queries solr/core U: .
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/16/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

asfimport commented 6 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Ciao Alessandro. Just FYI: I've committed two (rather small) changes to the MLT class today that may require a rebase/ fixup to this patch.

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

Hi @dweiss, I am fixing the patch! In the meantime , what does this mean ? :

Cheers

asfimport commented 6 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

This means the relative cutoff is essentially setting the absolute cutoff value; these are not two separate conditions.

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

https://github.com/apache/lucene-solr/pull/380 has been updated

New patch attached. Thanks @dweiss for letting us know ! If you are interested in this issue any feedback is more than welcome!

asfimport commented 6 years ago

Lucene/Solr QA (migrated from JIRA)

-1 overall
Vote Subsystem Runtime Comment
Prechecks
+1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
master Compile Tests
+1 compile 3m 15s master passed
Patch Compile Tests
+1 compile 3m 46s the patch passed
+1 javac 3m 46s the patch passed
+1 Release audit (RAT) 1m 5s the patch passed
+1 Check forbidden APIs 0m 52s the patch passed
+1 Validate source patterns 0m 52s the patch passed
Other Tests
+1 unit 0m 51s classification in the patch passed.
+1 unit 0m 36s queries in the patch passed.
-1 unit 76m 25s core in the patch failed.
89m 30s
Reason Tests
Failed junit tests solr.handler.component.SearchHandlerTest
solr.cloud.TestSolrCloudWithDelegationTokens
Subsystem Report/Notes
JIRA Issue LUCENE-8326
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12925161/LUCENE-8326.patch
Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / fd929c1
ant version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014
Default Java 1.8.0_172
unit https://builds.apache.org/job/PreCommit-LUCENE-Build/17/artifact/out/patch-unit-solr_core.txt
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/17/testReport/
modules C: lucene/classification lucene/queries solr/core U: .
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/17/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

Failed - org.apache.solr.handler.component.SearchHandlerTest.testRequireZkConnectedDistrib It's not marked as BadApple but it doesn't seem related to this patch at all to me

Regression - org.apache.solr.cloud.TestSolrCloudWithDelegationTokens.testDelegationTokenRenew It's not marked as BadApple but it doesn't seem related to this patch at all to me

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

Any consideration on my answer from the 24/05 ? I would like to move this forward, happy to have a constructive discussion and revisiting the approach if necessary :)

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I feel the same way as before: we shouldn't split up a class in a user-impacting way just because its 1000 lines of code. To the user it does not matter, they just see one class with getters and setters and its easy.

If it really needs to be split up, can we try to do it in a way that doesn't impact users, e.g. move some code into package-private implementation classes?

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

Ok Robert, I see your point.

I wouldn't say it is a critical part the parameters split up ( while I do believe the interesting terms retrieval and interesting terms scoring is, but this will be a later on discussion).

Let me spend some time looking for a more balanced and conveniente solution that makes a good compromise. I will update this Jira as soon as I have a new patch/ pull request. Thank you for your time again!

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

I just attached a revised patch and Pull Request :

1) impact on user side has been considerably reduced keeping getters and setters in the MLT main class and parameters in the external class 2) Parameters with defaults have their own class, making them easy to maintain and read, it will be  easy to pass them to inner MLT modules when the refactor continues 3) Boost logic is still away from users of the MLT, this keeps the responsibility of managing boost, MLT side, open to discuss this 4) the patch is built on top of SOLR-12304 bugfix, that should go IN first anyway

Happy to revise further, if necessary and follow up with the following refactors ( in separate Jira issues)

asfimport commented 6 years ago

Lucene/Solr QA (migrated from JIRA)

-1 overall
Vote Subsystem Runtime Comment
Prechecks
+1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
master Compile Tests
+1 compile 3m 12s master passed
Patch Compile Tests
+1 compile 2m 49s the patch passed
+1 javac 2m 49s the patch passed
+1 Release audit (RAT) 0m 59s the patch passed
+1 Check forbidden APIs 0m 49s the patch passed
+1 Validate source patterns 0m 49s the patch passed
Other Tests
+1 unit 0m 51s classification in the patch passed.
+1 unit 0m 39s queries in the patch passed.
-1 unit 78m 46s core in the patch failed.
90m 38s
Reason Tests
Failed junit tests solr.cloud.autoscaling.sim.TestTriggerIntegration
Subsystem Report/Notes
JIRA Issue LUCENE-8326
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12926768/LUCENE-8326.patch
Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / a4fa168
ant version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014
Default Java 1.8.0_172
unit https://builds.apache.org/job/PreCommit-LUCENE-Build/29/artifact/out/patch-unit-solr_core.txt
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/29/testReport/
modules C: lucene/classification lucene/queries solr/core U: .
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/29/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

asfimport commented 6 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

Latest Jenkins failure doesn't seem to be related with the latest updates to the patch.

asfimport commented 5 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

Anyone interested in helping me in refactoring and improving the MLT step by step? This is the first step, in the perspective of cleaning the code and make it more maintainable and extendable. Happy to help and to support the changes necessary.

asfimport commented 5 years ago

Lucene/Solr QA (migrated from JIRA)

-1 overall
Vote Subsystem Runtime Comment
-1 patch 0m 6s LUCENE-8326 does not apply to master. Rebase required? Wrong Branch? See https://wiki.apache.org/lucene-java/HowToContribute#Contributing_your_work for help.
Subsystem Report/Notes
JIRA Issue LUCENE-8326
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12926768/LUCENE-8326.patch
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/185/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.