apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Field norm modifier (CLI tool) [LUCENE-741] #1816

Open asfimport opened 17 years ago

asfimport commented 17 years ago

I took Chris' LengthNormModifier (contrib/misc) and modified it slightly to allow us to set fake norms on an existing field, effectively making it equivalent to Field.Index.NO_NORMS.

This is related to #1526 (NO_NORMS patch) and #1574 (LengthNormModifier contrib from Chris).


Migrated from LUCENE-741 by Otis Gospodnetic (@otisg), updated Jan 12 2007. Attachments: for.nrm.patch, LUCENE-741.patch (versions: 2)

asfimport commented 17 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Committed. I'll also remove the old version of this code (+ its unit test), the one that still lives in contrib/miscellaneous/src/java/org/apache/lucene/misc/ .

asfimport commented 17 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

The norm-removing functionality was bogus - it simply "normalized the norms" to be 1 for the given field, but did not completely remove norms for a field, and did not flip the omitNorms bit for the given field, so it was never a true NO_NORMS field.
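
To make the distinction concrete, here is a minimal sketch in plain Java. It does not use Lucene's classes: `ToyField`, `flattenNorms`, and `killNorms` are hypothetical names that only model the two states described above, namely a field whose norms were all set to 1 but are still stored, and a field whose norms are gone and whose omitNorms bit is set.

```java
import java.util.Arrays;

/** Toy model of one field's norm state; this is NOT Lucene's FieldInfo. */
class ToyField {
    float[] norms;     // one norm per document, or null when no norms are stored
    boolean omitNorms; // models the per-field omitNorms bit

    ToyField(float[] norms) {
        this.norms = norms;
        this.omitNorms = false;
    }

    /** Models the old behavior: every norm becomes 1, but norms are still stored. */
    void flattenNorms() {
        if (norms != null) {
            Arrays.fill(norms, 1.0f);
        }
    }

    /** Models a true NO_NORMS conversion: drop the norm data and flip the bit. */
    void killNorms() {
        norms = null;
        omitNorms = true;
    }

    public static void main(String[] args) {
        ToyField flattened = new ToyField(new float[] {0.5f, 0.25f, 1.0f});
        flattened.flattenNorms();
        System.out.println("flattened: omitNorms=" + flattened.omitNorms
                + ", normsStored=" + (flattened.norms != null));

        ToyField killed = new ToyField(new float[] {0.5f, 0.25f, 1.0f});
        killed.killNorms();
        System.out.println("killed: omitNorms=" + killed.omitNorms
                + ", normsStored=" + (killed.norms != null));
    }
}
```

In this model the flattened field scores as if it had no length normalization, yet it still carries per-document norm data and an unset omitNorms bit, which is the gap the new patch is meant to close.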

I'll upload a new patch that does this, but it does it only for Lucene 2.0.0 and Lucene 2.1-dev before the new .nrm changes from #1831 were committed.

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

I was looking at what it would take to make this work with the .nrm file as well. I expected that there would be a test that currently fails, but there is none. So I looked into the tests and the implementation and have a few questions:

(1) under contrib, FieldNormModifier and LengthNormModifier seem quite similar, right? The first one sets with:

(2) TestFieldNormModifier.testFieldWithNoNorm() calls resetNorms() for a field that does not exist. The modifier does some work to collect the term frequencies and then calls reader.setNorm, which does nothing because there are no norms. And indeed the test verifies that there are still no norms for this field. This is confusing, I think. For some reason I assumed that calling resetNorms() for a field that has no norms would implicitly set omitNorms to false for that field and compute the norms - the inverse of killNorms(). Since this is not the case, perhaps resetNorms should throw an exception in this case?

(3) I would feel safer about this feature if the test were stricter - something like TestNorms: have several fields, modify some, each in a unique way, remove some others, then at the end verify that all the norm values of each field are exactly as expected.

(4) For killNorms to work, you can first revert the index to not use .nrm, and then "kill" as before. The code knows how to read .fN files, both for backwards compatibility and for reading segments created by DocumentWriter. The following steps will do this:

(5) It would be more efficient to optimize (and remove the .nrm file) only once, in the application, so perhaps modify the public API to take an array of fields and operate on all of them?
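
Points (2) and (5) can be sketched together in plain Java. This is only an illustration under stated assumptions, not the contrib code: `ToyNormModifier` and all of its methods are hypothetical, norms are kept in an in-memory map rather than in index files, `resetNorms` fills in a constant instead of recomputing per-document values, and `prepareCount` merely stands in for the costly "optimize and remove the .nrm file" step.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a norm modifier; none of these names are Lucene's API. */
class ToyNormModifier {
    private final Map<String, float[]> normsByField = new HashMap<>();
    int prepareCount = 0; // how many times the costly one-off step ran

    void putNorms(String field, float... norms) {
        normsByField.put(field, norms.clone());
    }

    boolean hasNorms(String field) {
        return normsByField.containsKey(field);
    }

    float[] norms(String field) {
        return normsByField.get(field);
    }

    /** Point (2): reject a field without norms instead of silently doing nothing. */
    void resetNorms(String field, float value) {
        float[] norms = normsByField.get(field);
        if (norms == null) {
            throw new IllegalArgumentException("field has no norms: " + field);
        }
        Arrays.fill(norms, value); // a real modifier would recompute per-document values
    }

    /** Point (5): run the costly preparation once, then handle every requested field. */
    void killNorms(String... fields) {
        prepareCount++; // stands in for "optimize and remove the .nrm file"
        for (String field : fields) {
            normsByField.remove(field);
        }
    }

    public static void main(String[] args) {
        ToyNormModifier m = new ToyNormModifier();
        m.putNorms("title", 0.5f, 0.25f);
        m.putNorms("body", 0.125f, 0.0625f);
        m.putNorms("tags", 1.0f, 1.0f);

        m.resetNorms("title", 1.0f);
        m.killNorms("body", "tags"); // one preparation step covers both fields
        System.out.println("title=" + m.hasNorms("title")
                + " body=" + m.hasNorms("body")
                + " prepareCount=" + m.prepareCount);

        try {
            m.resetNorms("body", 1.0f); // norms were killed, so this is rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a varargs (or array) signature, killing N fields costs one preparation step rather than N, and a caller who resets a field without norms gets an immediate error rather than a silent no-op.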

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

The attached for.nrm.patch was very noisy, so I replaced it with one created with svn diff -x --ignore-eol-style contrib/miscellaneous. It is relative to trunk.

A test was added to TestFieldNormModifier.java - testModifiedNormValuesCombinedWithKill - that verifies the exact norm values after modification.

FieldNormModifier was modified to handle the .nrm file as outlined above.