apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.48k stars 985 forks source link

[PATCH] Term Vector support [LUCENE-95] #1173

Closed asfimport closed 18 years ago

asfimport commented 21 years ago

Moved from todo.xml: http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene- dev@jakarta.apache.org&msgNo=273 http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene- dev@jakarta.apache.org&msgNo=272

I don't know enough about the lucene internals to know if this was implemented in 1.3 so I'm adding it here as an open enhancement.


Migrated from LUCENE-95 by Eric Isakson, resolved May 27 2006 Environment:

Operating System: other
Platform: All

Attachments: ASF.LICENSE.NOT.GRANTED--IndexReader.patch, ASF.LICENSE.NOT.GRANTED--patch-TermVectorPosOffset.txt (versions: 2), ASF.LICENSE.NOT.GRANTED--src.zip, ASF.LICENSE.NOT.GRANTED--TermFreqVector.patch, ASF.LICENSE.NOT.GRANTED--termVectorPatch1.3.zip, ASF.LICENSE.NOT.GRANTED--termVectorPatch-1.3-2.zip, ASF.LICENSE.NOT.GRANTED--vector.patch.gz, queryTermPositionVector.patch.tar.gz

asfimport commented 21 years ago

cutting@apache.org (@cutting) (migrated from JIRA)

No, this has not yet been implemented. Dmitry posted a nearly-complete implementation over a year ago, but Lucene has changed a lot since then, so it would need some work to be updated.

asfimport commented 21 years ago

Otis Gospodnetic (migrated from JIRA)

A few more relevant messages from Dmitry: http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114748 http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114861 http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114862 http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=433778

asfimport commented 20 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Created an attachment (id=10254) Contains the new files, the patch and implementation notes

asfimport commented 20 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Attached is Dmitry's code updated for 1.3. Here are my notes on the implementation (which are also included in the attachment)

The patch is in the zip and is named termVector1.3Patch.txt and was generate using cvs diff -Nu at the root of the tree.

If there are any questions, I would be more than happy to help via the mailing list.


Notes on the re-implemenation of Dmitry's Term Vector enhancements for Lucene 1.3.

Please see http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene- dev@jakarta.apache.org&msgId=114748 for the original patch.

General Notes


I used Dmitry's code as a template by getting it working against 1.2 and then going through by hand and applying it against the HEAD. Thanks to Dmitry's great notes, it was relatively painless. All of the tests against HEAD pass.

Differences from 1.2 Version


The most significant change I had to make is that in the TermFreqVector interface the getTermNumbers() method has been replaced by a getTerms() method which returns an array of Strings. These strings are the equivalent of Term.text() and store the unique string that has been indexed. While the numbering schema worked to save space it presented a problem in 1.3 when it comes to merging because the 1.3 code could support up to Long.MAX_LONG positions (see TermEnum and SegmentTermEnum) versus Integer.MAX_INTEGER in 1.2 (at least in my understanding). This prevented me from using the termMaps array technique used in 1.2 for remapping the term numbers from the old segment to the new segment. To solve this, we needed some globally unique identifier for a term.
For this, I use the term text plus the field number that the terms came from (which is why there is a new accessor methods on TermFreqVector called get/setFieldNum).

The side benefit of this is that merging is much simpler, as we can just iterate over the readers and vectors add the terms from the old TermVector to the new TermVectorWriter, we don't have to do any remapping. The down side to this is the term vector files are going to take up more space on the disk.

I believe I have overcome the limitation that you can only retrieve term vectors on optimized indices. The SegmentsReader, which previously through runtime exceptions for the getTermVector methods now properly implements them.

Compatibility


Similar to Dmitry's, I believe the index files should be backward compatible.

Performance


Have not run thorough performance tests, but I did do the following runs, one with term vectors and one without term vectors:

Index Size: 12598 documents with 88362 terms. The documents in question are XML files where all of the TEXT was extracted and indexed.

Without TVs: Drive Space Used: 42 MB Time to index: 5 minutes, 30 seconds

With TVs: Drive Space Used: 71.3 MB Time to index: 6 minutes, 2 seconds

Your mileage may vary.

Limitations


Not sure what they are yet. I am sure there are places that could be optimized. The numbering scheme could probably be reinstituted by using some type of Paging Array or array of arrays scheme that allows you to store really large number of values.

FilterIndexReader throws an UnsupportedOperationException for the new Term Vector methods.

I did not test with compound files. Do not know if they are compatible.

Other limitations are probably those of omission. That is, are the new methods sufficient for doing what people need to do? I can think of a few:

  1. Since only terms and frequencies are stored, something to quickly calculate the actual weight of the term as it was scored for the query. I looked into this, but, frankly, I am fairly confused by the whole Scorer/Similarity interactions, especially when it comes to nested queries.

  2. Perhaps the Document object itself should have a method similar to those on IndexReader.

New File Notes


src/java/org/apache/lucene/index/SegmentTermVector.java Implementation of TermFreqVector and TermPositionVector.

src/java/org/apache/lucene/index/TermFreqVector.java Interface for describing a Document term vector. See notes above for what was changed from 1.2

src/java/org/apache/lucene/index/TermPositionVector.java No change from 1.2 version.

src/java/org/apache/lucene/index/TermVectorsReader.java Changed get methods to return TermFreqVector interface instead of explicit SegmentTermVector. Added getTermPositions method to retrieve TermPositionVector(s). Changed reading in slightly to match the writing of a the Term text instead of the term number.

src/java/org/apache/lucene/index/TermVectorsWriter.java Added documentation Changed the writing to write the term string instead of the term number Would be nice if there was a way to turn on or off the writing of positional information.
See the TODO comment.

src/test/org/apache/lucene/index/DocHelper.java Package local Class to help setup documents for testing.

src/test/org/apache/lucene/index/TestDocumentWriter.java New test class for the DocumentWriter object. Probably needs to be fleshed out more to fully test.

src/test/org/apache/lucene/index/TestFieldInfos.java Test for the new FieldInfos return values, etc.

src/test/org/apache/lucene/index/TestFieldsReader.java Basic test for FieldsReader. Needs to be expanded to fully test functionality.

src/test/org/apache/lucene/index/TestSegmentMerger.java Setups up two segments, including term vectors then merges them and asserts that items were properly merged.

src/test/org/apache/lucene/index/TestSegmentReader.java Various tests for the SegmentReader. Tests retrieving a document, deleting a document, retrieving field names and retrieving terms. Has a placeholder for retrieving norms, but I did not implement, as I didn't fully understand how norms worked.

src/test/org/apache/lucene/index/TestSegmentsReader.java Setups up a SegmentsReader made up of two Segments and does various tests on them. Needs to be filled in more completely.

src/test/org/apache/lucene/index/TestSegmentTermDocs.java Has positive and negative tests for the SegmentTermDocs.

src/test/org/apache/lucene/index/TestTermVectorsReader.java Writes out some term vectors and then asserts that they can be read back in

src/test/org/apache/lucene/index/TestTermVectorsWriter.java Writes out some term vectors and then asserts that the proper files were created w/ the proper information in them.

src/test/org/apache/lucene/search/TestTermVectors.java Searches over an indexed set of documents and then retrieves the term vectors for the documents. Also sets up a small collection of documents and maps containing term and frequency information and calculates that the term vectors are properly constructed. This is a fairly decent example of end to end use of the vectors.

Existing File Changes:


org/apache/lucene/analysis/PorterStemmer.java: Made public. Please, please, please apply this patch! I think several people have submitted this one and I vote for it as well! I use the implementation in other parts of my code and it is annoying to have to change it in my local copy every time there is a new release.

org/apache/lucene/document/Document.java Added a getNumFields() method that will return the number of fields that a document has.

org/apache/lucene/document/Field.java Same as 1.2 patch.

org/apache/lucene/index/DocumentWriter.java Same as 1.2 patch. Updated some formatting.

org/apache/lucene/index/FieldInfo.java Added constructor for indicating the term vector is stored.

org/apache/lucene/index/FieldInfos.java Added support for term vector storage. Similar to 1.2 patch
The add methods now return a Map of <field name, field number> pairs.

org/apache/lucene/index/FieldsReader.java Added comment. Now constructs the Field object with the termVector information

org/apache/lucene/index/FilterIndexReader.java Formatted code. Added in implementation of Term Vector methods, but they are not implemented.

org/apache/lucene/index/IndexReader.java Same as 1.2 patch, plus added a getTermVectorReader method which returns the TermVectorReader for the IndexReader. Added new getIndexedFieldNames(boolean) methods which retrieve all indexed field names based on whether the field stores term vectors or not. Added a package local method named getFieldInfos which returns the field infos object for the reader. This is needed in merging. Formatted code.

org/apache/lucene/index/SegmentMerger.java Added comments and a mergeVectors() method that merges the terms in from the various readers into the new segment. Formatted code.

org/apache/lucene/index/SegmentReader.java Added new TV files to the list of segments. Implemented new IndexReader methods for TVS.

org/apache/lucene/index/SegmentTermDocs.java Formatted. Added in the isValid() method, but is commented out, as I am not sure it is needed. It was in 1.2 version.

org/apache/lucene/index/SegmentTermEnum.java Same as 1.2 patch. Formatted.

org/apache/lucene/index/SegmentTermPositions.java Same as 1.2 patch.

org/apache/lucene/index/SegmentsReader.java Added a fieldInfos variable that is the summation of all of the fieldInfos from the other segments. This is used to implement the getFieldInfos() method, but is probably not all that useful. Implements the new term vector methods.

org/apache/lucene/index/TermDocs.java Added isValid method per 1.2, but it is commented out as I am not sure we need it. Formatted code.

org/apache/lucene/index/TermEnum.java Same as 1.2 patch.

org/apache/lucene/index/TermInfosWriter.java Same as 1.2 patch.

org/apache/lucene/search/FilteredTermEnum.java Implements size() method, but throws UnsupportedOperationException.

org/apache/lucene/search/FuzzyTermEnum.java Implements termNumber() and isValid() but both throw UnsupportedOperationException.

org/apache/lucene/search/MultiSearcher.java Implements new count() methods as per 1.2 patch.

org/apache/lucene/search/RemoteSearchable.java Same as MultiSearcher.

org/apache/lucene/search/Searchable.java Added count() methods onto the interface.

org/apache/lucene/search/Searcher.java Added count() methods support.

org/apache/lucene/search/WildcardTermEnum.java Implements termNumber() and isValid() but both throw UnsupportedOperationException.

org/apache/lucene/index/TestFilterIndexReader.java Implements the necessary TV methods

org/apache/lucene/search/TestBasics.java Tests the count methods for the searcher.

asfimport commented 20 years ago

cutting@apache.org (@cutting) (migrated from JIRA)

Wow!

I think the idea of removing the Term->int mapping is probably a good one, since it makes vectors available for all indexes, not just optimized ones, and that's really a requirement. It makes things bigger and slower (e.g., a vector dot-product will have to do string compares) but I think that's probably worth it.

Dmitry, others: what do you think of this approach?

Note that, since the vectors are sorted by term text, you can write them in a more compact manner by sharing string prefixes. See, for example, SegmentTermEnum.readTerm() for an example of how this can be done.

It would be best to include a format version number as the first four bytes of each file. I'm trying to add that as we introduce new files or change the format of existing files. This will make it much easier to compatibly evolve the file format.

An description of the new file formats will also be required before we make a 1.4 release. Can you draft something up about this?

I haven't actually applied the patch or tried to run this yet. One thing I note, in glancing at the code, is that it looks like you read the positions even when they're not asked for. (Or did I miss something.) It would be best if this could be avoided as it adds file i/o and increases the in-memory size of vectors. Lots of vector-based computations don't care about positions.

Thanks!

asfimport commented 20 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Yeah, the term text was a trade-off, but the only other way I saw of doing it was some type of large list object that allowed you to address an array using longs (for merging).
I thought about doing the prefix string thing like in the main index file, but wanted to keep it simple for the first go around.

I can provide file formats.

There is a TODO tag in the TermVectorWriter marking where we would need to handle the option of writing position information. Currently they are always written and always read in, per the original code. As above, I wanted to do the first pass as simple as possible.

asfimport commented 20 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Below is the diff produced on the File Formats XML file located in xdocs, as promised. I trust it will be checked for accuracy. Let me know if there are any mistakes and I will fix them.

cvs diff -Nu fileformats.xml

Index: fileformats.xml

RCS file: /home/cvspublic/jakarta-lucene/xdocs/fileformats.xml,v retrieving revision 1.6 diff -u -r1.6 fileformats.xml — fileformats.xml 13 Oct 2003 13:53:08 -0000 1.6 +++ fileformats.xml 9 Feb 2004 16:08:57 -0000 @@ -224,7 +224,11 @@ multiplied into the score for hits on that field. </p> </li>

asfimport commented 20 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Created an attachment (id=10415) Term Vector support, part 2. See the notes.txt file in the attachment.

asfimport commented 20 years ago

cutting@apache.org (@cutting) (migrated from JIRA)

Created an attachment (id=10443) vector patch file

asfimport commented 20 years ago

cutting@apache.org (@cutting) (migrated from JIRA)

Phew! This was a fair bit of work to apply! I ended up making lots of changes, mostly removing stuff that wasn't used and/or shouldn't be public.

Can folks try applying this to the CVS head and seeing that everything still works? I'll check it in soon, if no one reports any problems.

asfimport commented 20 years ago

cutting@apache.org (@cutting) (migrated from JIRA)

I committed this a few days ago.

asfimport commented 20 years ago

Bruce Ritchie (migrated from JIRA)

Created an attachment (id=10522) Diff to fix a small documentation bug in the TermFreqVector class.

asfimport commented 20 years ago

Bruce Ritchie (migrated from JIRA)

Created an attachment (id=10523) Another small patch to term vector documentation.

asfimport commented 19 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Term Vector support now has optional support for storing Token.getPositionIncrement() and Token.startOffset() and Token.endOffset() information. Control of this is done through the standard Field creation methods. All options are backward compatible (position and offset information will not be stored by default). Added many new test cases to demonstrate functionality. There are two new files needed: SegmentTermPositionVector and TermVectorOffsetInfo. All tests pass as of 8/19/04 in the AM.

Attached should be 1 patch file plus a zip containing 2 new files.

What is this info good for?

  1. I think the highlighter could use this info (offset) instead of reparsing every document at runtime
  2. Many IR algorithms need character position, etc.
  3. Others??

Remember, the values stored are based on what values you set when running the Analyzer (i.e. Token.startOffset and Token.endOffset and Token.positionIncrement). These values are controlled by the application author and can vary by application.

asfimport commented 19 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Created an attachment (id=12484) Patch file for new TermVector options

asfimport commented 19 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Created an attachment (id=12485) Two new files needed for new TermVector position and offset support

asfimport commented 19 years ago

Daniel Naber (migrated from JIRA)

Unfortunately your patch won't apply anymore, as the Field class has been modified. We know take enumerations instead of booleans and this patch should probably be adapted accordingly.

asfimport commented 19 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

How would you recommend I fix it? Is there a preferred way of adding these?
Seems like the Field class is going to be overwhelmed with inner classes if we take this approach everytime we want to add some new feature to Field.

asfimport commented 19 years ago

Daniel Naber (migrated from JIRA)

On the mailing list it has been suggested to add enumerations to Field.TermVector, e.g. maybe Field.TermVector.WITH_POSITIONS. Do you see any problem with the use of static inner classes?

asfimport commented 19 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Created an attachment (id=12671) New Patch that incorporates the new Field.TermVector parameters

asfimport commented 19 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

The latest patch provides an upgrade to the previous Term Vector patch (dated 8/19/04 12:10) that added support for storing offset and position information. The attachment containing the 2 new files (dated 8/19/04 12:11) ARE still needed for this new patch.

Cheers, Grant

asfimport commented 19 years ago

Christoph Goller (migrated from JIRA)

Hi Grant,

Thank you very much for this huge patch :-) I applied it with some changes. I would highly appreciate if you could review these changes briefly.

Christoph

asfimport commented 18 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Attached is an implementation of TermPositionVector for queries. It simply extends QueryTermVector and provides position and offset information for a query.

I implemented it for the sake of completeness (there is a similar functionality on the Document side) and b/c I need it for my ApacheCon Lucene talk in December :-).

I also modified some items in QueryTermVector to make some private members protected and added some documentation.

Contents: patch.txt – The svn diff patch to QueryTermVector newFiles.tar – The new files for QueryTermPositionVector

Thanks, Grant