Closed asfimport closed 18 years ago
cutting@apache.org (@cutting) (migrated from JIRA)
No, this has not yet been implemented. Dmitry posted a nearly-complete implementation over a year ago, but Lucene has changed a lot since then, so it would need some work to be updated.
Otis Gospodnetic (migrated from JIRA)
A few more relevant messages from Dmitry: http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114748 http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114861 http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114862 http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=433778
Grant Ingersoll (@gsingers) (migrated from JIRA)
Created an attachment (id=10254) Contains the new files, the patch and implementation notes
Grant Ingersoll (@gsingers) (migrated from JIRA)
Attached is Dmitry's code updated for 1.3. Here are my notes on the implementation (which are also included in the attachment)
The patch is in the zip and is named termVector1.3Patch.txt and was generate using cvs diff -Nu at the root of the tree.
If there are any questions, I would be more than happy to help via the mailing list.
Notes on the re-implemenation of Dmitry's Term Vector enhancements for Lucene 1.3.
Please see http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene- dev@jakarta.apache.org&msgId=114748 for the original patch.
General Notes
I used Dmitry's code as a template by getting it working against 1.2 and then going through by hand and applying it against the HEAD. Thanks to Dmitry's great notes, it was relatively painless. All of the tests against HEAD pass.
Differences from 1.2 Version
The most significant change I had to make is that in the
TermFreqVector interface the getTermNumbers() method has been replaced by a
getTerms() method which
returns an array of Strings. These strings are the equivalent of Term.text()
and store the unique string
that has been indexed. While the numbering schema worked to save space it
presented a problem in 1.3 when
it comes to merging because the 1.3 code could support up to Long.MAX_LONG
positions (see TermEnum and
SegmentTermEnum) versus Integer.MAX_INTEGER in 1.2 (at least in my
understanding). This prevented me from
using the termMaps array technique used in 1.2 for remapping the term numbers
from the old segment to the new
segment. To solve this, we needed some globally unique identifier for a term.
For this, I use the term
text plus the field number that the terms came from (which is why there is a
new accessor methods
on TermFreqVector called get/setFieldNum).
The side benefit of this is that merging is much simpler, as we can just iterate over the readers and vectors add the terms from the old TermVector to the new TermVectorWriter, we don't have to do any remapping. The down side to this is the term vector files are going to take up more space on the disk.
I believe I have overcome the limitation that you can only retrieve term vectors on optimized indices. The SegmentsReader, which previously through runtime exceptions for the getTermVector methods now properly implements them.
Compatibility
Similar to Dmitry's, I believe the index files should be backward compatible.
Performance
Have not run thorough performance tests, but I did do the following runs, one with term vectors and one without term vectors:
Index Size: 12598 documents with 88362 terms. The documents in question are XML files where all of the TEXT was extracted and indexed.
Without TVs: Drive Space Used: 42 MB Time to index: 5 minutes, 30 seconds
With TVs: Drive Space Used: 71.3 MB Time to index: 6 minutes, 2 seconds
Your mileage may vary.
Limitations
Not sure what they are yet. I am sure there are places that could be optimized. The numbering scheme could probably be reinstituted by using some type of Paging Array or array of arrays scheme that allows you to store really large number of values.
FilterIndexReader throws an UnsupportedOperationException for the new Term Vector methods.
I did not test with compound files. Do not know if they are compatible.
Other limitations are probably those of omission. That is, are the new methods sufficient for doing what people need to do? I can think of a few:
Since only terms and frequencies are stored, something to quickly calculate the actual weight of the term as it was scored for the query. I looked into this, but, frankly, I am fairly confused by the whole Scorer/Similarity interactions, especially when it comes to nested queries.
Perhaps the Document object itself should have a method similar to those on IndexReader.
New File Notes
src/java/org/apache/lucene/index/SegmentTermVector.java Implementation of TermFreqVector and TermPositionVector.
src/java/org/apache/lucene/index/TermFreqVector.java Interface for describing a Document term vector. See notes above for what was changed from 1.2
src/java/org/apache/lucene/index/TermPositionVector.java No change from 1.2 version.
src/java/org/apache/lucene/index/TermVectorsReader.java Changed get methods to return TermFreqVector interface instead of explicit SegmentTermVector. Added getTermPositions method to retrieve TermPositionVector(s). Changed reading in slightly to match the writing of a the Term text instead of the term number.
src/java/org/apache/lucene/index/TermVectorsWriter.java
Added documentation
Changed the writing to write the term string instead of the term number
Would be nice if there was a way to turn on or off the writing of positional
information.
See the TODO comment.
src/test/org/apache/lucene/index/DocHelper.java Package local Class to help setup documents for testing.
src/test/org/apache/lucene/index/TestDocumentWriter.java New test class for the DocumentWriter object. Probably needs to be fleshed out more to fully test.
src/test/org/apache/lucene/index/TestFieldInfos.java Test for the new FieldInfos return values, etc.
src/test/org/apache/lucene/index/TestFieldsReader.java Basic test for FieldsReader. Needs to be expanded to fully test functionality.
src/test/org/apache/lucene/index/TestSegmentMerger.java Setups up two segments, including term vectors then merges them and asserts that items were properly merged.
src/test/org/apache/lucene/index/TestSegmentReader.java Various tests for the SegmentReader. Tests retrieving a document, deleting a document, retrieving field names and retrieving terms. Has a placeholder for retrieving norms, but I did not implement, as I didn't fully understand how norms worked.
src/test/org/apache/lucene/index/TestSegmentsReader.java Setups up a SegmentsReader made up of two Segments and does various tests on them. Needs to be filled in more completely.
src/test/org/apache/lucene/index/TestSegmentTermDocs.java Has positive and negative tests for the SegmentTermDocs.
src/test/org/apache/lucene/index/TestTermVectorsReader.java Writes out some term vectors and then asserts that they can be read back in
src/test/org/apache/lucene/index/TestTermVectorsWriter.java Writes out some term vectors and then asserts that the proper files were created w/ the proper information in them.
src/test/org/apache/lucene/search/TestTermVectors.java Searches over an indexed set of documents and then retrieves the term vectors for the documents. Also sets up a small collection of documents and maps containing term and frequency information and calculates that the term vectors are properly constructed. This is a fairly decent example of end to end use of the vectors.
Existing File Changes:
org/apache/lucene/analysis/PorterStemmer.java: Made public. Please, please, please apply this patch! I think several people have submitted this one and I vote for it as well! I use the implementation in other parts of my code and it is annoying to have to change it in my local copy every time there is a new release.
org/apache/lucene/document/Document.java Added a getNumFields() method that will return the number of fields that a document has.
org/apache/lucene/document/Field.java Same as 1.2 patch.
org/apache/lucene/index/DocumentWriter.java Same as 1.2 patch. Updated some formatting.
org/apache/lucene/index/FieldInfo.java Added constructor for indicating the term vector is stored.
org/apache/lucene/index/FieldInfos.java
Added support for term vector storage. Similar to 1.2 patch
The add methods now return a Map of <field name, field number> pairs.
org/apache/lucene/index/FieldsReader.java Added comment. Now constructs the Field object with the termVector information
org/apache/lucene/index/FilterIndexReader.java Formatted code. Added in implementation of Term Vector methods, but they are not implemented.
org/apache/lucene/index/IndexReader.java Same as 1.2 patch, plus added a getTermVectorReader method which returns the TermVectorReader for the IndexReader. Added new getIndexedFieldNames(boolean) methods which retrieve all indexed field names based on whether the field stores term vectors or not. Added a package local method named getFieldInfos which returns the field infos object for the reader. This is needed in merging. Formatted code.
org/apache/lucene/index/SegmentMerger.java Added comments and a mergeVectors() method that merges the terms in from the various readers into the new segment. Formatted code.
org/apache/lucene/index/SegmentReader.java Added new TV files to the list of segments. Implemented new IndexReader methods for TVS.
org/apache/lucene/index/SegmentTermDocs.java Formatted. Added in the isValid() method, but is commented out, as I am not sure it is needed. It was in 1.2 version.
org/apache/lucene/index/SegmentTermEnum.java Same as 1.2 patch. Formatted.
org/apache/lucene/index/SegmentTermPositions.java Same as 1.2 patch.
org/apache/lucene/index/SegmentsReader.java Added a fieldInfos variable that is the summation of all of the fieldInfos from the other segments. This is used to implement the getFieldInfos() method, but is probably not all that useful. Implements the new term vector methods.
org/apache/lucene/index/TermDocs.java Added isValid method per 1.2, but it is commented out as I am not sure we need it. Formatted code.
org/apache/lucene/index/TermEnum.java Same as 1.2 patch.
org/apache/lucene/index/TermInfosWriter.java Same as 1.2 patch.
org/apache/lucene/search/FilteredTermEnum.java Implements size() method, but throws UnsupportedOperationException.
org/apache/lucene/search/FuzzyTermEnum.java Implements termNumber() and isValid() but both throw UnsupportedOperationException.
org/apache/lucene/search/MultiSearcher.java Implements new count() methods as per 1.2 patch.
org/apache/lucene/search/RemoteSearchable.java Same as MultiSearcher.
org/apache/lucene/search/Searchable.java Added count() methods onto the interface.
org/apache/lucene/search/Searcher.java Added count() methods support.
org/apache/lucene/search/WildcardTermEnum.java Implements termNumber() and isValid() but both throw UnsupportedOperationException.
org/apache/lucene/index/TestFilterIndexReader.java Implements the necessary TV methods
org/apache/lucene/search/TestBasics.java Tests the count methods for the searcher.
cutting@apache.org (@cutting) (migrated from JIRA)
Wow!
I think the idea of removing the Term->int mapping is probably a good one, since it makes vectors available for all indexes, not just optimized ones, and that's really a requirement. It makes things bigger and slower (e.g., a vector dot-product will have to do string compares) but I think that's probably worth it.
Dmitry, others: what do you think of this approach?
Note that, since the vectors are sorted by term text, you can write them in a more compact manner by sharing string prefixes. See, for example, SegmentTermEnum.readTerm() for an example of how this can be done.
It would be best to include a format version number as the first four bytes of each file. I'm trying to add that as we introduce new files or change the format of existing files. This will make it much easier to compatibly evolve the file format.
An description of the new file formats will also be required before we make a 1.4 release. Can you draft something up about this?
I haven't actually applied the patch or tried to run this yet. One thing I note, in glancing at the code, is that it looks like you read the positions even when they're not asked for. (Or did I miss something.) It would be best if this could be avoided as it adds file i/o and increases the in-memory size of vectors. Lots of vector-based computations don't care about positions.
Thanks!
Grant Ingersoll (@gsingers) (migrated from JIRA)
Yeah, the term text was a trade-off, but the only other way I saw of doing it
was some type of large list object that allowed you to address an array using
longs (for merging).
I thought about doing the prefix string thing like in the main index file, but
wanted to keep it simple for the first go around.
I can provide file formats.
There is a TODO tag in the TermVectorWriter marking where we would need to handle the option of writing position information. Currently they are always written and always read in, per the original code. As above, I wanted to do the first pass as simple as possible.
Grant Ingersoll (@gsingers) (migrated from JIRA)
Below is the diff produced on the File Formats XML file located in xdocs, as promised. I trust it will be checked for accuracy. Let me know if there are any mistakes and I will fix them.
cvs diff -Nu fileformats.xml
@@
-224,7 +224,11 @@
multiplied into the score for hits on that field.
</p>
</li><li><p>Term Vectors. For each field in each document, the term vector
(sometimes called document vector) is stored. A term vector consists
of the term text, term frequency and term position.
</p>
</li>
<li><p>Deleted documents.
An optional file indicating which documents are
deleted.
</p>
@@
-804,9 +808,10 @@
</p>
<p>
Currently only the low-order bit is used of FieldBits is used. It is
one for
indexed fields, and zero for non-indexed fields.
without term vectors. </p>
<p>
@@
-1112,6 +1117,57 @@
</li>
</ol>
</subsection>
<subsection name="Term Vectors">
Term Vector support is an optional on a field by field basis.
It consists of 4
files.
<ol>
<li>
<p>The Document Index or .tvx file.</p>
<p>This contains, for each document, a pointer to the document data in the Document
(.tvd) file.
</p>
<p>DocumentIndex (.tvx) --> <DocumentPosition><sup>NumDocs</sup></p>
<p>DocumentPosition --> UInt64</p>
<p>This is used to find the position of the Document in the .tvd file.</p>
</li>
<li>
<p>The Document or .tvd file.</p>
<p>This contains, for each document, the number of fields, a list of the fields with
term vector info and finally a list of pointers to the field information in the .tvf
(Term Vector Fields) file.</p>
<p>
Document (.tvd) --> <NumFields, FieldNums, FieldPositions,><sup>NumDocs</sup>
</p>
<p>NumFields --> VInt</p>
<p>FieldNums --> <FieldNumDelta><sup>NumFields</sup></p>
<p>FieldNumDelta --> VInt</p>
<p>FieldPositions --> <FieldPosition><sup>NumFields</sup></p>
<p>FieldPosition --> VLong</p>
<p>The .tvd file is used to map out the fields that have term vectors stored and
where the field information is in the .tvf file.</p>
</li>
<li>
<p>The Field or .tvf file.</p>
<p>This file contains, for each field that has a term vector stored, a list of
the terms and their frequencies.</p>
<p>Field (.tvf) --> <NumTerms, NumDistinct, TermFreqs, TermPositionPointerDelta><sup>NumFields</sup></p>
<p>NumTerms --> VInt</p>
<p>NumDistinct -> VInt - Future Use</p>
<p>TermFreqs --> <TermText, TermFreq><sup>NumTerms</sup></p>
<p>TermText --> String</p>
<p>TermFreq --> VInt</p>
<p>TermPositionPointerDelta --> VLong</p>
<p></p>
</li>
<li>
<p>The Positions or .tvp file.</p>
<p>This contains, for each term in the Field and Document, the positional information for
each term in the document. </p>
<p>Positions (.tvp) --> <PositionDelta><sup>NumPositions</sup></p>
<p>PositionDelta --> VInt</p>
</li>
</ol> </subsection>
<subsection name="Deleted Documents">
Grant Ingersoll (@gsingers) (migrated from JIRA)
Created an attachment (id=10415) Term Vector support, part 2. See the notes.txt file in the attachment.
cutting@apache.org (@cutting) (migrated from JIRA)
Created an attachment (id=10443) vector patch file
cutting@apache.org (@cutting) (migrated from JIRA)
Phew! This was a fair bit of work to apply! I ended up making lots of changes, mostly removing stuff that wasn't used and/or shouldn't be public.
Can folks try applying this to the CVS head and seeing that everything still works? I'll check it in soon, if no one reports any problems.
cutting@apache.org (@cutting) (migrated from JIRA)
I committed this a few days ago.
Bruce Ritchie (migrated from JIRA)
Created an attachment (id=10522) Diff to fix a small documentation bug in the TermFreqVector class.
Bruce Ritchie (migrated from JIRA)
Created an attachment (id=10523) Another small patch to term vector documentation.
Grant Ingersoll (@gsingers) (migrated from JIRA)
Term Vector support now has optional support for storing Token.getPositionIncrement() and Token.startOffset() and Token.endOffset() information. Control of this is done through the standard Field creation methods. All options are backward compatible (position and offset information will not be stored by default). Added many new test cases to demonstrate functionality. There are two new files needed: SegmentTermPositionVector and TermVectorOffsetInfo. All tests pass as of 8/19/04 in the AM.
Attached should be 1 patch file plus a zip containing 2 new files.
What is this info good for?
Remember, the values stored are based on what values you set when running the Analyzer (i.e. Token.startOffset and Token.endOffset and Token.positionIncrement). These values are controlled by the application author and can vary by application.
Grant Ingersoll (@gsingers) (migrated from JIRA)
Created an attachment (id=12484) Patch file for new TermVector options
Grant Ingersoll (@gsingers) (migrated from JIRA)
Created an attachment (id=12485) Two new files needed for new TermVector position and offset support
Daniel Naber (migrated from JIRA)
Unfortunately your patch won't apply anymore, as the Field class has been modified. We know take enumerations instead of booleans and this patch should probably be adapted accordingly.
Grant Ingersoll (@gsingers) (migrated from JIRA)
How would you recommend I fix it? Is there a preferred way of adding these?
Seems like the Field class is going to be overwhelmed with inner classes if we
take this approach everytime we want to add some new feature to Field.
Daniel Naber (migrated from JIRA)
On the mailing list it has been suggested to add enumerations to Field.TermVector, e.g. maybe Field.TermVector.WITH_POSITIONS. Do you see any problem with the use of static inner classes?
Grant Ingersoll (@gsingers) (migrated from JIRA)
Created an attachment (id=12671) New Patch that incorporates the new Field.TermVector parameters
Grant Ingersoll (@gsingers) (migrated from JIRA)
The latest patch provides an upgrade to the previous Term Vector patch (dated 8/19/04 12:10) that added support for storing offset and position information. The attachment containing the 2 new files (dated 8/19/04 12:11) ARE still needed for this new patch.
Cheers, Grant
Christoph Goller (migrated from JIRA)
Hi Grant,
Thank you very much for this huge patch :-) I applied it with some changes. I would highly appreciate if you could review these changes briefly.
Christoph
Grant Ingersoll (@gsingers) (migrated from JIRA)
Attached is an implementation of TermPositionVector for queries. It simply extends QueryTermVector and provides position and offset information for a query.
I implemented it for the sake of completeness (there is a similar functionality on the Document side) and b/c I need it for my ApacheCon Lucene talk in December :-).
I also modified some items in QueryTermVector to make some private members protected and added some documentation.
Contents: patch.txt – The svn diff patch to QueryTermVector newFiles.tar – The new files for QueryTermPositionVector
Thanks, Grant
Moved from todo.xml: http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene- dev@jakarta.apache.org&msgNo=273 http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene- dev@jakarta.apache.org&msgNo=272
I don't know enough about the lucene internals to know if this was implemented in 1.3 so I'm adding it here as an open enhancement.
Migrated from LUCENE-95 by Eric Isakson, resolved May 27 2006 Environment:
Attachments: ASF.LICENSE.NOT.GRANTED--IndexReader.patch, ASF.LICENSE.NOT.GRANTED--patch-TermVectorPosOffset.txt (versions: 2), ASF.LICENSE.NOT.GRANTED--src.zip, ASF.LICENSE.NOT.GRANTED--TermFreqVector.patch, ASF.LICENSE.NOT.GRANTED--termVectorPatch1.3.zip, ASF.LICENSE.NOT.GRANTED--termVectorPatch-1.3-2.zip, ASF.LICENSE.NOT.GRANTED--vector.patch.gz, queryTermPositionVector.patch.tar.gz