Open asfimport opened 8 years ago
Roberto Cornacchia (migrated from JIRA)
I've been pointed at this bit of documentation for IndexReader.document(int docID):
NOTE: only the content of a field is returned, if that field was stored during indexing. Metadata like boost, omitNorm, IndexOptions, tokenized, etc., are not preserved.
This explains what I've reported. But I find it hard not to consider this a design flaw.
If I take the retrieved document and store it into a new index, I would expect this document to be the same as the one stored in the first index. It doesn't matter where it's stored. Those properties are defined for the fields of that document, not for a particular index.
However, if I now try to retrieve that same document from the second index (on the exact match with its isbn), it won't be found, because isbn has been tokenized. This is surely not intended, is it?
Roberto Cornacchia (migrated from JIRA)
Perhaps I can reformulate this more concisely as:
Why, in DocumentStoredFieldVisitor, is StringField arbitrarily converted into TextField? What is the point of having them as different classes if they are swapped under the hood?
This looks like a quick patch for the fact that no textField() method is present in StoredFieldVisitor.
public class DocumentStoredFieldVisitor extends StoredFieldVisitor {
  ...
  @Override
  public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException {
    final FieldType ft = new FieldType(TextField.TYPE_STORED);
    ft.setStoreTermVectors(fieldInfo.hasVectors());
    ft.setOmitNorms(fieldInfo.omitsNorms());
    ft.setIndexOptions(fieldInfo.getIndexOptions());
    doc.add(new Field(fieldInfo.name, new String(value, StandardCharsets.UTF_8), ft));
  }
Michael McCandless (@mikemccand) (migrated from JIRA)
This is indeed irritating, but it is a long standing issue in Lucene: it does not in fact store all attributes (such as the "was this field tokenized?" boolean), which means on loading the document it "guesses" (incorrectly in your case).
We tried to fix this before, in #4385, which introduced a different document class (StoredDocument) at search time to make it strongly typed, so that it was clear Lucene would not store these attributes.
But that proved problematic and we eventually reverted the change in #8028 and now we are back in the trappy state.
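To make the "guess on load" behaviour described above concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the ToyField class, the store/load helpers, and the sample isbn value are all hypothetical and exist only to show that a flag which is never written cannot be recovered on read.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model (not Lucene code): only the field name and value survive a store/load round trip. */
class StoredFieldRoundTrip {

    /** Hypothetical stand-in for a field together with one indexing attribute. */
    static final class ToyField {
        final String name;
        final String value;
        final boolean tokenized;

        ToyField(String name, String value, boolean tokenized) {
            this.name = name;
            this.value = value;
            this.tokenized = tokenized;
        }
    }

    /** The "stored" form keeps name -> value only; there is no slot for the tokenized flag. */
    static Map<String, String> store(ToyField field) {
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put(field.name, field.value);
        return stored;
    }

    /** Loading must invent the missing flag; this model always guesses "tokenized". */
    static ToyField load(Map<String, String> stored, String name) {
        return new ToyField(name, stored.get(name), true);
    }

    public static void main(String[] args) {
        // Hypothetical sample value, created untokenized like a StringField.
        ToyField original = new ToyField("isbn", "0-306-40615-2", false);
        ToyField loaded = load(store(original), "isbn");

        System.out.println(original.tokenized); // false
        System.out.println(loaded.tokenized);   // true
        System.out.println(loaded.value.equals(original.value)); // true
    }
}
```

The value comes back intact, but the attribute does not, which is the asymmetry the documentation note warns about.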
@mikemccand
This is indeed irritating, but it is a long standing issue in Lucene
It should still match against the original untokenized Term when queried with a TermQuery, right? I.e., it should not suddenly start matching against a query that assumes that the index is tokenized?
So the StringField data should still match against a TermQuery that uses the untokenized word right.. I guess @woj-tek was not seeing that..
So the StringField data should still match against a TermQuery that uses the untokenized word right.. I guess @woj-tek was not seeing that..
Hmm... I think the issue in my case is weirder (I assume you mentioned me after JAMES-4046?).
In that case during my investigation:
1) I create a document:
var document = new Document();
document.add(new StringField(ID_FIELD, "flags-1-1", Field.Store.YES));
(it yields: Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>)
2) I find the document using a query on different fields (it yields the same document), which now gives (tokenized):
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
3) I use the ID field for the update, using new Term(ID_FIELD, document.get(ID_FIELD));
- the document is not updated correctly, as it doesn't seem to be matched (it has tokenized when retrieved from the repository; not sure if attributes also count towards the matching of a field)
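One way the steps above could end in a missed match, assuming the retrieved document is written back as-is with the id field now tokenized, can be sketched with a toy inverted index in plain Java. This is not Lucene code: ToyTermIndex and its tokenize helper are hypothetical, and the tokenizer (lowercase, split on non-alphanumeric characters) is only a simplified stand-in for a real analyzer, which may split the value differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy inverted index for a single field (not Lucene code). */
class ToyTermIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Simplified stand-in for an analyzer: lowercase, then split on non-alphanumerics. */
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Indexes the value as one exact term, or as several tokens when tokenized is true. */
    void add(int docId, String value, boolean tokenized) {
        List<String> terms = tokenized ? tokenize(value) : Collections.singletonList(value);
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }
    }

    /** Exact-term lookup: true only if this precise term was indexed. */
    boolean matches(String term) {
        return postings.containsKey(term);
    }

    public static void main(String[] args) {
        ToyTermIndex untokenized = new ToyTermIndex();
        untokenized.add(0, "flags-1-1", false); // indexed the way a StringField would be

        ToyTermIndex reindexed = new ToyTermIndex();
        reindexed.add(0, "flags-1-1", true);    // written back after the flag flipped

        System.out.println(untokenized.matches("flags-1-1")); // true
        System.out.println(reindexed.matches("flags-1-1"));   // false
        System.out.println(reindexed.matches("flags"));       // true
    }
}
```

In this model the exact term is found only while the field stays untokenized. A possible way to avoid the problem in real code would be to rebuild the id field explicitly as a StringField from document.get(ID_FIELD) before writing the document back, rather than reusing the retrieved field object.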
This code:
Produces:
The StringField field isbn is not tokenized, as correctly reported by the first output, which happens right after closing the writer. However, it becomes tokenized when the index is re-opened with a new reader.
Migrated from LUCENE-7171 by Roberto Cornacchia, updated May 23 2016