apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Term vectors options should not be configurable per-doc [LUCENE-9078] #10120

Open asfimport opened 4 years ago

asfimport commented 4 years ago

Make term vector options constant across the index. Remove the user's ability to change the term vector options per doc, which IndexWriter currently allows.
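To make the proposed constraint concrete, here is a minimal sketch of what "constant across the index" could mean: the first document that uses a field fixes its term vector options, and any later document asking for different options on that field is rejected. This is plain Java and not Lucene code. The names `TermVectorSchemaSketch`, `Options`, and `checkConsistency` are hypothetical, and the four flags are only assumed to resemble the term vector flags a field type can carry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical model (not Lucene's implementation): term vector options are fixed per field. */
final class TermVectorSchemaSketch {

    /** Assumed set of per-field term vector flags, used only for illustration. */
    static final class Options {
        final boolean vectors, positions, offsets, payloads;

        Options(boolean vectors, boolean positions, boolean offsets, boolean payloads) {
            this.vectors = vectors;
            this.positions = positions;
            this.offsets = offsets;
            this.payloads = payloads;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Options)) return false;
            Options other = (Options) o;
            return vectors == other.vectors && positions == other.positions
                && offsets == other.offsets && payloads == other.payloads;
        }

        @Override
        public int hashCode() {
            return Objects.hash(vectors, positions, offsets, payloads);
        }

        @Override
        public String toString() {
            return "Options(vectors=" + vectors + ", positions=" + positions
                + ", offsets=" + offsets + ", payloads=" + payloads + ")";
        }
    }

    // Options recorded the first time each field name is seen.
    private final Map<String, Options> byField = new HashMap<>();

    /** Records the options for a new field; throws if a known field is given different options. */
    void checkConsistency(String field, Options requested) {
        Options existing = byField.putIfAbsent(field, requested);
        if (existing != null && !existing.equals(requested)) {
            throw new IllegalArgumentException("Inconsistent term vector options for field \""
                + field + "\": index has " + existing + ", document requested " + requested);
        }
    }

    public static void main(String[] args) {
        TermVectorSchemaSketch schema = new TermVectorSchemaSketch();
        schema.checkConsistency("body", new Options(true, true, false, false)); // first doc sets the options
        schema.checkConsistency("body", new Options(true, true, false, false)); // identical options pass
        try {
            schema.checkConsistency("body", new Options(true, false, false, false)); // differs: rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this sketch the check runs per field name rather than per document, so two different fields may still carry different options.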

Once done, consider removing Fields, as the list of fields could be obtained from FieldInfos. See the discussion in #9089.


Migrated from LUCENE-9078 by Bruno Roustant (@bruno-roustant), 2 votes, updated Mar 15 2021

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1. I'm not sure this is the only blocker for the Fields removal though: we'd still need a class to hold term vectors for multiple fields.

asfimport commented 4 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I think it's unfortunate that TVs are row-stored today. If the query/scenario only needs a subset of TV fields then there's plenty of waste in decoding the others. It's trappy to code against the current API, wherein you may inadvertently re-load all TVs from disk when getting TVs of other fields without realizing the ramifications. For example, in the UnifiedHighlighter there's a little cache mechanism to ensure TVs are only fetched once; see TermVectorReusingLeafReader. I know raising this is a distraction here; I could file an issue. It's tangentially related because the class that replaces Fields for the TV use case would be fundamentally different if we get column-stored TVs.
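As a rough illustration of the trap and the kind of workaround described above, the sketch below models a row-stored loader that returns every field's term vector for a document at once, wrapped by a one-document cache so that asking for a second field of the same document does not trigger another load. It is plain Java and does not reproduce TermVectorReusingLeafReader or any Lucene API. `TermVectorCacheSketch`, `termVector`, and the loader function are hypothetical names, and term vectors are simplified to lists of terms.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical one-document cache in front of a row-stored term vector loader. */
final class TermVectorCacheSketch {

    // Stand-in for the expensive call: loads the term vectors of ALL fields for one doc.
    private final IntFunction<Map<String, List<String>>> loadAllFields;

    private int cachedDoc = -1;                       // doc whose vectors are currently cached
    private Map<String, List<String>> cachedVectors;  // field name -> terms, for cachedDoc
    private int loadCount = 0;                        // how many times the loader actually ran

    TermVectorCacheSketch(IntFunction<Map<String, List<String>>> loadAllFields) {
        this.loadAllFields = loadAllFields;
    }

    /** Returns the terms of one field, reloading only when the requested doc changes. */
    List<String> termVector(int docId, String field) {
        if (cachedVectors == null || docId != cachedDoc) {
            cachedVectors = loadAllFields.apply(docId);
            cachedDoc = docId;
            loadCount++;
        }
        return cachedVectors.get(field);
    }

    int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        TermVectorCacheSketch cache = new TermVectorCacheSketch(docId -> {
            // Fake row-stored data: every call materializes both fields.
            Map<String, List<String>> all = new HashMap<>();
            all.put("title", Arrays.asList("doc" + docId, "title"));
            all.put("body", Arrays.asList("doc" + docId, "body", "text"));
            return all;
        });
        cache.termVector(0, "title");
        cache.termVector(0, "body");   // same doc: served from the cache
        cache.termVector(1, "title");  // new doc: triggers a second load
        System.out.println("loads: " + cache.loadCount());
    }
}
```

The cache only hides the repeated load. Each load still decodes every field, which is the waste that column-stored TVs would avoid.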

asfimport commented 4 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1