apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Multi-token post-analysis DocValues [LUCENE-10023] #11062

Open asfimport opened 3 years ago

asfimport commented 3 years ago

The single-token case for post-analysis DocValues is accounted for by Analyzer.normalize(...) (and formerly MultiTermAwareComponent); but there are cases where it would be desirable to have post-analysis DocValues based on multi-token fields.
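For context, a minimal sketch (not from the issue) of how the single-token case can already be handled today: Analyzer.normalize(...) produces a post-analysis value whose bytes match what query-time normalization would yield, suitable for a doc-values field. The field name and analyzer choice below are illustrative.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BytesRef;

Analyzer analyzer = new StandardAnalyzer();
// normalize() applies only the "multi-term aware" parts of the analysis chain
// (e.g. lowercasing); it does not tokenize, so it yields a single value.
BytesRef normalized = analyzer.normalize("keyword", "Foo Bar");
Document doc = new Document();
doc.add(new SortedDocValuesField("keyword", normalized));
```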

The main use cases that I can think of are variants of faceting/terms aggregation. I understand that this could be viewed as "trappy" for the naive "Moby Dick word cloud" case; but:

  1. I think this can be supported fairly cleanly in Lucene
  2. Explicit user configuration of this option would help prevent people shooting themselves in the foot
  3. The current situation is arguably "trappy" as well; it just offloads the trappiness onto Lucene-external workarounds for systems/users that want to support this kind of behavior
  4. Integrating this functionality directly in Lucene would afford consistency guarantees that present opportunities for future optimizations (e.g., shared Terms dictionary between indexed terms and DocValues).

This issue proposes adding support for multi-token post-analysis DocValues directly to IndexingChain. The initial proposal involves extending the API to include IndexableFieldType.tokenDocValuesType() (in addition to existing IndexableFieldType.docValuesType()).
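To make the shape of the proposal concrete, a rough sketch of how a field might be configured if the PR's IndexableFieldType.tokenDocValuesType() were adopted; the setter name and configuration shown here are hypothetical and exist only in the spirit of the linked PR, not in released Lucene.

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DocValuesType;

FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
// Hypothetical, per the proposal (not in released Lucene): ask IndexingChain to
// also record each post-analysis token of this field as a SORTED_SET doc value.
ft.setTokenDocValuesType(DocValuesType.SORTED_SET);
ft.freeze();
```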


Migrated from LUCENE-10023 by Michael Gibney (@magibney), updated Oct 13 2021. Pull requests: https://github.com/apache/lucene/pull/208

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

In contrast to "naive word cloud" faceting, the more compelling use cases for multi-token post-analysis DocValues tend to be specialized cases, with inherent limitations on the number of tokens. A couple of comments on related issues mention path_tokenizer in Elasticsearch (see this comment, and this comment). My own use case has to do with fields that are in a sense single-valued, but with TokenFilters that may produce expanded "synonym"-style mappings (really broader/narrower/related/hierarchical entities).

And fwiw, I would argue that there are legitimate use cases even for the "naive word cloud" approach – text corpus analytics, etc.

I realize that it would be possible to do this work external to Lucene; but to me it felt cleanest to add it here, at least to have something concrete for seeding discussion. The PR initially includes only a trivial test demonstrating the new behavior; more tests can be added if there's a decision to further pursue this approach.

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I actually think it is useful for the user to do this themselves: pushing the trappiness onto them is a good thing? For example they can easily add default limits and such. I'm especially worried about search servers that would expose this blindly without such checks, and using configured analyzers that wouldn't have limit-filters and so on.

The current patch really just shoves sugar into indexwriter that isn't necessary IMO (and some more complexity in fieldtype etc). It isn't any more efficient than the user consuming the TS themselves and adding the field.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

pushing the trappiness onto them is a good thing

I think that's a reasonable perspective. That said, I was motivated to add the functionality here because I see the same questions being raised in Elasticsearch and Solr forums (and presumably it'd be useful in other contexts as well); and it's a discrete enough change that I was looking to factor out the common functionality to the appropriate place in the software stack.

I'm especially worried about search servers that would expose this blindly without such checks, and using configured analyzers that wouldn't have limit-filters and so on.

I could see this concern leading one way or the other: such a default limit (perhaps configurable via FieldType?) could be enforced with negligible overhead directly in IndexingChain, protecting users against even a buggy TokenStream. I definitely see how the careless use of this feature could be problematic; but even without built-in limits, assuming this feature is not enabled by default, a user would have to explicitly enable it – which they would presumably only do in response to a specific need for the functionality this feature supports.

The current patch really just shoves sugar into indexwriter [....] It isn't any more efficient than the user consuming the TS themselves and adding the field

Wouldn't it be more efficient though, in at least some cases? With an "external" approach such as you're suggesting, a tokenized and indexed field would have to run analysis twice: once outside of IndexingChain in order to generate (and buffer) explicit SortedSetDocValuesFields, and again within IndexingChain to generate indexed tokens (with associated token attributes). In contrast, the current PR only runs analysis once, and avoids buffering BytesRefs (i.e., via deepCopy() – also avoids creating disposable one-off SortedSetDocValuesField objects), and instead sends BytesRefs directly to the docValuesWriter.
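For reference, a minimal sketch of the "external" approach being compared here (the field name and the analyzer/text variables are illustrative): the TokenStream is consumed once outside IndexingChain to build explicit SortedSetDocValuesFields (deep-copying each term's bytes, since the attribute's BytesRef is reused), and the same text is then analyzed a second time inside IndexingChain when the TextField is inverted.

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.BytesRef;

Document doc = new Document();
// Pass 1: analyze outside IndexingChain, buffering a copy of every token.
try (TokenStream ts = analyzer.tokenStream("body", text)) {
  TermToBytesRefAttribute termAtt = ts.addAttribute(TermToBytesRefAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    // deepCopyOf: the attribute's BytesRef is reused on each incrementToken()
    doc.add(new SortedSetDocValuesField("body", BytesRef.deepCopyOf(termAtt.getBytesRef())));
  }
  ts.end();
}
// Pass 2: the same text is analyzed again inside IndexingChain at indexing time.
doc.add(new TextField("body", text, Field.Store.NO));
```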

Aside from performance, the approach taken by this PR is more than sugar in the sense that an IndexingChain-internal approach can enforce consistency between indexed terms and docValues terms, which could be useful in a number of ways (e.g., reliable behavior in applications that assume 1:1 correspondence between indexed terms and docValues, and potential future optimizations such as a shared terms dictionary).

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

In contrast, the current PR only runs analysis once, and avoids buffering BytesRefs (i.e., via deepCopy() – also avoids creating disposable one-off SortedSetDocValuesField objects), and instead sends BytesRefs directly to the docValuesWriter.

I'm not seeing this looking at your patch. I see duplicate inversion: e.g. tokenstream consumed twice.

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

In contrast, the current PR only runs analysis once, and avoids buffering BytesRefs (i.e., via deepCopy() – also avoids creating disposable one-off SortedSetDocValuesField objects), and instead sends BytesRefs directly to the docValuesWriter.

I don't view this as an advantage. In fact it seems to only matter for exactly the trappy cases that I am concerned about. If docs are so large, and you don't have enough RAM to index a single doc into SortedSetDocValues itself, well, that says everything :)

Also it isn't clear the current PR works in all cases (e.g. indexed=false, but a Field passed with a TokenStream)... There are a lot of corner cases, but things use these features (e.g. Solr's PreAnalyzedField). I am concerned about the increase in complexity with really no benefit, except making it easier for someone to shoot themselves in the foot.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

I'm not seeing this looking at your patch. I see duplicate inversion: e.g. tokenstream consumed twice.

The usual case is handled within the main "while loop" of IndexingChain.invert(int, IndexableField, boolean). The IndexingChain.tokenizedDocValuesOnly(int, IndexableField) method is a mutually exclusive simplification of the usual case (largely redundant, as noted in the javadocs for that method). Unless I'm missing something, the code should indeed only consume the TokenStream once for any given field.

it isn't clear the current PR works in all cases (e.g. indexed=false, but a Field passed with a TokenStream)

There's only one place in IndexingChain where stream.incrementToken() is called in a loop – i.e., where the TokenStream is consumed; since that's exactly where the PR hooks in (via TermToBytesRefAttribute) to generate post-analysis DocValues, I'd be surprised if it were missing edge cases. To be sure, it wouldn't be the first time I've been surprised :-) ... but it'd be an easy enough thing to check if/when making tests more robust.

I don't view this as an advantage. In fact it seems to only matter for exactly the trappy cases that I am concerned about. If docs are so large, and you don't have enough RAM to index a single doc into SortedSetDocValues itself, well, that says everything :-)

Fair point :); and accordingly, given that the "trappy" use case isn't close to my heart, I don't intend to argue this point to the hilt. I do however wonder in principle about the "text corpus analytics" use case ... intentional "misuse" being otherwise known as simply: "use". (FWIW, equivalent functionality is available in Solr, but only via "uninverting" - a rather extreme baseline for evaluating "trappiness"!).

To my mind, a lot hinges on whether this PR really does miss edge cases. In its current state (for the moment assuming it properly handles edge cases), the PR feels pretty clean and straightforward to me. The change is quite small, both in lines of code and, I'd argue, conceptually: it is kind of "sugar" in the sense that it doesn't introduce fundamental changes; it simply adds first-class support (and some efficiency gains) for a particular use case.

I guess the essence of the question is whether the likelihood of accidental misuse (and severity of associated consequences) warrant the exclusion of a fairly superficial change that would add efficient first-class support for a number of legitimate use cases. (This in addition to weighing the complexity of the change, which to me does not seem inordinately great).

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I guess the essence of the question is whether the likelihood of accidental misuse (and severity of associated consequences) warrant the exclusion of a fairly superficial change that would add efficient first-class support for a number of legitimate use cases. (This in addition to weighing the complexity of the change, which to me does not seem inordinately great).

This is the wrong way to think about it. With a library like this, we need to think about it the other way.

Why should this be in indexwriter? It isn't any more efficient than users doing it themselves, and it is a 1% case. These 1% cases kill us; they all stack up and ultimately make it impossible to optimize the 99% case.

We should make the common cases easy and the rare/corner cases possible, and that is it.

Please, don't look at your PR and think that the code will look the same five years later; it won't. The simplest changes always double and triple in size after refactorings, corner cases, etc.

For example, in this case, I imagine users hitting problems that simple limits would solve. Suddenly we have more complexity as various limits are added to indexwriter for it, all of which could be avoided if the change were just kept out of indexwriter in the first place. That way, users themselves doing expert shit could add their own logic (e.g. only put the top-N most common terms per doc in the thing, limit to N terms, both, or other more complicated things).
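As one concrete example of such user-side logic, capping the number of doc-values tokens per document can be done by wrapping the stream before consuming it, reusing the external-approach sketch above (the limit of 1000 is arbitrary; field and variable names are illustrative):

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LimitTokenCountFilter;

// Only the first 1000 tokens contribute doc values; the rest are dropped.
TokenStream limited =
    new LimitTokenCountFilter(analyzer.tokenStream("body", text), 1000);
// ... consume `limited` exactly as in the external-approach sketch above.
```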

Personally, I don't agree with this change, as it only buys us future maintenance burden. I don't see it being more efficient than the user doing it themselves.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

This is the wrong way to think about it. With a library like this, we need to think about it the other way.

I'm on board with that. I should have phrased the "essential question" differently – I didn't mean to imply that the burden of proof was on the "against" side – it's of course the opposite.

I did my best to explain the arguments in favor of inclusion, and don't have anything to add on that side. I do think it's likely more efficient than users doing this themselves ("more efficient by how much" is an open, relevant question, and indeed the benefit would be more pronounced for the cases to which you're least sympathetic).

It's a fair point about long-term complexity/scope creep (particularly in such a central component). Every change carries that risk, but on the spectrum of risk I honestly think this particular change comes in relatively low: It hooks in cleanly (reading values from TermToBytesRefAttribute at exactly the same point as indexing does, writing directly to docValuesWriter); and once the docValues are written, they're "just docValues" – there are no additional lower-level restrictions or assumptions to make/violate.

I imagine users hitting problems that simple limits would [not] solve

Yes, there are a number of nuanced limits that could be useful (and that could/should be configured via TokenFilters); but the idea of adding a simple limit directly in IndexingChain isn't about convenience, it's strictly a failsafe to prevent users shooting themselves in the foot. A simple limit is sufficient for this purpose, and with an open mind, I honestly don't anticipate a strong temptation to add more nuanced limits directly to IndexingChain.

From my perspective, the complexity/negative consequences I could anticipate from this PR are: at the index level, "docValues are docValues"; there is no distinction between pre- and post-analysis. So when checking consistency between a FieldType and the on-disk index, you'd have to check against both FieldType.docValuesType() and FieldType.tokenDocValuesType(), and perhaps check that the two are mutually exclusive for a given FieldType. The other potential issue would be if calling indexDocValue(...) from within IndexingChain.invert(int, IndexableField, boolean) were for some reason disallowed.

I'm not seeking to needlessly prolong this discussion, just continuing to think through these issues. Robert, I recognize I'm not going to end up convincing you ultimately, but I sincerely appreciate the feedback/consideration.

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

A big part of my concern is adding this code directly to indexwriter (which seems unnecessary as I mentioned before).

Since the user can do this themselves, an alternative for any provided "sugar" would be to add it as a regular Field (or similar) to the sandbox/ module.

My concerns about it being trappy/explosive wouldn't go away (I do feel like it is a loaded gun without safety mechanisms), but maybe we could figure that out in the sandbox. And we wouldn't be committing to having this stuff in indexwriter.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

That would make sense. Really the main difference in a sandbox-based impl would be performance (double-consuming TokenStream and extra buffering of token BytesRefs). Having a concrete sandbox impl available would give a baseline for evaluating any performance difference, and would also address the desire to add this functionality in a place where it would be factored out and accessible to Elasticsearch, Solr, etc...

My only hesitation about the sandbox approach is that, if there's not even a remote thought of ever considering/evaluating the performance gain that would come from integrating this in IndexingChain, or of entertaining the legitimacy of the "text corpus analytics"/many-token use case (with trappiness somehow mitigated), then the sandbox change would be exclusively sugar.

This is neither here nor there, and not an argument against the sandbox approach, but tbh it likely wouldn't have occurred to me to file a Lucene issue for this if the change were strictly about sugar, with no performance aspect. That said, I think it still might be worth pursuing a sandbox-based approach, particularly if:

  1. there's any potential of revisiting closer integration in IndexingChain, for performance reasons, or
  2. there's interest in leveraging this from "other-than-Solr" (e.g., Elasticsearch, etc. ... I'm approaching this from the Solr side, so I could just as well implement it there, in a custom Solr FieldType, I think).

I probably won't move immediately on to implementing a sandbox-based approach; I'd be interested if anyone feels inclined to weigh in on whether they'd find such an approach useful.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

if there's interest in leveraging this from "other-than-Solr" (e.g., Elasticsearch, etc. ...

We had considered supporting something like that a few years ago before changing our minds. There were really only two use cases for this: building tag clouds of analyzed tokens (a much less frequent need than doing the same over non-analyzed strings, e.g. tags) and performing significant-terms analysis. It didn't feel right to increase the API surface area of text fields with a new option to index doc values for a tiny minority of users, so we decided against adding this option. We have since moved to term vectors instead for the few use cases that need something like this, with a recommendation to only run such analysis on top hits.
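For reference, a rough sketch of the term-vectors route described above (the field name and the doc/text/reader/topHitDocId variables are illustrative): store per-document term vectors at index time, then iterate the stored vector for a handful of top hits rather than relying on doc values.

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Index time: enable term vectors on the analyzed text field.
FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
ft.setStoreTermVectors(true);
ft.freeze();
doc.add(new Field("body", text, ft));

// Search time: read back the per-document tokens for one top hit.
Terms vector = reader.getTermVector(topHitDocId, "body");
if (vector != null) {
  TermsEnum termsEnum = vector.iterator();
  BytesRef term;
  while ((term = termsEnum.next()) != null) {
    long freqInDoc = termsEnum.totalTermFreq(); // frequency within this doc
    // ... accumulate counts for the tag cloud / significant-text analysis
  }
}
```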

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Thanks for the Elasticsearch perspective, Adrien! (and I agree that you've identified the two relevant use cases).

I'm wondering whether you mean "tag clouds of analyzed tokens" to refer to naive terms aggregation (simple count) on text fields? As noted in the docs, terms aggregation on text fields remains dependent on enabling fielddata (uninverting). Alternatively of course, I gather that the single-token "tag cloud of analyzed tokens" case is supported via normalize() (formerly MultiTermAwareComponent) – though this notably doesn't support synonym-style expansion of otherwise single-token analysis.

wrt the "significant terms analysis" case: I infer direct support for the term vectors approach in the significant text aggregation (as distinct from the significant terms aggregation)?

The Elasticsearch docs for "significant terms" aggregation state that DocValues are not supported as sources of term data. Is this still the case? FWIW, from the Solr perspective, "relatedness" (analogous to Elasticsearch "significant terms") is now computed over DocValues, and could be viewed at a lower level as a composite/special case of naive "terms" aggregations over different domains (foreground, background; an approach that yields particular performance benefits for high-cardinality – e.g., full-text – fields). I view this as being a significant point to raise, because the value of naive "terms" aggregation over full text is dubious, relative to the value of doc-frequency-normalized "significant terms" aggregation. If from the Elasticsearch perspective full-domain "significant terms" aggregation isn't supported/recommended, there'd be no reason to prefer DocValues as opposed to term vectors. This is not the case from the Solr perspective.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I'm wondering whether you mean "tag clouds of analyzed tokens" to refer to naive terms aggregation (simple count) on text fields?

Yes.

wrt the "significant terms analysis" case: I infer direct support for the term vectors approach in the significant text aggregation (as distinct from the significant terms aggregation)?

Yes indeed.

DocValues are not supported as sources of term data. Is this still the case?

I need to check, but I think that this sentence is a bit misleading and means that we don't support this aggregation on fields that enable only doc values, i.e. we require the field to be indexed to have access to term frequencies.

If from the Elasticsearch perspective full-domain "significant terms" aggregation isn't supported/recommended

It's indeed no longer recommended for text fields; we point our users to significant_text instead and we only keep significant_terms for keyword fields like tags.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

this sentence is a bit misleading and means that we don't support this aggregation on fields that enable only doc values, i.e. we require the field to be indexed to have access to term frequencies.

Ah, ok! For "significant_terms" it looks like the "subset" (foreground set) count is calculated via docValues API, but the field must be indexed in order to calculate "superset" (background set) count, via one of:

  1. accessing static doc freq (for terms with no backgroundFilter), or
  2. calculating the intersection of backgroundFilter with each candidate bucket value (either via FilterableTermsEnum or BooleanQuery).

In any case, iiuc this approach is problematic for "full text" mainly because "full text" fields tend to be high-cardinality. Put another way: "significant_terms" over a hypothetical "full text" field with post-analysis DocValues enabled would be no less performant than over a DocValues-enabled keyword field of equivalent cardinality (or perhaps slightly less performant due to higher mean per-term docFreq). This is not a revolutionary observation ... but it's relevant because an entirely DocValues-driven method of calculating "relatedness"/"significant_terms" (as is the case now in Solr) should scale well enough wrt field cardinality that full-domain "significant_terms" would become viable over "full text" fields. In this context, there is a practical reason to prefer multi-token post-analysis DocValues for "full text" fields, as opposed to a restricted-domain, term-vectors-based approach.

I'm mainly mentioning this because I agree that in the absence of a purely DocValues-driven approach to calculating "relatedness"/"significant_terms", the practical argument in favor of multi-token post-analysis DocValues for "significant_terms" over full text would indeed be weak; so it's worth noting that such a purely DocValues-driven approach has in fact been implemented.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

To be clear, I'm not opposed to doing analysis of doc-values fields; I just think that it's not core enough to warrant changes to IndexingChain, as opposed to Robert's approach of doing it on top of IndexingChain, e.g. with a custom Field.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Thanks for the clarification, Adrien. I understand, and I was pointing out the availability of a scalable pure-docValues "significant terms" implementation because in the absence of such an implementation I agree it would be difficult to argue that this change is "core-enough". It seems there's consensus that this change is not core-enough even in the context of scalable full-domain pure-docValues "significant terms", so I'll just accept whatever performance hit is associated with double-consuming the token stream and buffering tokens.

To wrap things up on this issue, from my perspective: I'm now inclined to propose this double-TokenStream consumption/buffering in Solr (as opposed to in Lucene sandbox), given that there doesn't seem to be interest at the moment on the Elasticsearch side, and also because there's probably value in putting the performance hit higher up, closer to the application code (for better transparency).

Some final thoughts: the main use cases supported by this would be:

  1. faceting/terms aggregation over analyzed (potentially multi-token, e.g., synonym- or cross-reference-expanded) "tags" (analyzed full text would also incidentally be supported; the main benefit for the full-text case would be to avoid the need to "special-case" full-text terms aggregation from the user's perspective, but the full-text use case would be of limited practical utility compared to use case 2 below)
  2. full-domain "relatedness"/"significant_terms" aggregations over full-text fields (text corpus analytics, etc.). Note: the viability of this use case depends on an implementation that does not require "pivoting" to per-term "inverted" index lookups.

The main benefits of an approach integrated in IndexingChain (as opposed to a custom FieldType that double-consumes the TokenStream and buffers tokens as "standard" docValues for indexing) are:

  1. index-time performance (avoid double-consuming TokenStream and extra token buffering)
  2. (potential): Lucene-internal (guaranteed) consistency between indexed terms and docValues terms, with potential optimizations such as shared terms dictionary, and stronger guarantees about the appropriateness of access patterns that rely on consistency/compatibility between indexed terms and docValues (e.g., "refinement", etc.).

Thanks again Robert and Adrien for the feedback!