
norms are inconsistent with postings [LUCENE-10271] #11307

asfimport opened this issue 2 years ago (status: Open)

asfimport commented 2 years ago

See discussion: https://github.com/apache/lucene/pull/477

We should not write any norms when there are no postings. And the number of docs which have norms for a field should exactly match the number of docs with postings for the field (getDocCount).

We shouldn't write crazy norms like 0 for docs without tokens, nor should we have crazy query behavior to try to compensate for elasticsearch users that are using the wrong analysis chain.


Migrated from LUCENE-10271 by Robert Muir (@rmuir)
Attachments: LUCENE-10271_prototype.patch

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The purpose of the norms is for scoring on the field, if you have no tokens, you need no norms. It needs to be 100% consistent with the index stats. CheckIndex must verify this! It should count the number of docs with norms for the field, and ensure it lines up with getDocCount() for the field.
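
A hedged sketch of the kind of cross-check being asked for (illustrative only, not the actual CheckIndex code; the class and method names are made up):

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.Terms;
import org.apache.lucene.search.DocIdSetIterator;

final class NormsPostingsCrossCheck {
  // Count the docs that have a norm for a field and require that count
  // to line up with the postings' getDocCount() for the same field.
  static void check(LeafReader reader, String field) throws IOException {
    NumericDocValues norms = reader.getNormValues(field);
    int docsWithNorms = 0;
    if (norms != null) {
      while (norms.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        docsWithNorms++;
      }
    }
    Terms terms = reader.terms(field);
    int docCount = terms == null ? 0 : terms.getDocCount();
    if (docsWithNorms != docCount) {
      throw new IllegalStateException(
          "field '" + field + "': " + docsWithNorms
              + " docs have norms but getDocCount()=" + docCount);
    }
  }
}
```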

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Attached is a quick prototype patch. For Lucene 10 indexes it writes the norm consistently and checkindex cross-checks the norm count against postings getDocCount. For old indexes it maintains the buggy behavior and checkindex ignores norm counts.

Will have to fix a couple tests that were explicitly testing this "ghost" norm behavior.

Also, I think this may have uncovered some previous crabs: I see some random exception tests provoking wrong counts :) This is the benefit of having such consistency and cross-checks in checkIndex. I will try to investigate.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I know you'll disagree, but I would still like to advocate that the current semantics are sensible. If you do not index a field value then there is no norm value. If you index a TextField, then the norm stores the length normalization factor for this field. If the field happens to produce zero tokens, then the norm value is zero. Otherwise the norm is guaranteed to be different from zero. CheckIndex is verifying this.
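
For illustration, a minimal sketch of how such a zero-token document can arise (the class name and setup are made up for the example):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ZeroTokenNormExample {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter writer =
            new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      // The field is indexed, but the empty string analyzes to zero tokens.
      doc.add(new TextField("body", "", Field.Store.NO));
      writer.addDocument(doc);
      writer.commit();
      // Under the current semantics this doc has a norm for "body" with value 0;
      // under the proposed change it would have no norm for "body" at all.
    }
  }
}
```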

These current semantics also have the benefit of making queries fast when users have a proper schema: if all documents have a value for a field then you're guaranteed to get dense norms with fast iterators. With your change, if you are unlucky enough to get a document whose value produces no tokens, then you get the overhead of sparse norms for every term query over that field.

The current semantics are more flexible too. If you want to find documents that have a value for the field, use the norms iterator. If you want to find documents that have terms for the field, use a two-phase iterator over the norms that filters out docs that have a norm of zero.
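
As a sketch of that last suggestion (the helper name is made up; this is not an existing Lucene utility), a two-phase iterator could approximate with the norms and confirm only non-zero values:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.TwoPhaseIterator;

final class DocsWithTerms {
  // Approximation: docs that have a norm (i.e. an indexed value for the field).
  // Confirmation: the norm is non-zero, i.e. the value actually produced tokens.
  static TwoPhaseIterator docsWithTerms(LeafReader reader, String field) throws IOException {
    NumericDocValues norms = reader.getNormValues(field);
    if (norms == null) {
      return null; // no norms indexed for this field
    }
    return new TwoPhaseIterator(norms) {
      @Override
      public boolean matches() throws IOException {
        return norms.longValue() != 0;
      }

      @Override
      public float matchCost() {
        return 1f; // reading a norm is cheap
      }
    };
  }
}
```

A caller could then wrap the result with TwoPhaseIterator.asDocIdSetIterator to iterate only the documents that actually have terms.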

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

@jpountz: I only disagree with most of it :)

If there are no terms/postings for a field, then you don't need a norm. Norms don't exist for elasticsearch users to know if they had some value in their json document there, norms are for scoring terms/postings.

I think that's my primary disagreement here, they are being abused by elasticsearch, and now they have crap semantics in lucene because of it.

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> These current semantics also have the benefit of making queries fast when users have a proper schema: if all documents have a value for a field then you're guaranteed to get dense norms with fast iterators. With your change, if you are unlucky enough to get a document whose value produces no tokens, then you get the overhead of sparse norms for every term query over that field.

This is something to address separately. If a norms field is 95% or 99% dense, we should use a dense representation. It shouldn't rely on ghost values.

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I really do strongly believe we should fix this.

In order to make progress in the meantime, I will make issues to:

  1. Fix checkindex to validate norm count against getDocCount. It can still do this with the broken/buggy stuff of today, it just has to ignore zero values when counting. We should really do this cross-check. To argue that it is validated by checkindex is incorrect, it isn't. And I think we may have some issues with non-aborting exceptions to fix.
  2. Fix Lucene norms format to use dense representation without depending on ghost values. Instead it should be something sensible such as 90%/95%. Most people's data is messy. We should use a sparse representation if it's really sparse, but missing a value for a single doc out of millions isn't "sparse".
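
For illustration only (this is not the actual norms format code), item 2 amounts to a threshold decision along these lines:

```java
final class NormsEncodingChoice {
  // Pick the dense encoding whenever coverage crosses a threshold like the
  // 90%/95% suggested above, regardless of whether the remaining docs carry
  // ghost zero norms or no norm at all. The 0.90 value is only a placeholder.
  static boolean useDenseEncoding(int docsWithNorms, int maxDoc) {
    return docsWithNorms >= 0.90 * maxDoc;
  }
}
```
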
asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

@jpountz I opened #11308 to add the checkindex check, with the existing crazy semantics. I don't imagine this part is controversial, and we can enable it with a little work.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

To be clear I didn't give these semantics to norms because of Elasticsearch, I gave these semantics to norms because they felt sensible for Lucene.

Before 7.0 this question didn't exist since norms had a dense API and no getDocsWithField like doc values. So when we moved to doc-value iterators, we had to figure out what it means for a document to have a norm. In the first versions of the doc-value iterators work (before the 7.0 release), norms were still dense all the time despite the iterator API. This got addressed in #8527, so that you would only "pay for what you actually use". See this comment about the semantics:

> the current patch assigns a norm value of zero to fields that generate no tokens (can happen eg. with the empty string or if all tokens are stop words) and only considers that a document does not have norms if no text field were indexed at all. We could also decide that fields that generate no tokens are considered as missing too, I think both approaches can make sense.

> To argue that it is validated by checkindex is incorrect, it isn't.

Terms#getDocCount is not, but norms are: CheckIndex creates a bit set of documents that have terms by iterating over all postings and checks it against norms.

> And I think we may have some issues with non-aborting exceptions to fix.

It would be nice if we could still guarantee consistency with non-aborting exceptions indeed. Currently CheckIndex ignores deleted documents to work around this problem.

> Fix Lucene norms format to use dense representation without depending on ghost values. Instead it should be something sensible such as 90%/95%.

We should be able to reduce the overhead, but still not get the same efficiency as dense norms. Even if a dense representation is used under the hood, if we know that some documents miss a value for the norm we still need to perform the check for every document, so there is still overhead compared to the dense case. Hence my preference for semantics that make norms more likely to be dense.
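
To illustrate the overhead being described (a sketch with made-up names, not Lucene's scoring code):

```java
import java.io.IOException;
import org.apache.lucene.index.NumericDocValues;

final class NormReads {
  // Truly dense field: advanceExact is guaranteed to return true, so a scorer
  // can read the value unconditionally, with no per-document branch.
  static long denseRead(NumericDocValues norms, int doc) throws IOException {
    norms.advanceExact(doc);
    return norms.longValue();
  }

  // Possibly-sparse field: every read pays an extra "does this doc have a norm?"
  // check, even if the underlying storage happens to be dense.
  static long sparseRead(NumericDocValues norms, int doc) throws IOException {
    return norms.advanceExact(doc) ? norms.longValue() : 0L;
  }
}
```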

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> Terms#getDocCount is not, but norms are: CheckIndex creates a bit set of documents that have terms by iterating over all postings and checks it against norms.

It isn't about the docCount statistic (which is checked against postings elsewhere), it is that this check doesn't actually fully check. It does not verify equality, only a subset relationship. It iterates the norms, checking each one. But it won't fail if "norms are missing" that should be there, which I don't like. We can probably even improve that without fixing the non-aborting case. It just needs to count up the legit norms and check that this count is equal to (visitedDocs intersect LiveDocs).cardinality() ?
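
A sketch of that stronger equality check (illustrative names, not the actual CheckIndex code):

```java
import java.io.IOException;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

final class NormCountCheck {
  // visitedDocs is the bit set of docs seen while walking the field's postings.
  static void check(NumericDocValues norms, FixedBitSet visitedDocs, Bits liveDocs)
      throws IOException {
    // Count "legit" norms, skipping today's ghost zero values.
    long normCount = 0;
    while (norms.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      if (norms.longValue() != 0) {
        normCount++;
      }
    }
    // Count live documents that actually have postings for the field.
    long expected = 0;
    for (int doc = 0; doc < visitedDocs.length(); doc++) {
      if (visitedDocs.get(doc) && (liveDocs == null || liveDocs.get(doc))) {
        expected++;
      }
    }
    if (normCount != expected) {
      throw new IllegalStateException(
          "norm count " + normCount + " != live docs with postings " + expected);
    }
  }
}
```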

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> We should be able to reduce the overhead, but still not get the same efficiency as dense norms. Even if a dense representation is used under the hood, if we know that some documents miss a value for the norm we still need to perform the check for every document, so there is still overhead compared to the dense case. Hence my preference for semantics that make norms more likely to be dense.

I disagree with that; one simple option: we can return 0 just like lucene did before? Then it is perfectly dense without any overhead. If we're gonna support a "ghost norm", then let's use it for real performance: faster searches, not making esoteric ES use-cases faster.

Yes, I anticipate your response to this one, but I ask that we take a step back and think about what is best.
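
For illustration, the "return 0" idea above could be exposed as a dense view over sparse norms along these lines (this class does not exist in Lucene; it is only a sketch of the behavior being described):

```java
import java.io.IOException;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

final class DenseZeroNorms extends NumericDocValues {
  private final NumericDocValues sparse; // underlying sparse norms, may skip docs
  private final int maxDoc;
  private int doc = -1;

  DenseZeroNorms(NumericDocValues sparse, int maxDoc) {
    this.sparse = sparse;
    this.maxDoc = maxDoc;
  }

  @Override
  public long longValue() throws IOException {
    // If the sparse iterator is positioned on the current doc, use its value;
    // otherwise the doc has no stored norm and reads as 0.
    return sparse.docID() == doc ? sparse.longValue() : 0;
  }

  @Override
  public boolean advanceExact(int target) throws IOException {
    doc = target;
    if (sparse.docID() < target) {
      sparse.advance(target);
    }
    return true; // dense view: every document "has" a norm
  }

  @Override
  public int docID() {
    return doc;
  }

  @Override
  public int nextDoc() throws IOException {
    return advance(doc + 1);
  }

  @Override
  public int advance(int target) throws IOException {
    if (target >= maxDoc) {
      doc = DocIdSetIterator.NO_MORE_DOCS;
    } else {
      advanceExact(target);
    }
    return doc;
  }

  @Override
  public long cost() {
    return maxDoc;
  }
}
```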

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> To be clear I didn't give these semantics to norms because of Elasticsearch, I gave these semantics to norms because they felt sensible for Lucene.

And I don't mean anything personal or offensive about it. I just think the priorities may be inadvertently skewed. Instead of using ghost norms (norm value = 0) to make all scoring go faster (e.g. allowing them to be a full impl detail like before), we expose these semantics differentiating "norm exists vs value of 0" with an esoteric query to satisfy what I feel is an elasticsearch-only use-case. This seems to be the wrong tradeoff.