apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.73k stars 1.05k forks source link

checkindex: cross-check norms against docCount [LUCENE-10272] #11308

Closed asfimport closed 3 years ago

asfimport commented 3 years ago

Spinoff from #11307

Despite the current crazy ghost values / elasticsearch abuse of norms, we should still cross-check the data in checkindex.

Terms.getDocCount should equal the number of norms (we just ignore the bogus ghost values for now).

Currently checkindex is missing really any check of norms: and it turns out that non-aborting exceptions create inconsistencies.


Migrated from LUCENE-10272 by Robert Muir (@rmuir), resolved Nov 30 2021 Attachments: LUCENE-10272.patch Pull requests: https://github.com/apache/lucene/pull/493

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

See attached patch, I haven't addressed the non-aborting exception case yet, but it reproduces the problem every time:

  2> NOTE: reproduce with: gradlew test --tests TestIndexWriterExceptions.testUpdateDocsNonAbortingException -Dtests.seed=F9962994C8B838B4 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=bs-Cyrl-BA -Dtests.timezone=Asia/Oral -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   >     org.apache.lucene.index.CheckIndex$CheckIndexException: actual norm count: 0 but expected: 1
asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

From my quick inspection of invert(), it seems the bug is that if we hit non-aborting exc (e.g. tokenizer bug), then we neglect to finalize some stats and stuff in invertstate. So it should just be a matter of adjusting some exc handling here (try/finally/etc)

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

also because we had this bug before, we can just fix invert(), and conditionalize the new checkindex check on lucene 10+. I don't want checkindex to annoy users and report old indexes as being corrupt just because they had a few docs that hit a bug in their analysis chain, causing incorrect norms for the deleted doc :)

But I do want better validation of norms going forward, so we have to fix it.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

If we fix invert() then we can disable this check on live docs on 10+ indices:

        if (fieldInfo.hasNorms() && isVectors == false) {
          final NumericDocValues norms = normsProducer.getNorms(fieldInfo);
          // Cross-check terms with norms
          for (int doc = norms.nextDoc();
              doc != DocIdSetIterator.NO_MORE_DOCS;
              doc = norms.nextDoc()) {
            if (liveDocs != null && liveDocs.get(doc) == false) {
              // Norms may only be out of sync with terms on deleted documents.
              // This happens when a document fails indexing and in that case it
              // should be immediately marked as deleted by the IndexWriter.
              continue;
            }
            final long norm = norms.longValue();
            if (norm != 0 && visitedDocs.get(doc) == false) {
              throw new CheckIndexException(
                  "Document "
                      + doc
                      + " doesn't have terms according to postings but has a norm value that is not zero: "
                      + Long.toUnsignedString(norm));
            } else if (norm == 0 && visitedDocs.get(doc)) {
              throw new CheckIndexException(
                  "Document "
                      + doc
                      + " has terms according to postings but its norm value is 0, which may only be used on documents that have no terms");
            }
          }
        }
asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Right, we could even remove my check, if we improved that check to be a bit stronger to match it. I think if you consider Venn diagram of all the possibilities, some properties aren't validated by this.

Personally, i prefer comparing exact counts, as you know that they are exactly equal (vs subsets or whatever)

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I created a PR, just extending your existing check so that it is really a two-way cross-check: this way if norms are missing it fails.

I'd like to defer the aborting-exceptions to another issue, it is less important to me now: I just want simple progress on the checkindex part here. Norms have historically not had much love from checkindex, so any validation we can do is really important.

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 46a5a57724519f349728a7aa613d0b1fe77a8c14 in lucene's branch refs/heads/main from Robert Muir https://gitbox.apache.org/repos/asf?p=lucene.git;h=46a5a57

LUCENE-10272: cross-check norms with postings in checkindex (#493)

Previously, CheckIndex would iterate norms and validate each one. But if norms that should be there were missing, nothing would fail. Now it computes an expected count of norms and ensures it saw them all.

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit c89c78cee0fbcd88b23d5772ea02f785865add7a in lucene's branch refs/heads/branch_9x from Robert Muir https://gitbox.apache.org/repos/asf?p=lucene.git;h=c89c78c

LUCENE-10272: cross-check norms with postings in checkindex (#493)

Previously, CheckIndex would iterate norms and validate each one. But if norms that should be there were missing, nothing would fail. Now it computes an expected count of norms and ensures it saw them all.

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Fixed, without tackling non-aborting exceptions in IndexWriter. We may still revisit that one later, to simplify CheckIndex.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.1.0 release.