apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Lucene99FlatVectorsReader.getFloatVectorValues(): NPE: Cannot read field "vectorEncoding" because "fieldEntry" is null #13626

Closed david-sitsky closed 1 day ago

david-sitsky commented 1 month ago

Description

One of our internal users hit this error when merging their index after loading all documents which contain some vector fields. I couldn't reproduce this myself. Is there a corner case which has triggered this?

java.lang.NullPointerException: Cannot read field "vectorEncoding" because "fieldEntry" is null
    at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsReader.getFloatVectorValues(Lucene99FlatVectorsReader.java:234) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader.getFloatVectorValues(Lucene99HnswVectorsReader.java:240) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.codecs.KnnVectorsWriter$MergedVectorValues.mergeFloatVectorValues(KnnVectorsWriter.java:158) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsWriter.mergeOneFieldToIndex(Lucene99FlatVectorsWriter.java:312) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter.mergeOneField(Lucene99HnswVectorsWriter.java:349) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.codecs.KnnVectorsWriter.merge(KnnVectorsWriter.java:99) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.index.SegmentMerger.mergeVectorValues(SegmentMerger.java:257) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:300) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:151) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5293) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4761) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6582) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:38) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.index.IndexWriter.executeMerge(IndexWriter.java:2326) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2321) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]
    at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2162) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]

Version and environment details

This is with Lucene 9.11.1.

benwtrent commented 1 month ago

I do see that we don't protect against null values like we should. I am not 100% sure of the scenario where this would happen, but we protect against missing fields in other paths; we should protect this one too.

benwtrent commented 1 month ago

@david-sitsky is the perfield codec being used? Was this all in the same version or were they merging segments from different versions into the same one?

Looking at the code, it seems we always assume that if we are merging a vector field, it will have a field entry in the underlying knn reader.

david-sitsky commented 1 month ago

@benwtrent - I am calling IndexWriterConfig.setCodec(new Lucene99VectorCodec()) on the codec below:

public class Lucene99VectorCodec extends FilterCodec
{
    public Lucene99VectorCodec()
    {
        super("Lucene99VectorCodec", new Lucene99Codec());
    }

    @Override
    public KnnVectorsFormat knnVectorsFormat()
    {
        return new Lucene99HnswVectorsFormat(16, 250);
    }
}

This was a brand new index with this codec in place. In this index, since each document can be associated with multiple vectors, I do have multiple child documents containing the vector fields associated with a single parent document.

david-sitsky commented 1 month ago

@benwtrent - is there a change we can make to stop the NPE? At the moment it prevents our product from completing the index operation and is a bit of a blocker.

It would be great if this could make it into the next minor release.

benwtrent commented 1 month ago

@david-sitsky I haven't been able to dig further into this yet. Could you write a test that replicates the failure? I have yet to be able to replicate the failure.

I THINK it's due to not using the per-field format codec, and thus we don't get the "does this field actually exist" check for free.
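A rough way to picture why the per-field path avoids the NPE (a hypothetical sketch with stand-in names, not Lucene's actual classes): the per-field reader only delegates to the sub-reader that actually wrote a field, so a field it never saw is answered up front instead of reaching a failed lookup deeper in the flat-vectors reader.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-field dispatch: the reader keeps a map of
// field name -> delegate, mirroring how PerFieldKnnVectorsFormat routes
// each field to the format that wrote it.
public class PerFieldDispatchSketch {
  private final Map<String, String> fieldToReader = new HashMap<>();

  // Record which delegate handles a field (done at write time in Lucene).
  public void register(String field, String reader) {
    fieldToReader.put(field, reader);
  }

  // Unknown fields return null here, before any delegate is consulted,
  // so a missing field can never dereference a null internal entry.
  public String readerFor(String field) {
    return fieldToReader.get(field);
  }
}
```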

benwtrent commented 1 month ago

OK, I was able to replicate with the following test:

 public void testTryToThrowNPE() throws Exception {
    try (var dir = newDirectory()) {
      IndexWriterConfig iwc = new IndexWriterConfig();
      iwc.setMergePolicy(new ForceMergePolicy(iwc.getMergePolicy())).setCodec(new Lucene912VectorCodec());
      try (var writer = new IndexWriter(dir, iwc)) {
        for (int i = 0; i < 10; i++) {
          var doc = new Document();
          if (random().nextBoolean()) {
            doc.add(new KnnFloatVectorField("field", new float[] {1, 2, 3}));
          }
          writer.addDocument(doc);
          if (random().nextBoolean()) {
            writer.commit();
          }
        }
        for (int i = 0; i < 10; i++) {
          var doc = new Document();
          if (random().nextBoolean()) {
            // add a vector but a different field
            doc.add(new KnnFloatVectorField("otherVector", new float[] {1, 2, 3}));
          }
          writer.addDocument(doc);
          if (random().nextBoolean()) {
            writer.commit();
          }
        }
        writer.forceMerge(1);
      }
    }
  }

The error goes away if you use the perField codec like so:

public static class Lucene99VectorCodec extends FilterCodec
  {
    public Lucene99VectorCodec() {
      super("Lucene99VectorCodec", new Lucene99Codec());
    }

    @Override
    public KnnVectorsFormat knnVectorsFormat() {
      return new PerFieldKnnVectorsFormat() {
        @Override
        public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
          return new Lucene99HnswVectorsFormat(16, 250);
        }
      };
    }
  }

I am not sure it's valid to have more than one kNN field and not use the per-field format. The logic seems to make that assumption.

@msokolov ^ What do you think of this bug? I am not sure what our behavior should be. We can easily check for a null field and return null.

benwtrent commented 1 month ago

Looking at what PointValues does:

    FieldInfo fieldInfo = readState.fieldInfos.fieldInfo(fieldName);
    if (fieldInfo == null) {
      throw new IllegalArgumentException("field=\"" + fieldName + "\" is unrecognized");
    }

Maybe kNN should do the same thing. At least it's a better error than an NPE.
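Adapting that pattern, the reader's lookup could fail fast with a descriptive error instead of an NPE. A minimal stand-alone sketch of the guard (the map here is a stand-in for the reader's internal per-field state, not the real Lucene API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a PointValues-style guard on field lookup: an unrecognized
// field raises a descriptive IllegalArgumentException rather than letting
// a null entry surface later as a NullPointerException.
public class MissingFieldGuardSketch {
  // Stand-in for the reader's fieldEntry lookup.
  private final Map<String, float[]> fields = new HashMap<>();

  public void addField(String name, float[] vector) {
    fields.put(name, vector);
  }

  public float[] getFloatVectorValues(String field) {
    float[] entry = fields.get(field);
    if (entry == null) {
      throw new IllegalArgumentException("field=\"" + field + "\" is unrecognized");
    }
    return entry;
  }
}
```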

msokolov commented 1 month ago

It makes sense to me to add null checks. I don't think PerFieldCodec should be required? The usage seems legit to me - basically it defines a set of params to use with all knn fields. Also it seems totally kosher that some docs would have one knn field and other ones a different knn field. I'm struggling to understand why this leads to a broken assumption. Can we reasonably skip trying to merge null fields?
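The "skip merging null fields" idea could look roughly like this (a generic sketch over plain lists standing in for per-segment vector readers; none of these names are Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a merge that tolerates segments where the field was never
// written (represented here as a null entry), instead of assuming every
// segment being merged has values for every merged field.
public class SkipNullMergeSketch {
  public static List<Float> merge(List<List<Float>> perSegment) {
    List<Float> merged = new ArrayList<>();
    for (List<Float> segment : perSegment) {
      if (segment == null) {
        continue; // field absent in this segment: nothing to merge
      }
      merged.addAll(segment);
    }
    return merged;
  }
}
```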

david-sitsky commented 1 month ago

@benwtrent - many thanks for your advice. Switching to PerFieldKnnVectorsFormat fixed the issue for us.

msokolov commented 1 month ago

OK, I don't know if we want to go back and revisit all this, but it seems as if we stumbled into requiring PerFieldCodec without necessarily realizing we had done so. I'm not sure if there is a better way to communicate that in the API, or if we need to actually support the non-PFC pathway (seems not, since we have a good way of doing this). I guess what I wonder is whether PFC should get "baked in" somewhere, like Codec?