elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.51k stars 24.9k forks source link

Fail properly if ZSTD support is not available #107311

Open jpountz opened 7 months ago

jpountz commented 7 months ago

Description

Because NoopNativeAccess returns null for getZstd and newBuffer, #107289 manifested to Elasticsearch operators in a confusing way, e.g.

   │ info [o.e.n.NativeAccess] [Cri-laptop.home] cannot compress with zstd because native access is not available
   │ info [o.e.i.e.Engine] [Cri-laptop.home] [.security-7][0] failed engine [lucene commit failed] java.lang.NullPointerException: Cannot invoke "org.elasticsearch.nativeaccess.Zstd.compressBound(int)" because "zstd" is null
   │        at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.index.codec.zstd.Zstd814StoredFieldsFormat$ZstdCompressor.compress(Zstd814StoredFieldsFormat.java:176)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.flush(Lucene90CompressingStoredFieldsWriter.java:263)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.finish(Lucene90CompressingStoredFieldsWriter.java:475)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.index.StoredFieldsConsumer.flush(StoredFieldsConsumer.java:104)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:305)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:445)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:496)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:450)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:640)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3663)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4110)
   │        at org.apache.lucene.core@9.10.0/org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4072)
   │        at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2906)

It would have been better if NoopNativeAccess#getZstd and NoopNativeAccess#newBuffer had thrown an exception with a meaningful error message.

Even though it's a bug that we hit this case, as all supported platforms should have ZSTD bindings, it would be nice for this case to fail with a proper error message instead of a cryptic NullPointerException.

A related question is whether we should delay this error until ZSTD gets used for the first time, or whether we should have more proactive checks, e.g. at node startup time.

elasticsearchmachine commented 7 months ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

rjernst commented 7 months ago

The point of noop native access is to be an errorless fallback. But it requires consumers actually check the returned value from native access functions. It seems like the 814 codec could more aggressively check whether Zstd is null? Also note that this was an exceptional case, since zstd is available on all supported platforms, this was just a simple bug that affected x86-64.

I'm supportive of getting rid of noop native access, but I don't think we should make it some hybrid fallback, it's supposed to be noops, not errors.

jpountz commented 7 months ago

I'm not a fan of doing it at the call site. There are only two call sites of ZSTD right now but we envision having more of them in the future, and in my mind ZSTD is a required dependency of Elasticsearch, not an optional dependency. So I would rather have a central place that double checks that ZSTD support is available instead of requiring all call sites to do so.

Getting rid of noop native access would work for me. I guess that another approach could consist of having a bootstrap check that makes sure that ZSTD support is available (and newBuffer).

I'm rephrasing this issue to better align it with the problem rather than a possible solution.

rjernst commented 7 months ago

A bootstrap check would be in line with other native access checks we have. We are lenient on the loading, but then fail in a bootstrap check.