elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

zlib 1.2.12 getting corruption errors #85546

Closed mailme-gx closed 2 years ago

mailme-gx commented 2 years ago

Running ES 7.1.2 on Arch Linux. After zlib was upgraded from 1.2.11 to 1.2.12, the service did not start.

Taking this opportunity to upgrade to the latest Elasticsearch, I installed ES 8.1.0 as a single node with no existing data and got the same issue. After downgrading zlib, both versions of ES work fine.

Sample stack trace:

{"@timestamp":"2022-03-31T04:58:24.686Z", "log.level": "WARN", "message":"failing [elected-as-master ([1] nodes joined)[{gxdev1}{ntQC1xXORxaS-X7rjU0w-A}{Hye6nHtRT7iZ4qxd9FGUeg}{127.0.0.1}{127.0.0.1:9300}{cdfhilmrstw} completing election,
_BECOME_MASTER_TASK_, _FINISH_ELECTION_]]: failed to commit cluster state version [79]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[gxdev1][masterService#upda
teTask][T#1]","log.logger":"org.elasticsearch.cluster.service.MasterService","elasticsearch.cluster.uuid":"55_PjKTLS5-yDT-K-pkh6w","elasticsearch.node.id":"ntQC1xXORxaS-X7rjU0w-A","elasticsearch.node.name":"gxdev1","elasticsearch.cluster.
name":"elasticsearch","error.type":"org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException","error.message":"publication failed","error.stack_trace":"org.elasticsearch.cluster.coordination.FailedToCommitClusterStateExc
eption: publication failed\n\tat org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$4.onFailure(Coordinator.java:1718)\n\tat org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(Listenabl
eFuture.java:115)\n\tat org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:55)\n\tat org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1625)\n\
tat org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:114)\n\tat org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:165)\n\tat org.elasticsearch.cluster.coord
ination.Publication$PublicationTarget$PublishResponseHandler.onFailure(Publication.java:376)\n\tat org.elasticsearch.cluster.coordination.Coordinator$4.onFailure(Coordinator.java:1371)\n\tat org.elasticsearch.cluster.coordination.Publicat
ionTransportHandler$PublicationContext$1.onFailure(PublicationTransportHandler.java:360)\n\tat org.elasticsearch.cluster.coordination.PublicationTransportHandler$PublicationContext.lambda$sendClusterStateDiff$7(PublicationTransportHandler
.java:438)\n\tat org.elasticsearch.action.ActionListener$DelegatingActionListener.onFailure(ActionListener.java:192)\n\tat org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66)\n\tat org.elasticsearch.action
.ActionListener$RunAfterActionListener.onFailure(ActionListener.java:350)\n\tat org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66)\n\tat org.elasticsearch.action.ActionListener$RunAfterActionListener.onFa
ilure(ActionListener.java:350)\n\tat org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48)\n\tat org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleExce
ption(TransportService.java:1349)\n\tat org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1458)\n\tat org.elasticsearch.transport.TransportService$DirectResponseChannel$2.run(Transpo
rtService.java:1437)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\nCaused by: org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed e
xecution\n\tat org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80)\n\tat org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:72)\n\tat org.elasticsearch.common.util.con
current.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:112)\n\t... 21 more\nCaused by: java.util.concurrent.ExecutionException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=2
e603023 actual=f0db10c0 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path=\"/mq_cluster/data/elasticsearch/_state/_9b.fdt\")))\n\tat org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257)\n\tat org.
elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:231)\n\tat org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:53)\n\tat org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.jav
a:65)\n\t... 22 more\nCaused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=2e603023 actual=f0db10c0 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path=\"/mq_cluster/data/elasticse
arch/_state/_9b.fdt\")))\n\tat org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440)\n\tat org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:123)\n\tat org.apache.lucene.co
decs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:98)\n\tat org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5563)\n\tat org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(Docum
entsWriterPerThread.java:537)\n\tat org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:468)\n\tat org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:497)\n\tat org.apache.lucene.index.Do
cumentsWriter.flushAllThreads(DocumentsWriter.java:676)\n\tat org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4014)\n\tat org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3988)\n\tat org.apache.lucene.index.IndexWri
ter.flush(IndexWriter.java:3967)\n\tat org.elasticsearch.gateway.PersistedClusterStateService$MetadataIndexWriter.flush(PersistedClusterStateService.java:692)\n\tat org.elasticsearch.gateway.PersistedClusterStateService$Writer.addMetadata
(PersistedClusterStateService.java:991)\n\tat org.elasticsearch.gateway.PersistedClusterStateService$Writer.overwriteMetadata(PersistedClusterStateService.java:975)\n\tat org.elasticsearch.gateway.PersistedClusterStateService$Writer.write
FullStateAndCommit(PersistedClusterStateService.java:788)\n\tat org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setLastAcceptedState(GatewayMetaState.java:504)\n\tat org.elasticsearch.cluster.coordination.CoordinationState
.handlePublishRequest(CoordinationState.java:392)\n\tat org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:418)\n\tat org.elasticsearch.cluster.coordination.PublicationTransportHandler.acceptState(Pub
licationTransportHandler.java:200)\n\tat org.elasticsearch.cluster.coordination.PublicationTransportHandler.handleIncomingPublishRequest(PublicationTransportHandler.java:183)\n\tat org.elasticsearch.cluster.coordination.PublicationTranspo
rtHandler.lambda$new$0(PublicationTransportHandler.java:103)\n\tat org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:67)\n\tat org.elasticsearch.transport.TransportService$6.doRun(Transp
ortService.java:917)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:776)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26
)\n\t... 3 more\n"}
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$4.onFailure(Coordinator.java:1718)
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:115)
        at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:55)
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1625)
        at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:114)
        at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:165)
        at org.elasticsearch.cluster.coordination.Publication$PublicationTarget$PublishResponseHandler.onFailure(Publication.java:376)
        at org.elasticsearch.cluster.coordination.Coordinator$4.onFailure(Coordinator.java:1371)
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler$PublicationContext$1.onFailure(PublicationTransportHandler.java:360)
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler$PublicationContext.lambda$sendClusterStateDiff$7(PublicationTransportHandler.java:438)
        at org.elasticsearch.action.ActionListener$DelegatingActionListener.onFailure(ActionListener.java:192)
        at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66)
        at org.elasticsearch.action.ActionListener$RunAfterActionListener.onFailure(ActionListener.java:350)
        at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66)
        at org.elasticsearch.action.ActionListener$RunAfterActionListener.onFailure(ActionListener.java:350)
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48)
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1349)
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1458)
        at org.elasticsearch.transport.TransportService$DirectResponseChannel$2.run(TransportService.java:1437)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
        at org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80)
        at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:72)
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:112)
        ... 21 more
Caused by: java.util.concurrent.ExecutionException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=2e603023 actual=f0db10c0 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mq_cluster/data/elasticsearch/_state/_9b.fdt")))
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:231)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:53)
        at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:65)
        ... 22 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=2e603023 actual=f0db10c0 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mq_cluster/data/elasticsearch/_state/_9b.fdt")))
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440)
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:123)
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:98)
        at org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5563)
        at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:537)
        at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:468)
        at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:497)
        at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:676)
        at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4014)
        at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3988)
        at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3967)
        at org.elasticsearch.gateway.PersistedClusterStateService$MetadataIndexWriter.flush(PersistedClusterStateService.java:692)
        at org.elasticsearch.gateway.PersistedClusterStateService$Writer.addMetadata(PersistedClusterStateService.java:991)
        at org.elasticsearch.gateway.PersistedClusterStateService$Writer.overwriteMetadata(PersistedClusterStateService.java:975)
        at org.elasticsearch.gateway.PersistedClusterStateService$Writer.writeFullStateAndCommit(PersistedClusterStateService.java:788)
        at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setLastAcceptedState(GatewayMetaState.java:504)
        at org.elasticsearch.cluster.coordination.CoordinationState.handlePublishRequest(CoordinationState.java:392)
        at org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:418)
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler.acceptState(PublicationTransportHandler.java:200)
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler.handleIncomingPublishRequest(PublicationTransportHandler.java:183)
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler.lambda$new$0(PublicationTransportHandler.java:103)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:67)
        at org.elasticsearch.transport.TransportService$6.doRun(TransportService.java:917)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:776)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
        ... 3 more
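The CorruptIndexException above is Lucene's footer check failing: each Lucene file ends with a stored CRC-32 of its body, and on read the CRC is recomputed and compared. The sketch below mimics the shape of that check (this is an illustrative simplification, not Lucene's actual footer layout, which also includes a magic value and algorithm ID):

```python
import struct
import zlib

def check_footer(data: bytes) -> None:
    """Compare a stored CRC-32 footer against a recomputed one,
    mimicking the shape of Lucene's CodecUtil.checkFooter."""
    body, stored = data[:-4], struct.unpack(">I", data[-4:])[0]
    actual = zlib.crc32(body) & 0xFFFFFFFF
    if actual != stored:
        # Same style of message as the log above: expected vs. actual
        raise IOError(f"checksum failed: expected={stored:x} actual={actual:x}")

payload = b"segment bytes"
good = payload + struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)
check_footer(good)  # passes silently

bad = payload + struct.pack(">I", 0xDEADBEEF)
# check_footer(bad) raises IOError with mismatching checksums
```

If the CRC routine itself (here, zlib's) returns different values on write and read, every such footer check can fail even though the bytes on disk are intact, which matches the symptom in this issue.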
DaveCTurner commented 2 years ago

7.1.2 is over a year past EOL and Arch isn't one of the supported Linux distributions. Can you reproduce this in a supported config?

elasticmachine commented 2 years ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

DaveCTurner commented 2 years ago

Also, what exactly is your platform as reported by uname -a? This doesn't seem to reproduce on my box:

$ uname -a
Linux david-turner 5.4.0-107-generic #121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

However the latest zlib does have some platform-specific changes to how CRCs are calculated which might explain this.
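CRC-32 is fully specified, so every zlib build must produce identical results regardless of which internal implementation strategy it uses. A quick sanity check (sketched here with Python's zlib binding; note CPython may bundle its own zlib rather than use the system one) is to compare against the standard check value and to verify that chunked, incremental computation agrees with a one-shot computation:

```python
import zlib

# CRC-32 of the ASCII string "123456789" is the standard check
# value 0xCBF43926 for the CRC-32/ISO-HDLC polynomial.
assert zlib.crc32(b"123456789") & 0xFFFFFFFF == 0xCBF43926

# Incremental computation must agree with one-shot computation,
# whatever read pattern the caller happens to use.
data = bytes(range(256)) * 100
crc = 0
for i in range(0, len(data), 37):  # odd-sized chunks on purpose
    crc = zlib.crc32(data[i:i + 37], crc)
assert crc == zlib.crc32(data)
print("CRC implementation looks consistent")
```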

DaveCTurner commented 2 years ago

Also also please could you start again from an empty data path, reproduce the problem, then make a copy of the whole data path, zip it up and share it here?

mailme-gx commented 2 years ago

Also, what exactly is your platform as reported by uname -a? This doesn't seem to reproduce on my box:

$ uname -a
Linux david-turner 5.4.0-107-generic #121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

However the latest zlib does have some platform-specific changes to how CRCs are calculated which might explain this.

Hi Dave, here is the platform info

# uname -a
Linux gxdev1 5.17.1-arch1-1 #1 SMP PREEMPT Mon, 28 Mar 2022 20:55:33 +0000 x86_64 GNU/Linux

# java -version
openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment (build 11.0.15+3)
OpenJDK 64-Bit Server VM (build 11.0.15+3, mixed mode)

This was tested with 8.1.0 and a clean datapath config options are

xpack.security.enabled: false
ingest.geoip.downloader.enabled: false
discovery.type: single-node
path.data: /mq_cluster/data/elasticsearch
network.bind_host: ["_local_", "_site_"]
DaveCTurner commented 2 years ago

openjdk version "11.0.15" 2022-04-19

Are you using this JDK or are you using the bundled one? The node will log a message including JVM home during startup. What exactly does this message say?

Could you remove the contents of /mq_cluster/data/elasticsearch, reproduce the problem again, and then zip up /mq_cluster/data/elasticsearch and share it here please?

mailme-gx commented 2 years ago

Hi Dave, my apologies; ES is using Java 17, Java 11 is the default for other apps

# archlinux-java status
Available Java environments:
  java-11-openjdk (default)
  java-17-openjdk

ES-8.1-data.tar.gz

mailme-gx commented 2 years ago

also here is a clean full log file

elasticsearch.log

DaveCTurner commented 2 years ago

Are you sure these files correspond to the same failure? The log reports problems loading _9d.fdt but there is no _9d.fdt in the state path:

Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=2e44be6 actual=dc5f6b05 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mq_cluster/data/elasticsearch/_state/_9d.fdt")))
$ TZ=Etc/UTC tar tvf ES-8.1-data.tar.gz  | grep 'h/_state'
drwxr-xr-x elasticsearch/elasticsearch   0 2022-04-05 05:06 elasticsearch/_state/
-rw-r--r-- elasticsearch/elasticsearch   0 2022-04-05 04:54 elasticsearch/_state/write.lock
-rw-r--r-- elasticsearch/elasticsearch 109 2022-04-05 04:54 elasticsearch/_state/manifest-0.st
-rw-r--r-- elasticsearch/elasticsearch 115 2022-04-05 04:54 elasticsearch/_state/node-0.st
-rw-r--r-- elasticsearch/elasticsearch 25008 2022-04-05 04:55 elasticsearch/_state/_92.cfs
-rw-r--r-- elasticsearch/elasticsearch   278 2022-04-05 04:55 elasticsearch/_state/_92.cfe
-rw-r--r-- elasticsearch/elasticsearch   359 2022-04-05 04:55 elasticsearch/_state/_92.si
-rw-r--r-- elasticsearch/elasticsearch  2018 2022-04-05 05:05 elasticsearch/_state/_98.cfs
-rw-r--r-- elasticsearch/elasticsearch   278 2022-04-05 05:05 elasticsearch/_state/_98.cfe
-rw-r--r-- elasticsearch/elasticsearch   359 2022-04-05 05:05 elasticsearch/_state/_98.si
-rw-r--r-- elasticsearch/elasticsearch  2130 2022-04-05 05:05 elasticsearch/_state/_9b.cfs
-rw-r--r-- elasticsearch/elasticsearch   278 2022-04-05 05:05 elasticsearch/_state/_9b.cfe
-rw-r--r-- elasticsearch/elasticsearch   359 2022-04-05 05:05 elasticsearch/_state/_9b.si
-rw-r--r-- elasticsearch/elasticsearch    64 2022-04-05 05:06 elasticsearch/_state/_9c.fdx
-rw-r--r-- elasticsearch/elasticsearch   267 2022-04-05 05:06 elasticsearch/_state/_9c_Lucene90_0.tmd
-rw-r--r-- elasticsearch/elasticsearch   319 2022-04-05 05:06 elasticsearch/_state/_9c.fnm
-rw-r--r-- elasticsearch/elasticsearch 26296 2022-04-05 05:06 elasticsearch/_state/_9c.cfs
-rw-r--r-- elasticsearch/elasticsearch   156 2022-04-05 05:06 elasticsearch/_state/_9c.cfe
-rw-r--r-- elasticsearch/elasticsearch   445 2022-04-05 05:05 elasticsearch/_state/segments_4x
-rw-r--r-- elasticsearch/elasticsearch   157 2022-04-05 05:06 elasticsearch/_state/_9c.fdm
-rw-r--r-- elasticsearch/elasticsearch 25688 2022-04-05 05:06 elasticsearch/_state/_9c.fdt
-rw-r--r-- elasticsearch/elasticsearch    79 2022-04-05 05:06 elasticsearch/_state/_9c_Lucene90_0.doc
-rw-r--r-- elasticsearch/elasticsearch   148 2022-04-05 05:06 elasticsearch/_state/_9c_Lucene90_0.tim
-rw-r--r-- elasticsearch/elasticsearch    73 2022-04-05 05:06 elasticsearch/_state/_9c_Lucene90_0.tip

Also the dates on those files (in UTC) look too old to be a fresh reproduction. Could you try again?

mailme-gx commented 2 years ago

rm -rf /mq_cluster/data/elasticsearch/* && rm /usr/share/elasticsearch/logs/* && systemctl start elasticsearch.service

elasticsearch.log ES-8.1-data.tar.gz

DaveCTurner commented 2 years ago

Frustratingly, it looks like Lucene is deleting the files it claims are corrupt, so there's nothing useful here. However, it is interesting that it fails so frequently for you (I count 80 failures in the 3 minutes of logs you shared). On my system I see no such problems with zlib-1.2.12; it all works just fine.

The previous failure looks to be due to a checksum failure in _9c.fdt; we can tell because most of this file was copied into _9c.cfs, but the copy stopped before writing the checksum. The checksum in this file is correct, and on my machine both versions of zlib return the correct checksum under all sorts of different read patterns.

Could you try reproducing this on a different physical machine? That would help rule out a hardware fault (a bad CPU, perhaps). If this were a software problem I'd expect many similar reports, although zlib-1.2.12 is only about a week old, so you might be the first person to hit it.
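One way to probe for the kind of nondeterministic fault being suspected here is to checksum the same fixed buffer many times: on a healthy CPU/memory/zlib combination every pass must return the identical value, so any disagreement points at the environment rather than at Elasticsearch. A hypothetical harness (not something from this thread):

```python
import os
import zlib

def crc_stress(rounds: int = 1000, size: int = 1 << 16) -> bool:
    """Recompute the CRC-32 of a fixed random buffer many times;
    any disagreement hints at a CPU, memory, or zlib fault."""
    buf = os.urandom(size)
    reference = zlib.crc32(buf)
    return all(zlib.crc32(buf) == reference for _ in range(rounds))

print("stable" if crc_stress() else "UNSTABLE: possible hardware fault")
```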

mailme-gx commented 2 years ago

Hi Dave, this is on an XFS file system; I also tried ext4 just in case, but that has the same result.

This is not a physical machine but a VM running on Proxmox.

# cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model       : 6
model name  : Common KVM processor
stepping    : 1
microcode   : 0x1
cpu MHz     : 3292.376
cache size  : 16384 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips    : 6587.11
clflush size    : 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 15
model       : 6
model name  : Common KVM processor
stepping    : 1
microcode   : 0x1
cpu MHz     : 3292.376
cache size  : 16384 KB
physical id : 0
siblings    : 4
core id     : 1
cpu cores   : 4
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips    : 6587.11
clflush size    : 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 15
model       : 6
model name  : Common KVM processor
stepping    : 1
microcode   : 0x1
cpu MHz     : 3292.376
cache size  : 16384 KB
physical id : 0
siblings    : 4
core id     : 2
cpu cores   : 4
apicid      : 2
initial apicid  : 2
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips    : 6587.11
clflush size    : 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 15
model       : 6
model name  : Common KVM processor
stepping    : 1
microcode   : 0x1
cpu MHz     : 3292.376
cache size  : 16384 KB
physical id : 0
siblings    : 4
core id     : 3
cpu cores   : 4
apicid      : 3
initial apicid  : 3
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips    : 6587.11
clflush size    : 64
cache_alignment : 128
address sizes   : 40 bits physical, 48 bits virtual
power management:

I will try again on physical hardware and post back

mailme-gx commented 2 years ago

Here is the log on physical hardware, and the CPU info. I can't upload the data dir (maybe it's too big? 40 MB)

# cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3517UE CPU @ 1.70GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 2800.000
cache size  : 4096 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
vmx flags   : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips    : 4391.85
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3517UE CPU @ 1.70GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 1696.120
cache size  : 4096 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 2
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
vmx flags   : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips    : 4391.85
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3517UE CPU @ 1.70GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 2200.000
cache size  : 4096 KB
physical id : 0
siblings    : 4
core id     : 1
cpu cores   : 2
apicid      : 2
initial apicid  : 2
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
vmx flags   : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips    : 4391.85
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3517UE CPU @ 1.70GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 1696.118
cache size  : 4096 KB
physical id : 0
siblings    : 4
core id     : 1
cpu cores   : 2
apicid      : 3
initial apicid  : 3
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
vmx flags   : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips    : 4391.85
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

elasticsearch.log

DaveCTurner commented 2 years ago

The log from this test contains no errors which strongly suggests there's something wrong with that specific system rather than a problem in Elasticsearch.

mailme-gx commented 2 years ago

Yes, I believe you are right. I will look at the CPU flags available in Proxmox; also, a new version (8.1.1) is available in the package repo, so I will try that too.

mailme-gx commented 2 years ago

Frustratingly, it looks like Lucene is deleting the files it claims are corrupt, so there's nothing useful here. However, it is interesting that it fails so frequently for you (I count 80 failures in the 3 minutes of logs you shared). On my system I see no such problems with zlib-1.2.12; it all works just fine.

The previous failure looks to be due to a checksum failure in _9c.fdt; we can tell because most of this file was copied into _9c.cfs, but the copy stopped before writing the checksum. The checksum in this file is correct, and on my machine both versions of zlib return the correct checksum under all sorts of different read patterns.

Could you try reproducing this on a different physical machine? That would help rule out a hardware fault (a bad CPU, perhaps). If this were a software problem I'd expect many similar reports, although zlib-1.2.12 is only about a week old, so you might be the first person to hit it.

Dave, here are my findings

- upgrade (from 8.1.0) to 8.1.1: same result
- in Proxmox, disable all CPU flags / enable all CPU flags (flags that can be enabled): same result
- in Proxmox, enable NUMA: same result
- in Proxmox, change CPU (from kvm64) to qemu64: same result
- finally, in Proxmox, change CPU (from kvm64) to Haswell: issue resolved!
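The kvm64 model advertises a very small feature set compared to Haswell. Comparing the flags lines from the cpuinfo dumps earlier in this thread makes the gap visible; whether any particular missing instruction is what zlib 1.2.12 trips over is speculation, the point is only how much older a feature set kvm64 exposes. A sketch using the kvm64 flags string quoted above:

```python
# Flags line reported by the kvm64 guest in this thread (copied from the
# /proc/cpuinfo dump above).
kvm64_flags = ("fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca "
               "cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm "
               "constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 "
               "x2apic hypervisor lahf_lm cpuid_fault pti").split()

# SIMD / carry-less-multiply features that optimized CRC and compression
# code paths can use; all appear on the physical i7 but not on kvm64.
for flag in ["ssse3", "sse4_1", "sse4_2", "pclmulqdq", "avx"]:
    print(f"{flag}: {'present' if flag in kvm64_flags else 'missing'}")
```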

Thanks for your help

DaveCTurner commented 2 years ago

I am closing this issue as its cause was environmental, so there is no action for the Elasticsearch team to take.