Closed: mailme-gx closed this issue 2 years ago.
7.1.2 is over a year past EOL and Arch isn't one of the supported Linux distributions. Can you reproduce this in a supported config?
Pinging @elastic/es-core-infra (Team:Core/Infra)
Also, what exactly is your platform as reported by uname -a? This doesn't seem to reproduce on my box:
$ uname -a
Linux david-turner 5.4.0-107-generic #121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
However, the latest zlib does have some platform-specific changes to how CRCs are calculated, which might explain this.
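One quick sanity check of a CRC implementation is the standard CRC-32 check value: every correct implementation must map the bytes "123456789" to 0xCBF43926, regardless of platform. A minimal Python sketch (caveat: CPython's zlib module may be linked against a bundled libz rather than the system one, depending on how Python was built, so a pass here does not fully clear the system library):

```python
import zlib

# Standard CRC-32 check value: crc32(b"123456789") must equal 0xCBF43926
# on every correct implementation, on every platform.
data = b"123456789"
crc = zlib.crc32(data) & 0xFFFFFFFF
assert crc == 0xCBF43926, f"CRC implementation returned {crc:#010x}"
print(f"crc32 OK: {crc:#010x}")
```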
Also also please could you start again from an empty data path, reproduce the problem, then make a copy of the whole data path, zip it up and share it here?
Hi Dave, here is the platform info
# uname -a
Linux gxdev1 5.17.1-arch1-1 #1 SMP PREEMPT Mon, 28 Mar 2022 20:55:33 +0000 x86_64 GNU/Linux
# java -version
openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment (build 11.0.15+3)
OpenJDK 64-Bit Server VM (build 11.0.15+3, mixed mode)
This was tested with 8.1.0 and a clean data path. The config options are:
xpack.security.enabled: false
ingest.geoip.downloader.enabled: false
discovery.type: single-node
path.data: /mq_cluster/data/elasticsearch
network.bind_host: ["_local_", "_site_"]
openjdk version "11.0.15" 2022-04-19
Are you using this JDK or are you using the bundled one? The node will log a message including the JVM home during startup. What exactly does this message say?
Could you remove the contents of /mq_cluster/data/elasticsearch, reproduce the problem again, and then zip up /mq_cluster/data/elasticsearch and share it here please?
Hi Dave, my apologies, ES is using Java 17; Java 11 is the default for other apps.
# archlinux-java status
Available Java environments:
java-11-openjdk (default)
java-17-openjdk
Also, here is a clean full log file.
Are you sure these files correspond to the same failure? The log reports problems loading _9d.fdt, but there is no _9d.fdt in the state path:
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=2e44be6 actual=dc5f6b05 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mq_cluster/data/elasticsearch/_state/_9d.fdt")))
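For context, Lucene files end in a codec footer containing a CRC-32 of everything written before it, and the exception above means the value recomputed at read time disagreed with the stored one. A simplified sketch of the idea follows; the real Lucene footer stores the checksum as a 64-bit value and also carries a magic number and algorithm ID, so this hypothetical layout keeps only the checksum:

```python
import struct
import zlib

def verify_crc_footer(payload: bytes) -> bool:
    """Check a blob whose last 4 bytes are a big-endian CRC-32 of the rest.

    Hypothetical simplified layout, not Lucene's actual codec footer.
    """
    body, stored = payload[:-4], struct.unpack(">I", payload[-4:])[0]
    return (zlib.crc32(body) & 0xFFFFFFFF) == stored

# Build a valid blob, then corrupt one byte to trigger a mismatch,
# which is the situation the CorruptIndexException reports.
body = b"some index data"
blob = body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)
print(verify_crc_footer(blob))             # True
print(verify_crc_footer(b"x" + blob[1:]))  # False: corrupted body
```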
$ TZ=Etc/UTC tar tvf ES-8.1-data.tar.gz | grep 'h/_state'
drwxr-xr-x elasticsearch/elasticsearch 0 2022-04-05 05:06 elasticsearch/_state/
-rw-r--r-- elasticsearch/elasticsearch 0 2022-04-05 04:54 elasticsearch/_state/write.lock
-rw-r--r-- elasticsearch/elasticsearch 109 2022-04-05 04:54 elasticsearch/_state/manifest-0.st
-rw-r--r-- elasticsearch/elasticsearch 115 2022-04-05 04:54 elasticsearch/_state/node-0.st
-rw-r--r-- elasticsearch/elasticsearch 25008 2022-04-05 04:55 elasticsearch/_state/_92.cfs
-rw-r--r-- elasticsearch/elasticsearch 278 2022-04-05 04:55 elasticsearch/_state/_92.cfe
-rw-r--r-- elasticsearch/elasticsearch 359 2022-04-05 04:55 elasticsearch/_state/_92.si
-rw-r--r-- elasticsearch/elasticsearch 2018 2022-04-05 05:05 elasticsearch/_state/_98.cfs
-rw-r--r-- elasticsearch/elasticsearch 278 2022-04-05 05:05 elasticsearch/_state/_98.cfe
-rw-r--r-- elasticsearch/elasticsearch 359 2022-04-05 05:05 elasticsearch/_state/_98.si
-rw-r--r-- elasticsearch/elasticsearch 2130 2022-04-05 05:05 elasticsearch/_state/_9b.cfs
-rw-r--r-- elasticsearch/elasticsearch 278 2022-04-05 05:05 elasticsearch/_state/_9b.cfe
-rw-r--r-- elasticsearch/elasticsearch 359 2022-04-05 05:05 elasticsearch/_state/_9b.si
-rw-r--r-- elasticsearch/elasticsearch 64 2022-04-05 05:06 elasticsearch/_state/_9c.fdx
-rw-r--r-- elasticsearch/elasticsearch 267 2022-04-05 05:06 elasticsearch/_state/_9c_Lucene90_0.tmd
-rw-r--r-- elasticsearch/elasticsearch 319 2022-04-05 05:06 elasticsearch/_state/_9c.fnm
-rw-r--r-- elasticsearch/elasticsearch 26296 2022-04-05 05:06 elasticsearch/_state/_9c.cfs
-rw-r--r-- elasticsearch/elasticsearch 156 2022-04-05 05:06 elasticsearch/_state/_9c.cfe
-rw-r--r-- elasticsearch/elasticsearch 445 2022-04-05 05:05 elasticsearch/_state/segments_4x
-rw-r--r-- elasticsearch/elasticsearch 157 2022-04-05 05:06 elasticsearch/_state/_9c.fdm
-rw-r--r-- elasticsearch/elasticsearch 25688 2022-04-05 05:06 elasticsearch/_state/_9c.fdt
-rw-r--r-- elasticsearch/elasticsearch 79 2022-04-05 05:06 elasticsearch/_state/_9c_Lucene90_0.doc
-rw-r--r-- elasticsearch/elasticsearch 148 2022-04-05 05:06 elasticsearch/_state/_9c_Lucene90_0.tim
-rw-r--r-- elasticsearch/elasticsearch 73 2022-04-05 05:06 elasticsearch/_state/_9c_Lucene90_0.tip
Also the dates on those files (in UTC) look too old to be a fresh reproduction. Could you try again?
rm -rf /mq_cluster/data/elasticsearch/* && rm /usr/share/elasticsearch/logs/* && systemctl start elasticsearch.service
Frustratingly, it looks like Lucene is deleting the files it claims to be corrupt, so there's nothing useful here. However, it is interesting that it fails so frequently for you (I count 80 failures in the 3 minutes of logs you shared). On my system I see no such problems with zlib-1.2.12; it all works just fine.
The previous failure looks to be due to a checksum failure in _9c.fdt; at least, we copied most of this file into _9c.cfs but stopped before writing the checksum. The checksum in this file is correct, and on my machine both versions of zlib return the correct checksum under all sorts of different read patterns.
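CRC-32 composes incrementally, so one way to probe for a read-pattern-dependent bug of the kind suspected here is to verify that chunked computation matches a one-shot pass for several chunk sizes. A small sketch:

```python
import zlib

data = bytes(range(256)) * 200  # 51,200 bytes of test input

one_shot = zlib.crc32(data) & 0xFFFFFFFF

# Recompute with several chunk sizes; a correct zlib gives the same
# answer no matter how the input is split across crc32() calls.
for chunk in (1, 7, 512, 4096):
    crc = 0
    for i in range(0, len(data), chunk):
        crc = zlib.crc32(data[i:i + chunk], crc)
    assert (crc & 0xFFFFFFFF) == one_shot, f"mismatch at chunk size {chunk}"
print("all read patterns agree:", hex(one_shot))
```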
Could you try reproducing this on a different physical machine? It would be useful to do that to rule out some hardware fault (bad CPU, perhaps). If this is a software problem then I'd expect many similar reports, although I note that zlib-1.2.12 is only about a week old, so you might be the first person to hit it.
Hi Dave, this is on an XFS file system; I also tried ext4 just in case, but that has the same result. This is not a physical machine but a VM running on Proxmox.
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Common KVM processor
stepping : 1
microcode : 0x1
cpu MHz : 3292.376
cache size : 16384 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 6587.11
clflush size : 64
cache_alignment : 128
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Common KVM processor
stepping : 1
microcode : 0x1
cpu MHz : 3292.376
cache size : 16384 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 6587.11
clflush size : 64
cache_alignment : 128
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Common KVM processor
stepping : 1
microcode : 0x1
cpu MHz : 3292.376
cache size : 16384 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 6587.11
clflush size : 64
cache_alignment : 128
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Common KVM processor
stepping : 1
microcode : 0x1
cpu MHz : 3292.376
cache size : 16384 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 6587.11
clflush size : 64
cache_alignment : 128
address sizes : 40 bits physical, 48 bits virtual
power management:
I will try again on physical hardware and post back
Here is the log on physical hardware, plus the CPU info. I can't upload the data dir (maybe it's too big? 40 MB).
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3517UE CPU @ 1.70GHz
stepping : 9
microcode : 0x17
cpu MHz : 2800.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips : 4391.85
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3517UE CPU @ 1.70GHz
stepping : 9
microcode : 0x17
cpu MHz : 1696.120
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips : 4391.85
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3517UE CPU @ 1.70GHz
stepping : 9
microcode : 0x17
cpu MHz : 2200.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips : 4391.85
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3517UE CPU @ 1.70GHz
stepping : 9
microcode : 0x17
cpu MHz : 1696.118
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips : 4391.85
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
The log from this test contains no errors which strongly suggests there's something wrong with that specific system rather than a problem in Elasticsearch.
Yes, I believe you are right. I will look at the CPU flags available in Proxmox. Also, a new version (8.1.1) is available in the package repo, so I will try that too.
Dave, here are my findings:
- upgrade (from 8.1.0) to 8.1.1: same result
- in Proxmox, disable all CPU flags / enable all CPU flags (those that can be enabled): same result
- in Proxmox, enable NUMA: same result
- in Proxmox, change the CPU type from kvm64 to qemu64: same result
- finally, in Proxmox, change the CPU type from kvm64 to Haswell: issue resolved!
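That outcome fits the CPU-model angle: the kvm64 cpuinfo above lacks flags such as sse4_2 and pclmulqdq, which hardware-accelerated CRC code paths commonly dispatch on, while the Haswell model (and the physical i7) exposes them. Whether the Arch zlib build actually selects those instructions is an assumption, not something confirmed here, but the flag difference itself is easy to check with a small parser:

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Extract the first 'flags' line from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

# Abbreviated flags as reported by the Proxmox kvm64 guest above:
# note there is no sse4_2 and no pclmulqdq in the full line either.
kvm64 = "flags\t\t: fpu vme de pse tsc msr pae sse sse2 ht cx16 x2apic"
print("pclmulqdq" in cpu_flags(kvm64))  # False

# On a live Linux box, compare the real flag set:
# print(cpu_flags(open("/proc/cpuinfo").read()) & {"sse4_2", "pclmulqdq", "avx2"})
```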
Thanks for your help
I am closing this issue as its cause was environmental, so there's no action for the Elasticsearch team to take.
Running ES 7.1.2 on Arch Linux, the service did not start after zlib was upgraded from 1.2.11 to 1.2.12.
Taking this opportunity to upgrade to the latest Elasticsearch, I installed ES 8.1.0 single-node with no existing data and got the same issue; after downgrading zlib, both versions of ES work fine.
Sample stack trace: