Open rkoosaar opened 19 hours ago
The logs you provided only show request failures caused by an earlier corruption alarm. Can you provide the logs from the time the corruption happened?
This is what the log bundle had from the time I generated it. I'm not sure what happened then or why it starts from that time. Maybe it got overwritten, since it's constantly saying the cluster is corrupted. Do you think any other log file from the bundle might help? Or me uploading the whole bundle?
> This is what the log bundle had from the time I generated it.
Could you provide the complete log of all etcd instances?
> or me uploading the whole bundle?
Yes, please. Is it possible to upload all the db files (under ${data_dir}/member/snap/db), if they don't contain any sensitive data?
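On Talos, one possible way to collect those db files is `talosctl cp`, sketched below. This assumes the default Talos etcd data directory `/var/lib/etcd`. The function name `collect_etcd_db`, the `etcd-db/` output directory, the `TALOSCTL` override variable and the node IPs are all illustrative, not anything from the bundle.

```shell
# Sketch: copy the etcd db file from each control-plane node into ./etcd-db/<node>/.
# Assumption: etcd's data dir on Talos is /var/lib/etcd (the default layout).
# TALOSCTL can be overridden, e.g. TALOSCTL="echo talosctl" to only print the commands.
collect_etcd_db() {
  for node in "$@"; do
    # talosctl cp extracts the copied path into a local directory
    mkdir -p "etcd-db/$node"
    ${TALOSCTL:-talosctl} -n "$node" cp /var/lib/etcd/member/snap/db "etcd-db/$node/"
  done
}

# Example (placeholder IPs): collect_etcd_db 10.0.0.1 10.0.0.2 10.0.0.3
```

Note that the db file holds the full cluster state, including Kubernetes Secrets, so it is worth checking for sensitive data before uploading it.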
Bug report criteria
What happened?
Hi folks, I have a really odd issue that I'm troubleshooting. I have a 3-node Talos (1.8.3) cluster at home where etcd (3.5.16) keeps getting corrupted after a while. Initially I thought it could be a disk-related issue, so I bought brand new disks and swapped them in. I installed a new cluster last night (around 8pm), and when I woke up this morning (8am) the cluster was not working and etcd was reporting that the cluster was corrupted. Looking at the logs, it seems something happened around 6am, but I'm unable to work out what the cause is. I have redeployed the cluster 4 times in the past week, and every time etcd has ended up corrupted. Any help or guidance to troubleshoot this would be much appreciated.
What did you expect to happen?
The cluster not to get corrupted.
How can we reproduce it (as minimally and precisely as possible)?
I'm not 100% sure how this can be reproduced in your environment, as I don't fully understand why it happens.
Anything else we need to know?
I have actually saved log bundles from all 3 cluster nodes using
talosctl -n node_ip support
I'm just not sure which log files would be helpful. If you could advise which logs are needed, I can provide them. The log bundle has the folders kubernetes-logs and service-logs (the etcd.log file is here; I pasted it in the relevant log section) and, separately, the log files controller-runtime.log and dmesg.log.
Etcd version (please run commands below)
Here is the output of the EtcdConfigs.etcd.talos.dev file from node1
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
I'm not 100% sure how I can run the commands below on Talos.
Relevant log output