psheorangithub opened this issue 1 year ago
Are you deploying Kubeflow with kubeflow-manifests? Have you checked the disk usage of the volume? The default volume is 10Gi; without further information I can only guess that after 10-15 days the disk is full.
Yes, I have deployed using the Kubeflow manifests, specifically https://github.com/kubeflow/manifests/tree/v1.6.0. Yes, I did check the disk usage already. The PVC used by authservice is 10G, and the only file in it is "data.db", which was only 5 MB at the time of the issue. Overall volume usage also looks good.
We are experiencing the same issue in our environment as well. The "Failed to save state in store: input/output error" error keeps showing up for the authservice pod, even though all other components seem to be running fine. After launching the Kubeflow environment and adding five users, we encountered a recurring 403 error on the Kubeflow Dex login page, even when no users were logged in.
Environment:

Pod Information:

The authservice-0 pod in the istio-system namespace shows no anomalies in resource usage; CPU and memory consumption appear normal.

# kubectl top pod authservice-0 -n istio-system
NAME            CPU(cores)   MEMORY(bytes)
authservice-0   1m           3Mi

The data.db file appears to be intact:

# ls -lh /export/kubernetes/istio-system-authservice-pvc-pvc-3e8dd897-4478-40c5-a007-e1d1aa55f734
total 24K
-rw-r--r-- 1 systemd-network tss 32K Jul 24 05:25 data.db

Issue Details:
Error Logs:
# kubectl logs authservice-0 -n istio-system
time="2023-07-24T05:25:20Z" level=info msg="Starting readiness probe at 8081"
time="2023-07-24T05:25:20Z" level=info msg="No USERID_TOKEN_HEADER specified, using 'kubeflow-userid-token' as default."
time="2023-07-24T05:25:20Z" level=info msg="No SERVER_HOSTNAME specified, using '' as default."
time="2023-07-24T05:25:20Z" level=info msg="No SERVER_PORT specified, using '8080' as default."
time="2023-07-24T05:25:20Z" level=info msg="No SESSION_MAX_AGE specified, using '86400' as default."
time="2023-07-24T05:25:20Z" level=info msg="Starting web server at :8080"
time="2023-07-24T05:47:51Z" level=error msg="Failed to save state in store: error trying to save session: input/output error" ip=192.168.200.145 request=/
time="2023-07-24T05:48:29Z" level=error msg="Failed to save state in store: error trying to save session: input/output error" ip=192.168.200.145 request=/
time="2023-07-24T05:50:10Z" level=error msg="Failed to save state in store: error trying to save session: input/output error" ip=192.168.200.145 request=/
time="2023-07-24T05:50:25Z" level=error msg="Failed to save state in store: error trying to save session: input/output error" ip=192.168.200.145 request=/
time="2023-07-24T05:55:03Z" level=error msg="Failed to save state in store: error trying to save session: input/output error" ip=192.168.200.145 request=/
time="2023-07-24T05:55:04Z" level=error msg="Failed to save state in store: error trying to save session: input/output error" ip=192.168.200.145 request=/
Additional Context:
After setting the log level of oidc-authservice to DEBUG, I rechecked the logs when the error occurred again. I discovered that the error is related to boltstore/reaper, the component that periodically removes expired sessions to release unneeded resources, rather than to the BoltDB session-save path itself.
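For readers unfamiliar with the mechanism: boltstore keeps sessions in a single BoltDB file (data.db), and its reaper is a background loop that periodically deletes expired sessions through the same file handle the login path writes through. That would explain why the reaper error and the "Failed to save state in store" error show up together: once writes to data.db start failing with an I/O error, both paths keep failing until the process is restarted and the file is reopened. A minimal, stdlib-only Python sketch of that shared-handle failure mode (the class and method names are hypothetical stand-ins, not the real Go API):

```python
import time

# Hypothetical stand-in for the BoltDB-backed session store that
# oidc-authservice uses via boltstore. Names are illustrative only.
class SessionStore:
    def __init__(self):
        self.sessions = {}      # session id -> expiry (epoch seconds)
        self.io_broken = False  # simulates a bad handle on data.db (EIO)

    def save(self, session_id, ttl_seconds):
        # Login path: once writes to the backing file start failing,
        # every save fails until the process is restarted.
        if self.io_broken:
            raise OSError("error trying to save session: input/output error")
        self.sessions[session_id] = time.time() + ttl_seconds

    def reap(self, now=None):
        # Reaper path: a periodic sweep that deletes expired sessions
        # through the same DB handle the login path writes to.
        if self.io_broken:
            raise OSError("boltstore: remove expired sessions error: "
                          "input/output error")
        now = time.time() if now is None else now
        self.sessions = {k: v for k, v in self.sessions.items() if v > now}

store = SessionStore()
store.save("user-a", 3600)
store.io_broken = True  # the shared handle goes bad; nothing in-process recovers it
for op in (store.reap, lambda: store.save("user-b", 3600)):
    try:
        op()
    except OSError as e:
        print(e)
```

The sketch prints both error strings from the thread's logs in sequence, which matches the observed symptom that the reaper error and the save error recur together and only a pod restart clears them.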
2023/08/01 03:22:10 boltstore: remove expired sessions error: input/output error
time="2023-08-01T03:22:57Z" level=warning msg="Request doesn't have a valid session." ip=192.168.200.15 request=/logout
time="2023-08-01T03:22:57Z" level=error msg="Failed to save state in store: error trying to save session: input/output error" ip=192.168.200.15 request=/
We have integrated Kubeflow with an OIDC flow (Heracles + LDAP). We are unable to log in to the Kubeflow UI; the GUI throws the error below.
Access to kubeflow.aiwb-enc-data-cpu1.uscentral-prd-az3.k8s.int was denied. You don't have authorization to view this page. HTTP ERROR 403
While checking the authservice pod logs, I see the error below. It happens every couple of days.
2023/03/15 14:08:40 boltstore: remove expired sessions error: input/output error
time="2023-03-15T14:06:29Z" level=error msg="Failed to save state in store: error trying to save session: input/output error" ip= request=/
The issue resolves after restarting the authservice pod, but it reappears every 10-15 days. We have checked the underlying PVC status; it looks healthy.
Can someone look into it and suggest what could be the cause?
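Since a restart reliably clears the error (per the report above) but the root cause is still unknown, one possible stop-gap is to restart authservice on a schedule shorter than the observed 10-15 day recurrence. This is a mitigation sketch, not a fix, and everything here is an assumption on my part: the CronJob name, the schedule, and the `authservice-restarter` ServiceAccount (which would need a Role granting patch on statefulsets in istio-system, not shown) are all hypothetical.

```yaml
# Stop-gap only: periodically restart the authservice StatefulSet so the
# bad data.db file handle is reopened before the I/O errors recur.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: authservice-restart      # hypothetical name
  namespace: istio-system
spec:
  schedule: "0 4 */7 * *"        # every 7 days at 04:00; adjust to taste
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: authservice-restarter  # needs RBAC (not shown)
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - kubectl
            - -n
            - istio-system
            - rollout
            - restart
            - statefulset/authservice
```

This only papers over the symptom; finding why writes to data.db start returning EIO (storage driver, NFS backend, or a BoltDB file-handle issue) is still the open question.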