fluid-cloudnative / fluid

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)
https://fluid-cloudnative.github.io/
Apache License 2.0

Pod stuck in "ContainerCreating" status when using Fluid+JuiceFS in a single-node k8s environment #3766

Closed · whygyc closed this issue 8 months ago

whygyc commented 8 months ago

What is your environment (Kubernetes version, Fluid version, etc.)?

# helm list
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART           APP VERSION
fluid           default         1               2024-03-14 11:18:34.324181662 +0800 CST deployed        fluid-0.9.3     0.9.3-e0184cf
jfsdemo-dataset default         1               2024-03-19 01:15:46.227497126 +0800 CST deployed        juicefs-0.2.16  v1.0.0

I have been following this tutorial to use Fluid+JuiceFS in a single-node k8s environment: https://github.com/fluid-cloudnative/fluid/blob/master/docs/zh/samples/juicefs/juicefs_runtime.md. I completed the earlier steps successfully, but in the final step, when creating the application pod, it stays stuck in the ContainerCreating state.

# kubectl get po -A
NAMESPACE      NAME                                         READY   STATUS              RESTARTS         AGE
default        demo-app                                     0/1     ContainerCreating   0                13m

# kubectl describe pod demo-app | tail -n 5
Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    47s                default-scheduler  Successfully assigned default/demo-app to 10.0.2.15
  Warning  FailedMount  15s (x7 over 47s)  kubelet            MountVolume.MountDevice failed for volume "default-jfsdemo-dataset" : rpc error: code = Unknown desc = NodeStageVolume: can't get node 10.0.2.15: Get "https://127.0.0.1:6443/api/v1/nodes/10.0.2.15": dial tcp 127.0.0.1:6443: connect: connection refused

# kubectl get dataset -A
NAMESPACE   NAME              UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
default     jfsdemo-dataset   4.00KiB                   4.00GiB                              Bound   39h
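For reference, once a dataset is Bound, Fluid provisions a PV/PVC pair that the application pod mounts; the PV name default-jfsdemo-dataset seen in the mount error above follows the <namespace>-<dataset> pattern. A quick sanity check (a sketch, output omitted):

# kubectl get pv default-jfsdemo-dataset
# kubectl get pvc jfsdemo-dataset -n default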

Here is my pod.yaml:

# cat sample-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: nginx:latest
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - mountPath: /data
          name: demo
  volumes:
    - name: demo
      persistentVolumeClaim:
        claimName: jfsdemo-dataset

k8s node information:

# kubectl get node -o wide
NAME        STATUS   ROLES    AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.0.2.15   Ready    master   463d   v1.25.3   10.0.2.15     <none>        CentOS Linux 7 (Core)   5.4.228-1.el7.elrepo.x86_64   containerd://1.6.8

If I remove the volumes section from sample-pod.yaml, the pod is created normally, so I am not sure whether this is related to Fluid. If there is anything I can test, please let me know. I can confirm that the API server port is correct, since the following request succeeds from the host: wget --header "Authorization: Bearer <token>" https://127.0.0.1:6443/api/v1/nodes/10.0.2.15

Any suggestions would be greatly appreciated. Thank you.

whygyc commented 8 months ago

The error Get "https://127.0.0.1:6443/api/v1/nodes/10.0.2.15": dial tcp 127.0.0.1:6443: connect: connection refused typically means the Kubernetes API server at that address is unreachable. However, everything in my environment appears to be working: kubectl commands succeed, and pods that do not mount the dataset are created without problems.
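Since the failure is reported from MountVolume.MountDevice, the request to 127.0.0.1:6443 is most likely issued by Fluid's CSI node plugin rather than by kubelet itself. One way to confirm (a sketch; the pod name is a placeholder, pick the csi-nodeplugin pod on the affected node):

# kubectl get pod -n fluid-system -o wide | grep csi-nodeplugin
# kubectl logs -n fluid-system csi-nodeplugin-fluid-xxxxx -c plugins | tail -n 20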

TrafalgarZZZ commented 8 months ago

@whygyc it seems to be the same problem as #3417.

You can find more information and a solution in my comment here: https://github.com/fluid-cloudnative/fluid/issues/3417#issuecomment-1691532950

whygyc commented 8 months ago

Thank you for your response. After investigating over the past few days, I found that the issue lies with csi-nodeplugin-fluid (the same error shows up in its logs: kubectl logs -n fluid-system csi-nodeplugin-fluid-wr74l -c plugins). Inside the plugins container, 127.0.0.1 cannot be used to reach the host, because it refers to the container's own network namespace rather than the host's.
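A quick way to confirm the network-namespace theory is to check whether the DaemonSet runs on the host network (a sketch):

# kubectl get daemonset csi-nodeplugin-fluid -n fluid-system -o jsonpath='{.spec.template.spec.hostNetwork}'

If this prints nothing or false, the plugins container has its own network namespace, so 127.0.0.1 there is not the host's loopback, and the API server listening on the host at 127.0.0.1:6443 is unreachable from inside it.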

To resolve this, modify DaemonSet/csi-nodeplugin-fluid: add hostNetwork: true and adjust the ports of the two listeners so they do not conflict with anything already on the host. After these changes, the plugin can reach 127.0.0.1 on the host, and the pod is created successfully.

# kubectl edit DaemonSet/csi-nodeplugin-fluid -n fluid-system
      ......    
        - --pprof-addr=:6061
        - --metrics-addr=:8081
      ......
      hostNetwork: true
      ......
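After saving, the DaemonSet pods are recreated; once they are back, kubelet retries the failed mount on its own, so the stuck pod should recover without being recreated. A rough way to verify (a sketch):

# kubectl rollout status daemonset/csi-nodeplugin-fluid -n fluid-system
# kubectl get pod demo-app -w

demo-app should move from ContainerCreating to Running shortly afterwards.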

Regarding hostNetwork: true for DaemonSet/csi-nodeplugin-fluid, I noticed that it appears to be configurable in the chart's values.yaml. However, I am not familiar with Helm, so I am not sure how to set it during the Helm installation. https://github.com/fluid-cloudnative/fluid/blob/05635698c0a0f8c3381a284240b56bcf0694f9d9/charts/fluid/fluid/values.yaml#L35

Thank you for the reminder. I had not noticed the helm upgrade fluid --set csi.config.hostNetwork=true fluid/fluid command mentioned in the documentation. You are correct, but the documentation does not seem to cover changing the listener ports, and in my environment enabling hostNetwork without changing them leads to port conflicts. That is also an issue.
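For anyone hitting the same problem, a hedged sketch of the Helm-side change: the --set flag below is the one from the docs, while the two listener ports do not seem to be covered there, so if they clash with something on the host they still have to be adjusted on the DaemonSet afterwards (as in the kubectl edit snippet above):

# helm upgrade fluid --set csi.config.hostNetwork=true fluid/fluid
# kubectl edit daemonset/csi-nodeplugin-fluid -n fluid-system   # adjust --pprof-addr / --metrics-addr if they conflict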