fluid-cloudnative / fluid

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)
https://fluid-cloudnative.github.io/
Apache License 2.0

[BUG] multiple datasets and multiple alluxioruntimes failed to work concurrently #4186

Closed. moting9 closed this 1 week ago

moting9 commented 1 week ago

What is your environment (Kubernetes version, Fluid version, etc.)

Kubernetes: v1.30.2
Fluid: v1.0.0
AlluxioRuntime: 2.9.0

Describe the bug

Dear experts, I am trying to use Fluid to deploy AIGC model files. I set up a MinIO server, created a "models" bucket, and put 3 models (embed/rerank/llm) in the bucket.

mc ls myminio/models
[2024-06-28 09:02:35 UTC]     0B models--BAAI--bge-base-en-v1.5/
[2024-06-28 09:02:35 UTC]     0B models--BAAI--bge-reranker-base/
[2024-06-28 09:02:35 UTC]     0B models--Intel--neural-chat-7b-v3-3/

I defined 3 datasets, each pointing to one model directory:

cat embed_ds.yaml

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: embed
spec:
  mounts:

cat rerank_ds.yaml

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: rerank
spec:
  mounts:

cat llm_ds.yaml

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: llm
spec:
  mounts:

Then I defined 3 runtimes to manage the 3 datasets, but only 1 runtime comes up fully; the workers of the other 2 stay Pending.

cat embed_rt.yaml

apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: embed
spec:
  replicas: 1
  tieredstore:
    levels:

cat llm_rt.yaml

apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: llm
spec:
  replicas: 1
  tieredstore:
    levels:

k get pods
NAME              READY   STATUS    RESTARTS   AGE
embed-master-0    2/2     Running   0          28m
embed-worker-0    2/2     Running   0          27m
llm-master-0      2/2     Running   0          28m
llm-worker-0      0/2     Pending   0          27m
rerank-master-0   2/2     Running   0          28m
rerank-worker-0   0/2     Pending   0          27m

k get dataset
NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE      AGE
embed    418.35MiB        0.00B    1.86GiB          0.0%                Bound      28m
llm      26.98GiB                                                       NotBound   28m
rerank   2.09GiB                                                        NotBound   28m

Fluid should be very helpful for deploying AIGC model files for inference, but I am not sure whether my usage above is correct. Is the relationship between dataset and runtime a one-to-one mapping? How can I make multiple datasets and runtimes work as expected? Is there any limit on the number of concurrent datasets and runtimes in a system? Thanks!

What you expect to happen:

Multiple datasets and runtimes should work concurrently and be mountable by different pods.

How to reproduce it

Please refer to the bug description above.

Additional Information

moting9 commented 1 week ago

Setting placement: "Shared" makes multiple datasets work. Closing this issue.
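
For anyone landing here with the same symptom, a minimal sketch of what that looks like, assuming the field goes under the Dataset's spec (my understanding is that the default placement is exclusive, so cache workers of different datasets will not be scheduled onto the same node, which would explain the Pending workers on a small cluster). The mount entry below is a placeholder and not the one from the original manifests:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: embed
spec:
  mounts:
    # Placeholder mount: substitute the real bucket path and any
    # endpoint/credential options your storage needs.
    - mountPoint: s3://<bucket>/<model-directory>/
      name: embed
  # Allow this dataset's cache workers to share a node with the
  # workers of other datasets.
  placement: "Shared"
```

The same placement line would presumably need to be added to the rerank and llm datasets as well, so that all three runtimes can schedule their workers on the shared nodes.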