fluid-cloudnative / fluid

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)
https://fluid-cloudnative.github.io/
Apache License 2.0

Performance not meet the expectation in Serverless usecase #1789

Closed lizzzcai closed 2 years ago

lizzzcai commented 2 years ago

Hi Fluid team, I am trying to follow the serverless demo here.

Instead of using data from the web, my dataset comes from S3: a 1.3 GB model, BERT-Large, Uncased (Whole Word Masking).

My k8s version is 1.21.10 with 3 nodes. I labelled two of them with tenantId: tenant1. Fluid version: v0.7.

Below are my Dataset and AlluxioRuntime:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: s3-data
spec:
  mounts:
    - mountPoint: "s3://<my-bucket>/folder"
      name: s3
      options:
        alluxio.underfs.s3.region: "eu-central-1"
        alluxio.underfs.s3.endpoint: "s3.amazonaws.com"
        alluxio.underfs.s3.inherit.acl: "false"
        alluxio.underfs.s3.socket.timeout: 500sec
        alluxio.underfs.s3.request.timeout: 5min
        alluxio.master.mount.table.root.readonly: "true"
      encryptOptions:
        - name: aws.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: aws.accessKeyId
        - name: aws.secretKey
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: aws.secretKey
  accessModes:
    - ReadOnlyMany
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: tenantId
              operator: In
              values:
                - "tenant1"
  placement: "Shared"
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: s3-data
spec:
  replicas: 2
  properties:
    alluxio.web.ui.enabled: "true"
    alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
    alluxio.user.block.size.bytes.default: 256MB
    alluxio.user.streaming.reader.chunk.size.bytes: 256MB
    alluxio.user.local.reader.chunk.size.bytes: 256MB
    alluxio.worker.network.reader.buffer.size: 256MB
    alluxio.user.streaming.data.timeout: 300sec
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi
        high: "0.95"
        low: "0.7"

Below is my deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  selector:
    matchLabels:
      app: hello
  replicas: 1
  template:
    metadata:
      labels:
        app: hello
        serverless.fluid.io/inject: "true"
    spec:
      nodeSelector:
        tenantId: tenant1
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - hello
            topologyKey: "kubernetes.io/hostname"
      containers:
        - name: hello
          image: lizzzcai/helloworld-go-fluid:v1
          ports:
            - name: http1
              containerPort: 8080
          env:
            - name: TARGET
              value: "World"
          volumeMounts:
            - mountPath: /data
              name: data
              readOnly: true
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: s3-data

After applying the dataset, the logs from the Alluxio master look fine:

2022-04-19 10:49:31,172 INFO  WebServer - Alluxio Master Web service started @ /0.0.0.0:25070
2022-04-19 10:49:31,182 INFO  AlluxioMasterProcess - Alluxio master version 2.7.2 started. bindAddress=0.0.0.0/0.0.0.0:22233, connectAddress=10.250.141.108:22233, webAddress=/0.0.0.0:25070
2022-04-19 10:49:31,957 INFO  ExtensionFactoryRegistry - Loading core jars from /opt/alluxio-2.7.2/lib
2022-04-19 10:49:31,995 INFO  ExtensionFactoryRegistry - Loading extension jars from /opt/alluxio-2.7.2/extensions
2022-04-19 10:49:33,035 INFO  MountTable - Mounting "s3://<my-bucket>/folder" at /s3
2022-04-19 10:49:40,602 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.250.68.13, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.68.13, rack=null)} id: 6264663071979105225
2022-04-19 10:49:40,697 INFO  JvmSpaceReviewer - 16391174984 bytes available on master. The register request with 0 blocks is estimated to need 0 bytes. 
2022-04-19 10:49:40,700 INFO  RegisterLeaseManager - Granted lease to worker 6264663071979105225
2022-04-19 10:49:41,029 INFO  RegisterStreamObserver - Register stream completed on the client side
2022-04-19 10:49:41,031 INFO  DefaultBlockMaster - Found 0 blocks to remove from the worker
2022-04-19 10:49:41,032 INFO  DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=6264663071979105225, workerAddress=WorkerNetAddress{host=10.250.68.13, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.68.13, rack=null)}, capacityBytes=4294967296, usedBytes=0, lastUpdatedTimeMs=1650365381032, blocks=HashSet{0 entries}, lostStorage={}}
2022-04-19 10:49:46,674 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)} id: 1711435128610813895
2022-04-19 10:49:46,955 INFO  JvmSpaceReviewer - 16418826112 bytes available on master. The register request with 0 blocks is estimated to need 0 bytes. 
2022-04-19 10:49:46,957 INFO  RegisterLeaseManager - Granted lease to worker 1711435128610813895
2022-04-19 10:49:47,365 INFO  RegisterStreamObserver - Register stream completed on the client side
2022-04-19 10:49:47,370 INFO  DefaultBlockMaster - Found 0 blocks to remove from the worker
2022-04-19 10:49:47,379 INFO  DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=1711435128610813895, workerAddress=WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}, capacityBytes=4294967296, usedBytes=0, lastUpdatedTimeMs=1650365387379, blocks=HashSet{0 entries}, lostStorage={}}

Then I deployed my Deployment. Below are the logs:

❯ k logs -n demo hello-5cdcc6fd66-l2dq4 -c hello
Begin loading models at 11:06:36

real    1m29.344s
user    0m0.001s
sys     0m2.346s
Finish loading models at 11:08:05
2022/04/19 11:08:05 helloworld: starting server...
2022/04/19 11:08:05 helloworld: listening on port 8080
❯ k logs -n demo hello-5cdcc6fd66-l2dq4 -c fluid-fuse
Exception in thread "main" java.lang.RuntimeException: Invalid property key ALLUXIO_CLIENT_HOSTNAME
        at alluxio.conf.InstancedConfiguration.lookupRecursively(InstancedConfiguration.java:454)
        at alluxio.conf.InstancedConfiguration.lookup(InstancedConfiguration.java:427)
        at alluxio.conf.InstancedConfiguration.isResolvable(InstancedConfiguration.java:170)
        at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:179)
        at alluxio.conf.AlluxioConfiguration.getOrDefault(AlluxioConfiguration.java:64)
        at alluxio.cli.GetConf.getConfImpl(GetConf.java:190)
        at alluxio.cli.GetConf.getConf(GetConf.java:147)
        at alluxio.cli.GetConf.main(GetConf.java:274)
umount: /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: not mounted
Starting AlluxioFuse process: mounting alluxio path "/" to local mount point "/var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse" with options="big_writes,kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty,allow_other"
OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N
2022-04-19 11:06:31,366 INFO  AlluxioFuse - Alluxio version: 2.7.2-0703ba35241e96f8405614dc20e7c7db09167abb
2022-04-19 11:06:31,898 INFO  MetricsSystem - Starting sinks with config: {}.
2022-04-19 11:06:31,908 INFO  MetricsHeartbeatContext - Created metrics heartbeat with ID app-6474937592179830575. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2022-04-19 11:06:32,187 INFO  NettyUtils - EPOLL_MODE is available
2022-04-19 11:06:32,911 INFO  TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.250.141.108, rack=null)
2022-04-19 11:06:32,940 INFO  NativeLibraryLoader - Loaded lib by jar from path /tmp/libjnifuse104465719671487453.so.
2022-04-19 11:06:33,416 INFO  Reflections - Reflections took 445 ms to scan 1 urls, producing 58 keys and 194 values 
2022-04-19 11:06:33,455 INFO  AlluxioFuse - Mounting AlluxioJniFuseFileSystem: mount point="/var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse", OPTIONS="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072]"
2022-04-19 11:06:33,456 INFO  AbstractFuseFileSystem - Mounting /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: blocking=true, debug=false, fuseOpts="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072]"
fuse: max_idle_threads: 64
fuse: max_idle_threads: 64
alluxio-fuse on /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse type fuse.alluxio-fuse (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,max_read=131072)
succeed in checking mount point /var/lib/docker/runtime-mnt/alluxio/demo/s3-data
2022-04-19 11:06:37,743 WARN  AlluxioJniFuseFileSystem - Fuse.Read(path=/s3/wwm_uncased_L-24_H-1024_A-16/bert_config.json,buf=java.nio.DirectByteBuffer[pos=314 lim=314 cap=4096],size=4096,offset=0) returned 314 in 1587 ms (>=1000 ms)
fuse: max_idle_threads: 64
fuse: max_idle_threads: 64
2022-04-19 11:07:05,633 WARN  AlluxioJniFuseFileSystem - Fuse.Read(path=/s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001,buf=java.nio.DirectByteBuffer[pos=131072 lim=131072 cap=131072],size=131072,offset=395444224) returned 131072 in 1099 ms (>=1000 ms)

Checking the dataset and runtime shows that the data was cached:

❯ kubectl get alluxio,dataset -n demo
NAME                                   MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
alluxioruntime.data.fluid.io/s3-data   Ready          Ready          Ready        28m

NAME                            UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/s3-data   1.25GiB          1.25GiB   8.00GiB          100.0%              Bound   28m
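
To put the first load time in perspective, the numbers above can be turned into a per-request budget. This is only a rough sketch: it assumes the whole 1.25 GiB is read once, sequentially, and that every FUSE read uses the full `max_read=131072` bytes seen in the mount options. The function name `fuse_read_budget` is made up for this illustration.

```python
GIB = 1024 ** 3

dataset_bytes = int(1.25 * GIB)  # UFS TOTAL SIZE as printed by kubectl (rounded value)
max_read = 131072                # max_read mount option from the fuse log
elapsed_s = 89.344               # "real 1m29.344s" from the first run


def fuse_read_budget(total_bytes, request_bytes, seconds):
    """Return (number of read requests, average milliseconds available per request)."""
    requests = -(-total_bytes // request_bytes)  # ceiling division
    return requests, seconds * 1000 / requests


requests, ms_per_request = fuse_read_budget(dataset_bytes, max_read, elapsed_s)
print(f"{requests} reads of {max_read} bytes, about {ms_per_request:.1f} ms each")
# prints: 10240 reads of 131072 bytes, about 8.7 ms each
```

Under these assumptions the load time is spread over roughly ten thousand small reads, so the average latency per read matters more than the two isolated reads that exceeded 1000 ms.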

I deleted the deployment and deployed it again:

❯ k logs -n demo hello-5cdcc6fd66-4l45p -c hello
Begin loading models at 11:22:11

real    1m7.119s
user    0m0.000s
sys     0m1.922s
Finish loading models at 11:23:18
2022/04/19 11:23:18 helloworld: starting server...
2022/04/19 11:23:18 helloworld: listening on port 8080
❯ k logs -n demo hello-5cdcc6fd66-4l45p -c fluid-fuse
Exception in thread "main" java.lang.RuntimeException: Invalid property key ALLUXIO_CLIENT_HOSTNAME
        at alluxio.conf.InstancedConfiguration.lookupRecursively(InstancedConfiguration.java:454)
        at alluxio.conf.InstancedConfiguration.lookup(InstancedConfiguration.java:427)
        at alluxio.conf.InstancedConfiguration.isResolvable(InstancedConfiguration.java:170)
        at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:179)
        at alluxio.conf.AlluxioConfiguration.getOrDefault(AlluxioConfiguration.java:64)
        at alluxio.cli.GetConf.getConfImpl(GetConf.java:190)
        at alluxio.cli.GetConf.getConf(GetConf.java:147)
        at alluxio.cli.GetConf.main(GetConf.java:274)
umount: /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: not mounted
Starting AlluxioFuse process: mounting alluxio path "/" to local mount point "/var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse" with options="big_writes,kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty,allow_other"
OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N
2022-04-19 11:22:07,004 INFO  AlluxioFuse - Alluxio version: 2.7.2-0703ba35241e96f8405614dc20e7c7db09167abb
2022-04-19 11:22:07,474 INFO  MetricsSystem - Starting sinks with config: {}.
2022-04-19 11:22:07,482 INFO  MetricsHeartbeatContext - Created metrics heartbeat with ID app-7986899754083014497. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2022-04-19 11:22:07,657 INFO  NettyUtils - EPOLL_MODE is available
2022-04-19 11:22:08,133 INFO  TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.250.141.108, rack=null)
2022-04-19 11:22:08,147 INFO  NativeLibraryLoader - Loaded lib by jar from path /tmp/libjnifuse7254716869503402853.so.
2022-04-19 11:22:08,502 INFO  Reflections - Reflections took 335 ms to scan 1 urls, producing 58 keys and 194 values 
2022-04-19 11:22:08,547 INFO  AlluxioFuse - Mounting AlluxioJniFuseFileSystem: mount point="/var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse", OPTIONS="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072]"
2022-04-19 11:22:08,548 INFO  AbstractFuseFileSystem - Mounting /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: blocking=true, debug=false, fuseOpts="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072]"
fuse: max_idle_threads: 64
fuse: max_idle_threads: 64
alluxio-fuse on /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse type fuse.alluxio-fuse (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,max_read=131072)
succeed in checking mount point /var/lib/docker/runtime-mnt/alluxio/demo/s3-data
2022-04-19 11:22:11,870 WARN  AlluxioFileInStream - Failed to read block 16777216 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_config.json from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/16777216 (No such file or directory).
2022-04-19 11:22:13,495 WARN  AlluxioJniFuseFileSystem - Fuse.Read(path=/s3/wwm_uncased_L-24_H-1024_A-16/bert_config.json,buf=java.nio.DirectByteBuffer[pos=314 lim=314 cap=4096],size=4096,offset=0) returned 314 in 1706 ms (>=1000 ms)
fuse: max_idle_threads: 64
2022-04-19 11:22:13,498 WARN  AlluxioFileInStream - Failed to read block 50331648 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001 from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/50331648 (No such file or directory).
fuse: max_idle_threads: 64
fuse: max_idle_threads: 64
2022-04-19 11:23:18,516 WARN  AlluxioFileInStream - Failed to read block 67108864 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.meta from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/67108864 (No such file or directory).
2022-04-19 11:23:18,645 WARN  AlluxioFileInStream - Failed to read block 33554432 of file /s3/wwm_uncased_L-24_H-1024_A-16/vocab.txt from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/33554432 (No such file or directory).
2022-04-19 11:23:18,772 WARN  AlluxioFileInStream - Failed to read block 83886080 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.index from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/83886080 (No such file or directory).

As you can see, the improvement is not significant. I can also see warnings about failed block reads in the fuse sidecar:

2022-04-19 11:22:13,498 WARN  AlluxioFileInStream - Failed to read block 50331648 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001 from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/50331648 (No such file or directory).
fuse: max_idle_threads: 64
fuse: max_idle_threads: 64
2022-04-19 11:23:18,516 WARN  AlluxioFileInStream - Failed to read block 67108864 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.meta from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/67108864 (No such file or directory).
2022-04-19 11:23:18,645 WARN  AlluxioFileInStream - Failed to read block 33554432 of file /s3/wwm_uncased_L-24_H-1024_A-16/vocab.txt from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/33554432 (No such file or directory).
2022-04-19 11:23:18,772 WARN  AlluxioFileInStream - Failed to read block 83886080 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.index from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/83886080 (No such file or directory).
  1. Is there anything wrong with my settings? Or is there any setting that can help improve the performance for this dataset?
  2. In addition, the alluxio fuse sidecar image is very large (alluxio/alluxio-dev:2.72 is 3.2 GB). I think that with this image size it is hard to meet serverless requirements.
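
The block-read warnings quoted above can be summarized with a short script. This is a minimal sketch: the regular expression is written against the log lines in this thread only, and `summarize` is a hypothetical helper, not part of Fluid or Alluxio.

```python
import re

# Pattern for the AlluxioFileInStream warning as it appears in the fuse sidecar log.
WARNING = re.compile(
    r"Failed to read block (?P<block>\d+) of file (?P<file>\S+) from worker "
    r"WorkerNetAddress\{host=(?P<host>[\d.]+),.*?"
    r"java\.io\.FileNotFoundException: (?P<path>\S+)"
)


def summarize(log_text):
    """Return one (file, worker host, missing local path) tuple per warning."""
    return [(m["file"], m["host"], m["path"]) for m in WARNING.finditer(log_text)]


# One warning copied from the second run, joined into a single line.
sample = (
    "2022-04-19 11:22:11,870 WARN  AlluxioFileInStream - Failed to read block 16777216 "
    "of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_config.json from worker "
    "WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=23043, dataPort=23043, "
    "webPort=24173, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, "
    "rack=null)}. This worker will be skipped for future read operations, will retry: "
    "java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/16777216 "
    "(No such file or directory)."
)

for file, host, path in summarize(sample):
    print(f"{file}: read via worker {host} failed, missing {path}")
```

Every warning names the same worker host, 10.250.141.108, which is also the node in the sidecar's tiered identity, and every missing path lies under /dev/shm. One possible reading, which is an inference from the logs and not confirmed in this thread, is that the client attempts a local read of the worker's tiered-store files and that directory is not visible inside the sidecar container.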

cheyang commented 2 years ago

Thank you for opening this issue. Could you please collect the logs by following https://github.com/fluid-cloudnative/fluid/blob/master/docs/zh/userguide/troubleshooting.md . Thanks.

lizzzcai commented 2 years ago

diagnose_fluid_1650373486.zip

@cheyang, please find the logs attached. I removed the S3 bucket information and added some information that the script failed to collect.

cheyang commented 2 years ago

@ssz1997 Could you also help take a look at this issue? Thanks. FYI @TrafalgarZZZ

TrafalgarZZZ commented 2 years ago

Hi @lizzzcai. I noticed that you set some Alluxio properties:

  properties:
    alluxio.web.ui.enabled: "true"
    alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
    alluxio.user.block.size.bytes.default: 256MB
    alluxio.user.streaming.reader.chunk.size.bytes: 256MB
    alluxio.user.local.reader.chunk.size.bytes: 256MB
    alluxio.worker.network.reader.buffer.size: 256MB
    alluxio.user.streaming.data.timeout: 300sec

Could you explain why these are necessary for your scenario? In my opinion, these values seem too large (e.g., a block size of 256MB) for your scenario. Did you test with the default properties provided by Fluid?

lizzzcai commented 2 years ago

Hi @TrafalgarZZZ, I tried the default settings at the beginning, but I saw the Failed to read block error. I then found some of these parameters in another issue (I think it was https://github.com/fluid-cloudnative/fluid/pull/612) and wanted to see whether they would help. In my case, the model is a single 1.3GB file.

Here I ran it again without any of the properties. Below are the logs:

umount: /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: not mounted
Starting AlluxioFuse process: mounting alluxio path "/" to local mount point "/var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse" with options="big_writes,kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty,allow_other"
OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N
2022-04-20 04:55:05,815 INFO  AlluxioFuse - Alluxio version: 2.7.2-0703ba35241e96f8405614dc20e7c7db09167abb
2022-04-20 04:55:06,289 INFO  MetricsSystem - Starting sinks with config: {}.
2022-04-20 04:55:06,298 INFO  MetricsHeartbeatContext - Created metrics heartbeat with ID app-2532123806612390426. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2022-04-20 04:55:06,492 INFO  NettyUtils - EPOLL_MODE is available
2022-04-20 04:55:07,034 INFO  TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.250.141.108, rack=null)
2022-04-20 04:55:07,053 INFO  NativeLibraryLoader - Loaded lib by jar from path /tmp/libjnifuse659397235779862725.so.
2022-04-20 04:55:07,408 INFO  Reflections - Reflections took 339 ms to scan 1 urls, producing 58 keys and 194 values 
2022-04-20 04:55:07,471 INFO  AlluxioFuse - Mounting AlluxioJniFuseFileSystem: mount point="/var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse", OPTIONS="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072]"
2022-04-20 04:55:07,472 INFO  AbstractFuseFileSystem - Mounting /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: blocking=true, debug=false, fuseOpts="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072]"
fuse: max_idle_threads: 64
fuse: max_idle_threads: 64
alluxio-fuse on /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse type fuse.alluxio-fuse (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,max_read=131072)
succeed in checking mount point /var/lib/docker/runtime-mnt/alluxio/demo/s3-data
2022-04-20 04:55:10,384 WARN  AlluxioFileInStream - Failed to read block 16777216 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_config.json from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=21657, dataPort=21657, webPort=23500, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/16777216 (No such file or directory).
2022-04-20 04:55:11,718 WARN  AlluxioJniFuseFileSystem - Fuse.Read(path=/s3/wwm_uncased_L-24_H-1024_A-16/bert_config.json,buf=java.nio.DirectByteBuffer[pos=314 lim=314 cap=4096],size=4096,offset=0) returned 314 in 1406 ms (>=1000 ms)
fuse: max_idle_threads: 64
2022-04-20 04:55:11,723 WARN  AlluxioFileInStream - Failed to read block 50331648 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001 from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=21657, dataPort=21657, webPort=23500, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/50331648 (No such file or directory).
fuse: max_idle_threads: 64
2022-04-20 04:56:06,900 WARN  AlluxioFileInStream - Failed to read block 67108864 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.meta from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=21657, dataPort=21657, webPort=23500, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/67108864 (No such file or directory).
2022-04-20 04:56:07,064 WARN  AlluxioFileInStream - Failed to read block 33554432 of file /s3/wwm_uncased_L-24_H-1024_A-16/vocab.txt from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=21657, dataPort=21657, webPort=23500, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/33554432 (No such file or directory).
2022-04-20 04:56:07,155 WARN  AlluxioFileInStream - Failed to read block 83886080 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.index from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=21657, dataPort=21657, webPort=23500, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/83886080 (No such file or directory).
❯ k logs -n demo hello-5cdcc6fd66-7qfc5 hello
Begin loading models at 04:55:10

real    0m57.008s
user    0m0.000s
sys     0m1.987s
Finish loading models at 04:56:07
2022/04/20 04:56:07 helloworld: starting server...
2022/04/20 04:56:07 helloworld: listening on port 8080
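
For comparison, the three load times reported so far can be converted into effective throughput. This is a back-of-the-envelope sketch that assumes the full 1.25 GiB shown by `kubectl get dataset` is read on every run. The helper name `throughput_mib_s` is illustrative only.

```python
dataset_mib = 1.25 * 1024  # 1.25 GiB, as reported by kubectl get dataset (rounded)

# Wall-clock ("real") load times in seconds, copied from the three runs in this thread.
runs = {
    "first deployment, custom properties": 89.344,
    "redeployment, custom properties": 67.119,
    "default properties": 57.008,
}


def throughput_mib_s(size_mib, seconds):
    """Effective read throughput in MiB/s for one model load."""
    return size_mib / seconds


for label, seconds in runs.items():
    print(f"{label}: {throughput_mib_s(dataset_mib, seconds):.1f} MiB/s")
# prints:
# first deployment, custom properties: 14.3 MiB/s
# redeployment, custom properties: 19.1 MiB/s
# default properties: 22.5 MiB/s
```

All three figures are in the low tens of MiB/s, which is consistent with the observation that caching brought little improvement in these runs.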

lizzzcai commented 2 years ago

diagnose_fluid_1650605454.zip

Attached are the logs with debug enabled. I turned on debug logging with:

kubectl exec -n demo s3-data-master-0 -- ./bin/alluxio logLevel --logName=alluxio --level=DEBUG

lizzzcai commented 2 years ago

worker.log

Hi @cheyang @ssz1997, attached are the worker logs from today's session. Thanks for your support.

lizzzcai commented 2 years ago

@ssz1997 @cheyang

I am not sure what happened. Today I tried Fluid 0.7 with alluxio/alluxio:2.7.3, but I hit the error below when using serverless mode. In non-serverless mode it works fine. The logs are attached.

diagnose_fluid_1652190814.zip

❯ k logs hello-5cdcc6fd66-qdht7 -n demo -c fluid-fuse
Exception in thread "main" java.lang.RuntimeException: Invalid property key ALLUXIO_CLIENT_HOSTNAME
        at alluxio.conf.InstancedConfiguration.lookupRecursively(InstancedConfiguration.java:454)
        at alluxio.conf.InstancedConfiguration.lookup(InstancedConfiguration.java:427)
        at alluxio.conf.InstancedConfiguration.isResolvable(InstancedConfiguration.java:170)
        at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:179)
        at alluxio.conf.AlluxioConfiguration.getOrDefault(AlluxioConfiguration.java:64)
        at alluxio.cli.GetConf.getConfImpl(GetConf.java:190)
        at alluxio.cli.GetConf.getConf(GetConf.java:147)
        at alluxio.cli.GetConf.main(GetConf.java:274)
umount: can't unmount /runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: Invalid argument
Starting AlluxioFuse process: mounting alluxio path "/" to local mount point "/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse" with options="big_writes,kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty,allow_other"
2022-05-10 13:42:59,941 INFO  AlluxioFuse - Alluxio version: 2.7.3-bf10f79cbd99afc99f2d890ceeabb1ad939c3896
2022-05-10 13:43:00,390 INFO  MetricsSystem - Starting sinks with config: {}.
2022-05-10 13:43:00,399 INFO  MetricsHeartbeatContext - Created metrics heartbeat with ID app-4814522591424508939. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2022-05-10 13:43:00,616 INFO  NettyUtils - EPOLL_MODE is available
2022-05-10 13:43:01,123 INFO  TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.250.11.35, rack=null)
2022-05-10 13:43:01,151 INFO  NativeLibraryLoader - Loaded lib by jar from path /tmp/libjnifuse9026317484499696174.so.
2022-05-10 13:43:01,550 INFO  Reflections - Reflections took 378 ms to scan 1 urls, producing 58 keys and 194 values 
2022-05-10 13:43:01,636 INFO  AlluxioFuse - Mounting AlluxioJniFuseFileSystem: mount point="/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse", OPTIONS="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072, -odefault_permissions]"
2022-05-10 13:43:01,636 INFO  AbstractFuseFileSystem - Mounting /runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: blocking=true, debug=false, fuseOpts="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072, -odefault_permissions]"
fuse: max_idle_threads: 64
fuse: max_idle_threads: 64
timed out!
2022-05-10 13:43:29,048 INFO  AbstractFuseFileSystem - Umounting /runtime-mnt/alluxio/demo/s3-data/alluxio-fuse
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fb51c114c9f, pid=1, tid=0x00007fb51ba6bb20
#
# JRE version: OpenJDK Runtime Environment (8.0_275-b01) (build 1.8.0_275-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.275-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 3.17.1
# Distribution: Custom build (Tue Feb 16 06:20:21 UTC 2021)
# Problematic frame:
# V  [libjvm.so+0x53fc9f]
#
# Core dump written. Default location: /opt/alluxio-2.7.3/core or core.1
#
# An error report file with more information is saved as:
# /opt/alluxio-2.7.3/hs_err_pid1.log
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   https://icedtea.classpath.org/bugzilla
#
cheyang commented 2 years ago

@lizzzcai Given the latest performance testing result, can we close the issue now?

lizzzcai commented 2 years ago

> @lizzzcai Given the latest performance testing result, can we close the issue now?

Hi @cheyang, we can see a good improvement with the new runtime version. This issue can be closed now.

TrafalgarZZZ commented 2 years ago

Closed by #1822