Closed lizzzcai closed 2 years ago
Thank you for opening this issue. Could you please collect the logs by following https://github.com/fluid-cloudnative/fluid/blob/master/docs/zh/userguide/troubleshooting.md . Thanks.
diagnose_fluid_1650373486.zip @cheyang, please find the logs attached. I removed the S3 bucket details and added some information that the script failed to collect.
@ssz1997 Could you also help take a look at this issue? Thanks. FYI @TrafalgarZZZ
Hi @lizzzcai . I've noticed that you set some properties of Alluxio:
properties:
  alluxio.web.ui.enabled: "true"
  alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
  alluxio.user.block.size.bytes.default: 256MB
  alluxio.user.streaming.reader.chunk.size.bytes: 256MB
  alluxio.user.local.reader.chunk.size.bytes: 256MB
  alluxio.worker.network.reader.buffer.size: 256MB
  alluxio.user.streaming.data.timeout: 300sec
Could you explain why these are necessary for your scenario? In my opinion, these values seem too large (e.g., a block size of 256MB) for your scenario. Did you test with the default properties provided by Fluid?
Hi @TrafalgarZZZ, I tried the default settings at the beginning, but I saw the error "Failed to read block". I then found some of these parameters in another issue (I think it was https://github.com/fluid-cloudnative/fluid/pull/612) and tried them to see whether they would help. In my case, the model is a single 1.3GB file.
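To put the block-size question in concrete terms, here is a rough sketch of how many blocks a single 1.3 GB file would split into at each setting. It assumes the file is exactly 1.3 GiB and that the stock Alluxio default for alluxio.user.block.size.bytes.default is 64MB; the defaults that Fluid applies may differ. `block_count` is a hypothetical helper for illustration, not part of Alluxio or Fluid.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def block_count(file_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed to hold file_bytes (ceiling division)."""
    return -(-file_bytes // block_bytes)


# Assumption: "1.3GB" from the thread is treated as 1.3 GiB.
file_bytes = int(1.3 * GIB)

# Assumption: 64MB is the stock Alluxio default block size.
default_blocks = block_count(file_bytes, 64 * MIB)
custom_blocks = block_count(file_bytes, 256 * MIB)

print("blocks at 64MB :", default_blocks)
print("blocks at 256MB:", custom_blocks)
```

Under these assumptions the file spans 21 blocks at 64MB and 6 blocks at 256MB, so the larger setting mainly changes how much data each block request covers.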
Here I tried to run it without all the properties, below are the logs:
umount: /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: not mounted
Starting AlluxioFuse process: mounting alluxio path "/" to local mount point "/var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse" with options="big_writes,kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty,allow_other"
OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N
2022-04-20 04:55:05,815 INFO AlluxioFuse - Alluxio version: 2.7.2-0703ba35241e96f8405614dc20e7c7db09167abb
2022-04-20 04:55:06,289 INFO MetricsSystem - Starting sinks with config: {}.
2022-04-20 04:55:06,298 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-2532123806612390426. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2022-04-20 04:55:06,492 INFO NettyUtils - EPOLL_MODE is available
2022-04-20 04:55:07,034 INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.250.141.108, rack=null)
2022-04-20 04:55:07,053 INFO NativeLibraryLoader - Loaded lib by jar from path /tmp/libjnifuse659397235779862725.so.
2022-04-20 04:55:07,408 INFO Reflections - Reflections took 339 ms to scan 1 urls, producing 58 keys and 194 values
2022-04-20 04:55:07,471 INFO AlluxioFuse - Mounting AlluxioJniFuseFileSystem: mount point="/var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse", OPTIONS="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072]"
2022-04-20 04:55:07,472 INFO AbstractFuseFileSystem - Mounting /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: blocking=true, debug=false, fuseOpts="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072]"
fuse: max_idle_threads: 64
fuse: max_idle_threads: 64
alluxio-fuse on /var/lib/docker/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse type fuse.alluxio-fuse (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,max_read=131072)
succeed in checking mount point /var/lib/docker/runtime-mnt/alluxio/demo/s3-data
2022-04-20 04:55:10,384 WARN AlluxioFileInStream - Failed to read block 16777216 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_config.json from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=21657, dataPort=21657, webPort=23500, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/16777216 (No such file or directory).
2022-04-20 04:55:11,718 WARN AlluxioJniFuseFileSystem - Fuse.Read(path=/s3/wwm_uncased_L-24_H-1024_A-16/bert_config.json,buf=java.nio.DirectByteBuffer[pos=314 lim=314 cap=4096],size=4096,offset=0) returned 314 in 1406 ms (>=1000 ms)
fuse: max_idle_threads: 64
2022-04-20 04:55:11,723 WARN AlluxioFileInStream - Failed to read block 50331648 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001 from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=21657, dataPort=21657, webPort=23500, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/50331648 (No such file or directory).
fuse: max_idle_threads: 64
2022-04-20 04:56:06,900 WARN AlluxioFileInStream - Failed to read block 67108864 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.meta from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=21657, dataPort=21657, webPort=23500, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/67108864 (No such file or directory).
2022-04-20 04:56:07,064 WARN AlluxioFileInStream - Failed to read block 33554432 of file /s3/wwm_uncased_L-24_H-1024_A-16/vocab.txt from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=21657, dataPort=21657, webPort=23500, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/33554432 (No such file or directory).
2022-04-20 04:56:07,155 WARN AlluxioFileInStream - Failed to read block 83886080 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.index from worker WorkerNetAddress{host=10.250.141.108, containerHost=, rpcPort=21657, dataPort=21657, webPort=23500, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.250.141.108, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/demo/s3-data/alluxioworker/83886080 (No such file or directory).
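The warnings above can be summarized with a small script. This is a sketch, not a Fluid or Alluxio tool: `parse_failed_reads` is a hypothetical helper, and splitting a block ID into a container ID and a 24-bit sequence number is an assumption about Alluxio's block ID layout that should be checked against the Alluxio source for the version in use. The sample lines are copied from the log above and cut off after "from worker".

```python
import re

# Matches the AlluxioFileInStream warning format seen in the fuse log.
WARN_RE = re.compile(r"Failed to read block (\d+) of file (\S+) from worker")

# Assumption: the low 24 bits of a block ID are the block's sequence
# number within its file, and the remaining high bits are a container ID.
SEQUENCE_BITS = 24


def parse_failed_reads(log_text: str) -> list:
    """Return one dict per 'Failed to read block' warning in log_text."""
    rows = []
    for match in WARN_RE.finditer(log_text):
        block_id = int(match.group(1))
        rows.append({
            "path": match.group(2),
            "block_id": block_id,
            "container": block_id >> SEQUENCE_BITS,
            "sequence": block_id & ((1 << SEQUENCE_BITS) - 1),
        })
    return rows


# Two lines from the log above, shortened after "from worker".
sample = (
    "2022-04-20 04:55:10,384 WARN AlluxioFileInStream - Failed to read block "
    "16777216 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_config.json from worker\n"
    "2022-04-20 04:55:11,723 WARN AlluxioFileInStream - Failed to read block "
    "50331648 of file /s3/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001 from worker\n"
)

rows = parse_failed_reads(sample)
for row in rows:
    print(row["path"], row["block_id"], row["container"], row["sequence"])
```

Under that assumption, each of the five block IDs in the log decodes to sequence number 0, which would mean every warning concerns the first block of a different file.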
❯ k logs -n demo hello-5cdcc6fd66-7qfc5 hello
Begin loading models at 04:55:10
real 0m57.008s
user 0m0.000s
sys 0m1.987s
Finish loading models at 04:56:07
2022/04/20 04:56:07 helloworld: starting server...
2022/04/20 04:56:07 helloworld: listening on port 8080
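As a rough sanity check on the timing above, the effective read throughput can be estimated from the reported model size and the wall-clock time. This sketch assumes the load reads about 1.3 GiB in total (the size given for the model in this thread; the other files in the directory are ignored) and uses the `real` value from the log. `throughput_mib_per_s` is a hypothetical helper.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def throughput_mib_per_s(size_bytes: float, seconds: float) -> float:
    """Average throughput in MiB/s for size_bytes read in the given time."""
    return size_bytes / seconds / MIB


size_bytes = 1.3 * GIB   # assumption: total data read is about 1.3 GiB
elapsed_s = 57.008       # "real 0m57.008s" from the log above

rate = throughput_mib_per_s(size_bytes, elapsed_s)
print(f"{rate:.1f} MiB/s")
```

This works out to roughly 23 MiB/s. Since `user` and `sys` together account for only about 2 of the 57 seconds, the process appears to have spent most of the time waiting on reads rather than computing.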
Attached are the logs with debug enabled. I enabled debug logging with:
kubectl exec -n demo s3-data-master-0 -- ./bin/alluxio logLevel --logName=alluxio --level=DEBUG
worker.log
Hi @cheyang @ssz1997, attached are the worker logs from today's session. Thanks for your support.
@ssz1997 @cheyang
I am not sure what is happening. Today I tried fluid 0.7 + alluxio/alluxio:2.7.3, but I hit the error below when using the serverless mode. In the non-serverless mode, it works fine. The logs are attached.
❯ k logs hello-5cdcc6fd66-qdht7 -n demo -c fluid-fuse
Exception in thread "main" java.lang.RuntimeException: Invalid property key ALLUXIO_CLIENT_HOSTNAME
at alluxio.conf.InstancedConfiguration.lookupRecursively(InstancedConfiguration.java:454)
at alluxio.conf.InstancedConfiguration.lookup(InstancedConfiguration.java:427)
at alluxio.conf.InstancedConfiguration.isResolvable(InstancedConfiguration.java:170)
at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:179)
at alluxio.conf.AlluxioConfiguration.getOrDefault(AlluxioConfiguration.java:64)
at alluxio.cli.GetConf.getConfImpl(GetConf.java:190)
at alluxio.cli.GetConf.getConf(GetConf.java:147)
at alluxio.cli.GetConf.main(GetConf.java:274)
umount: can't unmount /runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: Invalid argument
Starting AlluxioFuse process: mounting alluxio path "/" to local mount point "/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse" with options="big_writes,kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty,allow_other"
2022-05-10 13:42:59,941 INFO AlluxioFuse - Alluxio version: 2.7.3-bf10f79cbd99afc99f2d890ceeabb1ad939c3896
2022-05-10 13:43:00,390 INFO MetricsSystem - Starting sinks with config: {}.
2022-05-10 13:43:00,399 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-4814522591424508939. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2022-05-10 13:43:00,616 INFO NettyUtils - EPOLL_MODE is available
2022-05-10 13:43:01,123 INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.250.11.35, rack=null)
2022-05-10 13:43:01,151 INFO NativeLibraryLoader - Loaded lib by jar from path /tmp/libjnifuse9026317484499696174.so.
2022-05-10 13:43:01,550 INFO Reflections - Reflections took 378 ms to scan 1 urls, producing 58 keys and 194 values
2022-05-10 13:43:01,636 INFO AlluxioFuse - Mounting AlluxioJniFuseFileSystem: mount point="/runtime-mnt/alluxio/demo/s3-data/alluxio-fuse", OPTIONS="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072, -odefault_permissions]"
2022-05-10 13:43:01,636 INFO AbstractFuseFileSystem - Mounting /runtime-mnt/alluxio/demo/s3-data/alluxio-fuse: blocking=true, debug=false, fuseOpts="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072, -odefault_permissions]"
fuse: max_idle_threads: 64
fuse: max_idle_threads: 64
timed out!
2022-05-10 13:43:29,048 INFO AbstractFuseFileSystem - Umounting /runtime-mnt/alluxio/demo/s3-data/alluxio-fuse
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fb51c114c9f, pid=1, tid=0x00007fb51ba6bb20
#
# JRE version: OpenJDK Runtime Environment (8.0_275-b01) (build 1.8.0_275-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.275-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 3.17.1
# Distribution: Custom build (Tue Feb 16 06:20:21 UTC 2021)
# Problematic frame:
# V [libjvm.so+0x53fc9f]
#
# Core dump written. Default location: /opt/alluxio-2.7.3/core or core.1
#
# An error report file with more information is saved as:
# /opt/alluxio-2.7.3/hs_err_pid1.log
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
# https://icedtea.classpath.org/bugzilla
#
@lizzzcai Given the latest performance testing result, can we close the issue now?
Hi @cheyang, we can see a good improvement from the new runtime version, this issue can be closed now.
Closed by #1822
Hi Fluid team, I am trying to follow the serverless demo here.
Instead of using data from the web, my dataset is from S3 (a 1.3 GB model: BERT-Large, Uncased (Whole Word Masking)). My k8s version is 1.21.10 with 3 nodes. I labelled two of them with tenantId: tenant1. Fluid version: v0.7.
Below is an example of my dataset:
Below is my deployment:
After applying my dataset, the logs from the dataset master look fine.
Then I tried to deploy my deployment. Below are the logs.
Checking the dataset and runtime, the data was cached.
I deleted the deployment and redeployed it.
As you can see, the improvement is not very significant. I can even see some warnings about reading blocks in the fuse sidecar.
alluxio/alluxio-dev:2.72 is 3.2 GB), I think with this image size it is hard to meet the serverless requirement.