Open VincentLeeMax opened 1 year ago
Thanks for raising the issue. Before we dive into this, I would recommend trying Alluxio 3.x instead of 2.9.3. Its performance has been shown to be better than Alluxio v2. For the newest version, you can refer to the docs here: https://docs.alluxio.io/os/user/edge/en/Overview.html
Alluxio Version: 2.9.3
Describe the bug: I use Alluxio for TensorFlow training as a replacement for CephFS (the source data is in HDFS). I found that when I use more dataset-reading threads (which are necessary for our specific read behavior), the training speed drops by about 10%. The host CPU load stays almost the same (20% CPU usage).
I ran the same training directly on CephFS, and there the performance improves as the parallelism increases.
After profiling with async-profiler, the limitation seems to come from the FUSE kernel layer. Please help me analyze it. Attachment: profile_result.zip
Expected behavior: After increasing the parallelism, performance should improve or at least remain the same.
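To help isolate the problem from TensorFlow, here is a minimal sketch that reads every file under a directory with a configurable number of threads and reports the elapsed time. It is not the training pipeline from this report. The helper names (`read_file`, `list_files`, `measure_throughput`) are made up for illustration, and the mount points you pass in (for example an Alluxio FUSE mount and a CephFS mount) are assumptions about your setup.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path, chunk_size=1 << 20):
    """Read one file sequentially in 1 MiB chunks and return its size in bytes."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def list_files(root):
    """Collect all regular files under root, sorted for a stable read order."""
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def measure_throughput(root, num_threads):
    """Read all files under root with num_threads workers.

    Returns (total_bytes, elapsed_seconds). Repeated runs over the same
    files may be served from the kernel page cache, so compare runs on
    cold data or on datasets larger than memory.
    """
    paths = list_files(root)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        total_bytes = sum(pool.map(read_file, paths))
    elapsed = time.perf_counter() - start
    return total_bytes, elapsed
```

Calling `measure_throughput(mount_path, n)` for several values of `n` on each mount shows whether the slowdown at higher parallelism reproduces with plain file reads. If it does, the FUSE layer is a more likely suspect than the TensorFlow input pipeline.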