Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

Performance drop when increasing the reading parallelism #18037

Open VincentLeeMax opened 1 year ago

VincentLeeMax commented 1 year ago

Alluxio Version: 2.9.3

Describe the bug I use Alluxio for TensorFlow training as a replacement for CephFS (the source data is in HDFS). I found that when I use more dataset reading threads (necessary for a specific read behavior), the training speed drops by about 10%. The host CPU load stays almost the same (20% CPU usage).

I ran the same training directly on CephFS, and there the performance improves as the parallelism increases.
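To separate the storage layer from TensorFlow itself, a standalone read benchmark can be run against both mount points with different thread counts. The following is only a minimal sketch under assumptions: it uses plain sequential reads with the Python standard library rather than the actual dataset pipeline, and the function names and the mount paths in the comment are hypothetical.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path, chunk_size=1 << 20):
    """Read one file sequentially and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def list_files(root):
    """Collect every regular file under root, in a stable order."""
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def measure_throughput(paths, num_threads):
    """Read all paths with num_threads workers; return (bytes, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        total = sum(pool.map(read_file, paths))
    return total, time.perf_counter() - start


def sweep(root, thread_counts=(1, 2, 4, 8, 16)):
    """Return a dict mapping thread count to throughput in MB/s."""
    paths = list_files(root)
    results = {}
    for n in thread_counts:
        total, elapsed = measure_throughput(paths, n)
        results[n] = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return results


# Hypothetical usage; replace the paths with the real mount points:
#   print(sweep("/mnt/alluxio-fuse/dataset"))
#   print(sweep("/mnt/cephfs/dataset"))
```

Repeated passes over the same files may be served from the OS page cache or the Alluxio cache, so the numbers are only comparable if each run starts from the same cache state.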

After profiling with async-profiler, the limitation seems to come from the FUSE kernel layer. Please help me analyze it. profile_result.zip

Expected behavior After increasing the parallelism, performance should improve or stay the same.

ssz1997 commented 1 year ago

Thanks for raising the issue. Before we dive into this, I would recommend trying out Alluxio 3.x instead of 2.9.3. Its performance has been proven to be better than that of Alluxio v2. For the newest version you can refer to the doc here: https://docs.alluxio.io/os/user/edge/en/Overview.html