Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0
6.87k stars 2.94k forks

Support 100 million to 1 billion small files in AI/ML training #14932

Open LuQQiu opened 2 years ago

LuQQiu commented 2 years ago

Is your feature request related to a problem? Please describe.

Training against Alluxio performs well with fewer than 100 million small files. When the training dataset grows to between 100 million and 1 billion small files, training performance degrades significantly, especially when the training job performs a global data shuffle between epochs and multiple nodes are involved.

The degradation arises because the Alluxio FUSE client may not be able to hold all cached metadata in process memory. The entire local metadata cache is invalidated between epochs, so the Alluxio master has to serve metadata requests for the whole dataset during each epoch.
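As a rough illustration of why the metadata may not fit in process memory, here is a back-of-the-envelope sketch. The per-entry size (`ASSUMED_BYTES_PER_ENTRY`) is a hypothetical placeholder, not a measured Alluxio figure; the real footprint would need to be profiled on the client.

```python
# Back-of-the-envelope estimate of client-side metadata cache size.
# ASSUMED_BYTES_PER_ENTRY is a hypothetical value, not measured from Alluxio.
ASSUMED_BYTES_PER_ENTRY = 1024


def metadata_cache_bytes(num_files: int, bytes_per_entry: int) -> int:
    """Return the RAM (in bytes) needed to cache metadata for num_files files."""
    return num_files * bytes_per_entry


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


if __name__ == "__main__":
    for num_files in (100_000_000, 1_000_000_000):
        needed = metadata_cache_bytes(num_files, ASSUMED_BYTES_PER_ENTRY)
        print(f"{num_files:>13,} files: about {to_gib(needed):.1f} GiB")
```

Under this assumed entry size, 100 million files already need roughly 95 GiB, which is why measuring the actual per-file footprint matters before choosing between an in-memory and an on-disk cache.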

Describe the solution you'd like

  1. Improve the Alluxio master's capacity to serve metadata requests. One approach is to let standby masters serve metadata read requests.
  2. Improve the Alluxio client's ability to cache metadata locally.
    • Measure how much file metadata can be cached locally and how much RAM that requires.
    • If RAM is not enough, support a client-side metadata cache backed by RocksDB?
    • Support pinning client-side metadata? If the local cache can hold only 40% of the total metadata, that 40% is always served locally and the remaining 60% is always served remotely. This way, locally cached metadata stays useful across shuffled epochs instead of being churned out.
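The pinning idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not Alluxio code: it compares a hypothetical LRU-evicting cache with a pinned cache when every epoch reads all files once in a freshly shuffled order. The function names, the LRU baseline, and the sizes are all illustrative.

```python
import random
from collections import OrderedDict


def lru_hit_rate(num_files: int, capacity: int, epochs: int, seed: int = 0) -> float:
    """Hit rate of an LRU cache over shuffled epochs (first epoch is warm-up)."""
    rng = random.Random(seed)
    cache = OrderedDict()  # keys ordered from least to most recently used
    order = list(range(num_files))
    hits = total = 0
    for epoch in range(epochs):
        rng.shuffle(order)  # global shuffle between epochs
        for f in order:
            counted = epoch > 0  # do not count the cold first epoch
            total += counted
            if f in cache:
                cache.move_to_end(f)
                hits += counted
            else:
                cache[f] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total


def pinned_hit_rate(num_files: int, capacity: int, epochs: int, seed: int = 0) -> float:
    """Hit rate when the first `capacity` files seen are pinned and never evicted."""
    rng = random.Random(seed)
    pinned = set()
    order = list(range(num_files))
    hits = total = 0
    for epoch in range(epochs):
        rng.shuffle(order)
        for f in order:
            counted = epoch > 0
            total += counted
            if f in pinned:
                hits += counted
            elif len(pinned) < capacity:
                pinned.add(f)  # pin until the cache is full, then stop admitting
    return hits / total


if __name__ == "__main__":
    n, cap = 10_000, 4_000  # cache holds 40% of the metadata
    print("LRU hit rate:   ", lru_hit_rate(n, cap, epochs=4))
    print("pinned hit rate:", pinned_hit_rate(n, cap, epochs=4))
```

With pinning, the hit rate after warm-up equals `capacity / num_files` (0.4 here), because the pinned entries are never evicted. The LRU baseline cannot exceed that ratio under a full shuffle and typically falls below it, since entries are often evicted before they are read again.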

LuQQiu commented 2 years ago

@yyongycy @yuzhu @maobaolong @ssz1997 @apc999 I created this GitHub issue to track ideas on how to support 100 million or 1 billion small files. Please share your suggestions. We will discuss them in the future.

yyongycy commented 2 years ago

A few heuristic questions:

  1. Are all files modified or created? How many files are static (only the file atime changes)?
  2. Can a single master handle that much metadata in a performant way?
  3. Is none of the cached metadata usable in the next epoch?
  4. Does the hardware configuration (CPU, memory, network) match the scale of the data being managed?

jayzhenghan commented 2 years ago

How do you access files that exist in the UFS but not in GooseFS? [follower-readOnly]