apple / axlearn

An Extensible Deep Learning Library
Apache License 2.0
1.88k stars 269 forks

Implement custom `max_data_shard_degree` and `shard_threshold_bytes` #838

Closed hanzhi713 closed 1 week ago

hanzhi713 commented 1 week ago

Users can tune these two knobs to reduce the number of checkpoint files when data sharding is enabled.
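
To illustrate how the two knobs might interact, here is a minimal sketch, not the code from this PR. The helper `data_shard_degree` is hypothetical, and the semantics are assumptions: a degree of `None` or `1` disables data sharding, a negative degree means "shard across all data-parallel replicas", and shards smaller than `shard_threshold_bytes` are written unsplit.

```python
from typing import Optional


def data_shard_degree(
    shard_bytes: int,
    data_parallel_size: int,
    max_data_shard_degree: Optional[int],
    shard_threshold_bytes: Optional[int],
) -> int:
    """Hypothetical helper: number of pieces one array shard is split into on save.

    The semantics here are assumed for illustration and may differ from axlearn's.
    """
    # Assumption: None, 0, or 1 means data sharding is effectively off.
    if not max_data_shard_degree or max_data_shard_degree == 1:
        return 1
    # Assumption: shards below the threshold are not worth splitting.
    if shard_threshold_bytes and shard_threshold_bytes > 0:
        if shard_bytes < shard_threshold_bytes:
            return 1
    # Assumption: a negative degree means "use every data-parallel replica".
    if max_data_shard_degree < 0:
        return data_parallel_size
    return min(max_data_shard_degree, data_parallel_size)


# Two shards (1 KiB and 64 MiB) on 8 data-parallel replicas.
sizes = [1024, 64 * 1024 * 1024]
unbounded = sum(data_shard_degree(s, 8, -1, None) for s in sizes)
tuned = sum(data_shard_degree(s, 8, 4, 1024 * 1024) for s in sizes)
```

Under these assumptions, capping the degree and setting a threshold reduces the number of written pieces in this toy case, which is the effect the PR description refers to.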