determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0
2.99k stars, 348 forks

💡[feat] Adding an option to specify docker volumes besides bind_mounts. #5799

Open chjz1024 opened 1 year ago

chjz1024 commented 1 year ago

Describe the problem

Currently, an experiment must specify the bind_mounts option in order to reuse existing files (such as datasets) on the host. However, every agent must also hold an identical copy of those files at the path given by host_path to guarantee the same behavior for each experiment. This becomes painful and error-prone as the number of agents grows, even when an NFS service is deployed to keep the data consistent; for example, bind-mounting a subdirectory of an existing NFS share seems to raise a permission problem.

The root cause is that the bind_mounts option cannot specify a storage driver. In addition, many modern cloud storage solutions, such as Hadoop's HDFS and the object storage offerings of AWS, GCP, Azure, and Alibaba, need a custom driver before they can be mounted as ordinary file storage in a container. Docker volumes support this through the --mount option. For example, we can select the NFS volume driver on the command line like docker run --mount 'type=volume,src=<VOLUME-NAME>,dst=<CONTAINER-PATH>,volume-driver=local,volume-opt=type=nfs,volume-opt=device=<nfs-server>:<nfs-path>,"volume-opt=o=addr=<nfs-address>,vers=4,soft,timeo=180,bg,tcp,rw"' <image> <command>. With this, we no longer need to manually mount the NFS share on each agent, download files in each experiment, or modify existing DataLoaders. Furthermore, an existing cloud storage service can be used like ordinary files, with Quality of Service managed by the storage driver instead.
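To make the quoting in that --mount value concrete, here is a minimal Python sketch that assembles the same argument from its parts. The helper name nfs_mount_arg and the sample values are hypothetical and are not part of Determined or Docker; the sketch only assumes that --mount takes a comma-separated list in which a field that itself contains commas is wrapped in double quotes, as in the command above.

```python
import csv
import io


def nfs_mount_arg(volume, container_path, nfs_server, nfs_path,
                  nfs_address, options="vers=4,soft,timeo=180,bg,tcp,rw"):
    """Build the value passed to `docker run --mount` for an NFS-backed volume.

    Hypothetical helper for illustration only; it mirrors the fields of the
    command quoted in this issue and does not call Docker.
    """
    fields = [
        "type=volume",
        f"src={volume}",
        f"dst={container_path}",
        "volume-driver=local",
        "volume-opt=type=nfs",
        f"volume-opt=device={nfs_server}:{nfs_path}",
        # This field contains commas, so it must end up double-quoted.
        f"volume-opt=o=addr={nfs_address},{options}",
    ]
    buf = io.StringIO()
    # csv's default minimal quoting wraps only the fields that contain commas.
    csv.writer(buf, lineterminator="").writerow(fields)
    return buf.getvalue()


if __name__ == "__main__":
    # Sample values are assumptions chosen for the example.
    print(nfs_mount_arg("datasets", "/data", "nfs.example.com",
                        "/export/datasets", "10.0.0.5"))
```

When the resulting string is passed to docker as a single argv element (for example via subprocess.run with a list), the outer single quotes from the shell command are not needed; only the inner double quotes around the comma-containing field remain.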

Describe the solution you'd like

I have not read the source code of this project, but I'm guessing the config is translated into raw docker run commands? If so, simply adding a translation to --mount should work.

Describe alternatives you've considered

The biggest problem for me is how to easily use an existing cloud storage service. So a specific solution for a common cloud storage service like NFS would also be acceptable.

Additional context

Also add the option to mount tmpfs?

rb-determined-ai commented 1 year ago

This is a reasonable feature request; I'll make an internal ticket for it.

But also I want to point out that we normally don't recommend using network-mounted filesystems for training, for performance reasons.