eth-easl / modyn

Modyn is a research-platform for training ML models on growing datasets.
MIT License

Useability issues on different machines #303

Open Sipondo opened 1 year ago

Sipondo commented 1 year ago

This is an umbrella issue for reporting useability issues when running Modyn on different hardware.

Sipondo commented 1 year ago

The setup instructions ask the user to run the setup script via ./initial_setup.sh, and this bash script writes a small token file to record that it has already run. However, the script does not stop when one of its commands fails (e.g. because of a broken docker installation), so it still writes the token, and the user has to delete the token manually before the setup can be rerun.

Changing to bash -e ./initial_setup.sh should resolve this issue.
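
To illustrate the difference, here is a minimal sketch. It uses a throwaway stand-in script, not the real initial_setup.sh, and the token file name `.setup_done` is a hypothetical placeholder. The stand-in fails on its first command and then writes a token, which is the behaviour described above.

```shell
#!/usr/bin/env bash
# Stand-in for a setup script; NOT the real initial_setup.sh.
workdir="$(mktemp -d)"
token="$workdir/.setup_done"   # hypothetical token file name

# The heredoc is unquoted on purpose so that $token expands to the full path.
cat > "$workdir/setup.sh" <<EOF
false            # stands in for a failing step, e.g. a broken docker call
touch "$token"   # marks the setup as complete
EOF

# Without -e: the failure is ignored and the token is written anyway.
bash "$workdir/setup.sh" || true
[ -f "$token" ] && plain_result="token written" || plain_result="no token"
rm -f "$token"

# With -e: bash exits at the first failing command, so no token is written.
bash -e "$workdir/setup.sh" || true
[ -f "$token" ] && strict_result="token written" || strict_result="no token"

echo "bash:    $plain_result"
echo "bash -e: $strict_result"
```

The first run reports that the token was written despite the failure; the second reports no token. Putting `set -e` at the top of the script itself should have the same effect without relying on how the user invokes it.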

Sipondo commented 1 year ago

~The installation instructions do not specify that the listed conda/mamba environment is for CPU only. Perhaps we should add a note, and maybe even add a separate environment_cuda.yml instead of having the user change the .yml file.~

Add a note to the .yml source file that the lines specific to CUDA are managed by the initialisation script.

Sipondo commented 1 year ago

For new GPUs, CUDA Toolkit version 11.8 or higher may be required. This version is no longer available as a conda package directly, and instead four packages should be installed:

  - nvidia::cuda-libraries-dev=11.8.*
  - nvidia::cuda-nvcc=11.8.*
  - nvidia::cuda-nvtx=11.8.*
  - nvidia::cuda-cupti=11.8.*

Older versions are also available under these packages, so we should simply switch to these.

Sipondo commented 1 year ago

Docker volumes can cause problems when setting up Modyn (for the first time). Perhaps either move away from using them, or add documentation for troubleshooting these issues.

Sipondo commented 1 year ago

The modyn-supervisor command in the example (https://github.com/eth-easl/modyn/blob/main/docs/EXAMPLE.md) is missing an evaluation directory.

Sipondo commented 1 year ago

https://github.com/eth-easl/modyn/tree/main/benchmark/mnist seems to be outdated and suggests running mnist-supervisor instead of modyn-supervisor. Perhaps we should simply link to https://github.com/eth-easl/modyn/blob/main/docs/EXAMPLE.md or merge these documents.

francescodeaglio commented 1 year ago

I'll add this here, since it seems like the right place. Currently, the containers can only see GPU 0, so any additional GPUs on the machine cannot be used. User-side fix: change count in the CUDA section of docker-compose.
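
For reference, a GPU reservation in a Compose file generally looks like the fragment below. This is a sketch based on the Compose specification, not a copy of Modyn's docker-compose file: the service name `my_gpu_service` is hypothetical, and the surrounding keys in Modyn's file may differ.

```yaml
services:
  my_gpu_service:            # hypothetical service name
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all     # or an integer, e.g. 2, to expose that many GPUs
              capabilities: [gpu]
```

With `count: 1` only a single GPU is exposed to the container. The Compose specification also allows `device_ids` in place of `count` to select specific GPUs; the two options cannot be combined.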