ZhangYuanhan-AI / NOAH

[TPAMI] Searching prompt modules for parameter-efficient transfer learning.
MIT License
225 stars 11 forks

About the slurm setting #5

Closed fanq15 closed 2 years ago

fanq15 commented 2 years ago

Thanks for your great work! It seems that all the experiments require slurm on multiple machines, right?

ZhangYuanhan-AI commented 2 years ago

Hi,

Thanks for your interest in our work!

You can easily choose your launcher in the shell script and then run NOAH.

fanq15 commented 2 years ago

Thanks for your quick reply! However, I found that even if I use the pytorch launcher, the code still fails with the following error:

srun: error: Unable to resolve "node0": Unknown host
srun: error: Unable to establish control machine address
srun: error: Unable to allocate resources: No error

ZhangYuanhan-AI commented 2 years ago

Hi,

Thank you for the catch.

  1. We will look into this problem ASAP.
  2. Since all our experiments require only one GPU, you can use "none" as your launcher; it works.
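In the launch script, that choice amounts to a branch like the following (a minimal sketch; `train.py` and the flag names are illustrative placeholders, not necessarily the repo's exact ones):

```shell
# Minimal sketch of launcher selection in a launch script.
# NOTE: "train.py" and the flags below are illustrative placeholders.
build_cmd() {
    launcher="$1"
    if [ "$launcher" = "slurm" ]; then
        # Cluster path: srun only works on a slurm-managed machine.
        echo "srun --gres=gpu:1 python train.py --launcher slurm"
    else
        # Single-machine path: call python directly, no srun involved.
        echo "python train.py --launcher none"
    fi
}

build_cmd none   # prints: python train.py --launcher none
```

Wrapping the command construction in a function keeps the slurm and single-machine paths side by side, so switching launchers is a one-word change.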
fanq15 commented 2 years ago

Thanks! It seems that is because there is always an srun call in the command.

ZhangYuanhan-AI commented 2 years ago

Right, srun is only used with slurm.

fanq15 commented 2 years ago

If I change the launcher from slurm to none, what should I do about the srun command?

https://github.com/Davidzhangyuanhan/NOAH/blob/0b85ebe3641788eb8514897fec7319eafa8264e4/configs/Adapter/VTAB/slurm_train_adapter_vtab.sh#L26

ZhangYuanhan-AI commented 2 years ago

> If I change the launcher from slurm to none, what should I do about the srun command?
>
> https://github.com/Davidzhangyuanhan/NOAH/blob/0b85ebe3641788eb8514897fec7319eafa8264e4/configs/Adapter/VTAB/slurm_train_adapter_vtab.sh#L26

Just delete lines 26-33 directly.
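That edit can be scripted with `sed`, demonstrated here on a stand-in file (in the repo you would run the same command on the `slurm_train_adapter_vtab.sh` path linked above; `-i.bak` keeps a backup of the original):

```shell
# Mirror of the "delete lines 26-33" fix, demonstrated on a stand-in file.
# In the repo, point SCRIPT at configs/Adapter/VTAB/slurm_train_adapter_vtab.sh instead.
SCRIPT=/tmp/slurm_train_adapter_vtab_demo.sh
seq 1 40 > "$SCRIPT"            # stand-in: a 40-line script
sed -i.bak '26,33d' "$SCRIPT"   # drop lines 26-33, keep a .bak backup
wc -l < "$SCRIPT"               # 32 lines remain
```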

fanq15 commented 2 years ago

Thanks! It works! I'd suggest making single-machine training the default. After all, most researchers are not rich enough to have multiple machines for training, and many are unfamiliar with the slurm setup.

ZhangYuanhan-AI commented 2 years ago

> Thanks! It works! I'd suggest making single-machine training the default. After all, most researchers are not rich enough to have multiple machines for training, and many are unfamiliar with the slurm setup.

Nice suggestion!

We will add single-machine training code as you recommend.

fanq15 commented 2 years ago

Another question: when I use the none launcher, the experiments on different datasets run sequentially on one GPU. Is it possible to use multiple GPUs with the none launcher?

fanq15 commented 2 years ago

The training set is very small, so it seems we do not need multi-GPU training anyway.

ZhangYuanhan-AI commented 2 years ago

Indeed.
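(For completeness: since the per-dataset runs are independent, they can still be spread across GPUs under the none launcher by pinning each process to one device. A minimal dry-run sketch; `train.py` and the dataset names are illustrative placeholders, and dropping the `echo` would actually launch the runs:)

```shell
# Sketch: launch independent dataset runs in parallel, one per GPU, by
# pinning each process to a device with CUDA_VISIBLE_DEVICES.
# NOTE: train.py and the dataset list are illustrative placeholders;
# remove the "echo" to launch for real.
i=0
for DATASET in cifar100 caltech101 dtd; do
    echo CUDA_VISIBLE_DEVICES=$i python train.py --dataset "$DATASET" --launcher none "&"
    i=$((i + 1))
done
# wait   # after launching for real, block until all background runs finish
```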

fanq15 commented 2 years ago

Thanks!

ZhangYuanhan-AI commented 2 years ago

Enjoy NOAH!