YifeiZhou02 / ArCHer

Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL"
https://yifeizhou02.github.io/archer.io/

Speed up the training with Mistral 7B #10

Open Bohemianc opened 1 month ago

Bohemianc commented 1 month ago

Hello,

I am currently training ArCHer with Mistral 7B on Twenty Questions using 32GB V100 GPUs, but it's taking longer than expected. Could you share any advice on parameter settings that might speed up the training, even at the expense of accuracy? Also, I am interested in the type of GPUs used by the authors.

YifeiZhou02 commented 1 month ago

Hi,

Thanks for your interest in our work. Our experiments on Mistral 7B are carried out using 2x80GB A100.

Could you identify which part of the training is the speed bottleneck? If collecting online trajectories is the bottleneck, one possibility is to collect trajectories with multiple threads in parallel. In the current implementation, I believe data collection is done in a single thread in the main process.
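As a rough sketch (this is not code from the repo; collect_trajectory and envs stand in for whatever rollout function and environment copies you actually use), parallel collection could look like:

from concurrent.futures import ThreadPoolExecutor

def collect_trajectories_parallel(collect_trajectory, envs, num_workers=8):
    # Run one rollout per environment copy in its own thread and gather the results.
    # collect_trajectory(env) is assumed to return one complete trajectory.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        trajectories = list(pool.map(collect_trajectory, envs))
    return trajectories

Since rollouts spend most of their time waiting on model generation or the environment, threads can overlap that latency even with the GIL.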

Bohemianc commented 1 month ago

Hi there,

Thanks for the suggestion on parallel data collection.

By the way, I've noticed a potential issue with the timeout parameter here: it doesn't seem to be set correctly, so it can fall back to the 10-minute default and cause timeout errors on slower hardware.

Here's the corrected line:

accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=1800))])

This should set the NCCL timeout as intended. Hope this helps.
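For completeness, here is the context that line assumes, with the imports it needs (standard datetime and Accelerate usage, nothing repo-specific):

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Pass the timeout through InitProcessGroupKwargs so the distributed process group
# waits 30 minutes instead of the shorter default before raising a timeout error.
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=1800))])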

ggbondcxl commented 1 month ago


Could you let me know how long it would take to complete the training with two A100 80G GPUs in the current setup?

ggbondcxl commented 1 month ago

Also, after I downloaded the code yesterday, it reported a problem with the tokenizer. I wonder if it's due to an issue with the AutoTokenizer. [Screenshot attached: 2024-07-25 15:35:02]

YifeiZhou02 commented 1 month ago

Thanks for your interest. The result in the paper was obtained with 2 days of training on 2x A100 80G. Could you share the error message from AutoTokenizer? It was working fine when I reproduced the result 6 months ago.
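If it helps to narrow it down, a minimal check would be to load the tokenizer on its own (mistralai/Mistral-7B-v0.1 here is only a guess at the checkpoint; substitute whatever your config points to):

from transformers import AutoTokenizer

# Hypothetical model ID -- replace with the checkpoint named in your config.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(tokenizer("Is it an animal?"))  # should print input_ids / attention_mask without errors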

Also, thanks for correcting the code that sets the process group timeout!