RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] Multi-GPU problem #1684

Open fenneccat opened 1 year ago

fenneccat commented 1 year ago

Describe the bug: Hello, I tried to use the multi-GPU setting in RecBole, but it doesn't work. It returns the error below:

(Screenshot, 2023-03-08: the error traceback.)

To Reproduce: Steps to reproduce the behavior:

  1. Extra YAML file: I only changed the first two lines in the overall.yaml file.

     Environment Settings

     gpu_id: '0,1,2,3' # (str) The id of GPU device(s).
     worker: 4 # (int) The number of workers processing the data.
     use_gpu: True # (bool) Whether or not to use GPU.
     seed: 2020 # (int) Random seed.
     state: INFO # (str) Logging level.
     reproducibility: True # (bool) Whether or not to make results reproducible.
     data_path: 'dataset/' # (str) The path of input dataset.
     checkpoint_dir: 'saved' # (str) The path to save checkpoint file.
     show_progress: True # (bool) Whether or not to show the progress bar of every epoch.
     save_dataset: False # (bool) Whether or not to save the filtered dataset.
     dataset_save_path: ~ # (str) The path of saved dataset.
     save_dataloaders: False # (bool) Whether or not to save split dataloaders.
     dataloaders_save_path: ~ # (str) The path of saved dataloaders.
     log_wandb: False # (bool) Whether or not to use Weights & Biases (W&B).
     wandb_project: 'recbole' # (str) The project to conduct experiments in W&B.
     shuffle: True # (bool) Whether or not to shuffle the training data before each epoch.

     Training Settings

     epochs: 300 # (int) The number of training epochs.
     train_batch_size: 2048 # (int) The training batch size.
     learner: adam # (str) The name of the optimizer used.
     learning_rate: 0.001 # (float) Learning rate.
     train_neg_sample_args: # (dict) Negative sampling configuration for model training.
       distribution: uniform # (str) The distribution of negative items.
       sample_num: 1 # (int) The sampled number of negative items.
       alpha: 1.0 # (float) The power of sampling probability for popularity distribution.
       dynamic: False # (bool) Whether to use dynamic negative sampling.
       candidate_num: 0 # (int) The number of candidate negative items for dynamic negative sampling.
     eval_step: 1 # (int) The number of training epochs before an evaluation on the valid dataset.
     stopping_step: 10 # (int) The threshold for validation-based early stopping.
     clip_grad_norm: ~ # (dict) The args of clip_grad_norm, which will clip the gradient norm of the model.
     weight_decay: 0.0 # (float) The weight decay value (L2 penalty) for optimizers.
     loss_decimal_place: 4 # (int) The decimal place of training loss.
     require_pow: False # (bool) Whether or not to perform the power operation in EmbLoss.
     enable_amp: False # (bool) Whether or not to use mixed precision training.
     enable_scaler: False # (bool) Whether or not to use GradScaler in mixed precision training.
     transform: ~ # (str) The transform operation for batch data processing.

     Evaluation Settings

     eval_args: # (dict) 4 keys: group_by, order, split, and mode
       split: {'RS': [0.8, 0.1, 0.1]} # (dict) The splitting strategy ranging in ['RS','LS'].
       group_by: user # (str) The grouping strategy ranging in ['user', 'none'].
       order: RO # (str) The ordering strategy ranging in ['RO', 'TO'].
       mode: full # (str) The evaluation mode ranging in ['full','unixxx','popxxx','labeled'].
     repeatable: False # (bool) Whether to evaluate results with a repeatable recommendation scene.
     metrics: ["Recall","MRR","NDCG","Hit","Precision"] # (list or str) Evaluation metrics.
     topk: [10] # (list or int or None) The value of k for top-k evaluation metrics.
     valid_metric: MRR@10 # (str) The evaluation metric for early stopping.
     valid_metric_bigger: True # (bool) Whether to take a bigger valid metric value as a better result.
     eval_batch_size: 4096 # (int) The evaluation batch size.
     metric_decimal_place: 4 # (int) The decimal place of metric scores.

  2. Your code
  3. Script for running: python run_recbole.py --nproc=4 (a programmatic sketch of this setup follows below)
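
For reference, the same setup can be expressed through RecBole's quick-start API. This is only a minimal sketch: the model and dataset names are placeholders chosen for illustration, and calling run_recbole directly starts a single process (the multi-process launch comes from run_recbole.py's --nproc flag), so the dict below merely mirrors the two changed YAML lines.

    from recbole.quick_start import run_recbole

    # Mirrors the two lines changed in overall.yaml above.
    # "BPR" and "ml-100k" are arbitrary placeholder choices.
    run_recbole(
        model="BPR",
        dataset="ml-100k",
        config_dict={
            "gpu_id": "0,1,2,3",  # (str) the GPU devices to use
            "worker": 4,          # (int) data-loading workers; the setting under suspicion
        },
    )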

Expected behavior: I wanted to test distributed GPU usage. I have a single node with 4 GPUs. When I ran the command python run_recbole.py --nproc=4, it returned the error: can't pickle torch._C.Generator objects.

Can you give me a guide on how to run the code with a multi-GPU setting?

leoleojie commented 1 year ago

@fenneccat Thanks for your attention to RecBole. I think the problem may be caused by an incorrect setting of the worker parameter. It is the number of workers processing the data, and you should set it to 1. When you use multi-GPU in RecBole, there are already four training processes; with worker: 4, each of them would spawn four data-loading workers again (4 x 4 = 16 processes in total). This may be causing your problem.

fenneccat commented 1 year ago

I still can't use the multi-GPU option even though I've changed the worker parameter. Can you give me the config file and the command to run the code with a sample dataset? Also, should I change the overall.yaml file? It would be helpful to know how to run with the multi-GPU option; the ml-100k dataset or Amazon_Beauty would be a nice example to understand. Thank you.

ChenglongMa commented 1 year ago

Hi @fenneccat, I got the same error and fixed it by setting worker back to 0 in overall.yaml according to @leoleojie's suggestion.

I ran my code on one server with 3 GPUs.

Firstly, I made the following change in overall.yaml:

gpu_id: '0,1,2'

Then, I tested the BPR model on the default ml-100k dataset by running:

python run_recbole.py --nproc 3

The code worked as expected and no errors were reported.

Hope this can help you. Thanks.
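
For completeness, here is the same working recipe through the quick-start API, under the same assumptions as the sketch earlier in this thread (placeholder model and dataset; the multi-process spawning itself is still handled by running python run_recbole.py --nproc 3):

    from recbole.quick_start import run_recbole

    # Working configuration from this comment: three visible GPUs and
    # main-process data loading (worker: 0), mirroring the overall.yaml edit.
    run_recbole(
        model="BPR",
        dataset="ml-100k",
        config_dict={
            "gpu_id": "0,1,2",  # (str) the GPU devices to use
            "worker": 0,        # (int) load data in the main process
        },
    )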

Baxkiller commented 1 year ago

(quoting @ChenglongMa's reply above in full)

Hello @ChenglongMa, thanks for your reply. I want to confirm: did you set the worker parameter to 0 instead of the 1 suggested by @leoleojie? Looking forward to your reply, thank you.

ChenglongMa commented 1 year ago

Hi @Baxkiller,

I re-tested my code and found that 0 works as expected, but 1 raises the same error as above.

From https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading, we can see

0 means that the data will be loaded in the main process.

I guess this is required by the spawn start method used for multi-GPU training.
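
To illustrate the mechanism, here is a hypothetical, RecBole-independent toy example (my own sketch, not RecBole code). Under the spawn start method, everything handed to a child process must be pickled, and on the PyTorch versions reported in this thread a torch.Generator is not picklable. A DataLoader whose dataset carries such a generator therefore fails with num_workers > 0 under spawn, but works with num_workers = 0, where loading stays in the main process:

    import torch
    from torch.utils.data import DataLoader, Dataset

    class ToyDataset(Dataset):
        # Toy stand-in for a dataset/sampler that carries RNG state.
        def __init__(self):
            self.generator = torch.Generator().manual_seed(2020)  # unpicklable

        def __len__(self):
            return 8

        def __getitem__(self, idx):
            return torch.rand(1, generator=self.generator)

    if __name__ == "__main__":
        ds = ToyDataset()

        # num_workers=0: loading happens in the main process, nothing is pickled.
        print(next(iter(DataLoader(ds, num_workers=0))))

        # num_workers>0 with the spawn context: the dataset is pickled to each
        # worker, which fails with "can't pickle torch._C.Generator objects".
        try:
            next(iter(DataLoader(ds, num_workers=2, multiprocessing_context="spawn")))
        except Exception as e:
            print(type(e).__name__, e)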

Thanks!