Open fenneccat opened 1 year ago
@fenneccat Thanks for your attention to RecBole.
I think the problem may be caused by an incorrect setting of the worker parameter. It is the number of workers processing the data, and you should set it to 1.
When you use multi-GPU in RecBole, there are already four processes. With worker: 4, each process would be split into four processes again. This may be causing your problem.
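As a rough illustration of the process count described above, here is a minimal sketch. It assumes each training process keeps one active data loader that starts `worker` extra worker processes; the function name `total_processes` is hypothetical and not part of RecBole.

```python
def total_processes(nproc: int, worker: int) -> int:
    """Estimate the number of OS processes for a multi-GPU run.

    Assumption: every one of the nproc training processes starts
    `worker` additional data-loading processes of its own.
    """
    if nproc < 1 or worker < 0:
        raise ValueError("nproc must be >= 1 and worker must be >= 0")
    return nproc * (1 + worker)


print(total_processes(4, 4))  # 20: four trainers, each with four loader workers
print(total_processes(4, 0))  # 4: data is loaded inside each training process
```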
I still can't use the multi-GPU option even though I've changed the worker parameter. Can you give me the config file and the command to run the code with a sample dataset? Also, should I change the overall.yaml file? It would be helpful to know how to run the multi-GPU option. The ml-100k or Amazon_Beauty dataset would be a nice example to understand it. Thank you.
Hi @fenneccat,
I got the same error and fixed it by setting worker back to 0 in overall.yaml according to @leoleojie's suggestion.
I ran my code on one server with 3 GPUs.
First, I made the following change in overall.yaml:
gpu_id: '0,1,2'
Then I tested the BPR model on the default ml-100k dataset by running:
python run_recbole.py --nproc 3
The code worked as expected and no errors were reported.
Hope this can help you. Thanks.
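The steps in this reply can be sketched as a small helper. The reply edits overall.yaml directly; the sketch instead writes a separate override file and passes it with `--config_files`, which is an assumption about run_recbole.py that should be checked against your RecBole version. The names `build_multi_gpu_run` and `multi_gpu.yaml` are hypothetical. The helper only builds the command; it does not launch anything.

```python
from pathlib import Path


def build_multi_gpu_run(gpu_ids, config_path="multi_gpu.yaml"):
    """Write a two-line override config and return the launch command as a list."""
    ids = ",".join(str(g) for g in gpu_ids)
    lines = [
        f"gpu_id: '{ids}'",  # same form as the overall.yaml change in the reply
        "worker: 0",         # load data in the main process of each trainer
    ]
    Path(config_path).write_text("\n".join(lines) + "\n")
    # --config_files is an assumed flag; --nproc is the one used in this thread.
    return [
        "python", "run_recbole.py",
        "--config_files", str(config_path),
        "--nproc", str(len(gpu_ids)),
    ]
```

Calling `build_multi_gpu_run([0, 1, 2])` writes the override file and returns a command with `--nproc 3`, one process per listed GPU.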
Hello @ChenglongMa, thanks for your reply.
I want to confirm with you: did you set the worker parameter to 0 instead of the 1 suggested by @leoleojie?
Looking forward to your reply, thank you.
Hi @Baxkiller,
I re-tested my code and found that 0 works as expected, but 1 raises the same error as above.
From https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading, we can see that 0 means the data will be loaded in the main process.
I guess this is required by spawn.
Thanks!
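One way to picture the guess above: with the spawn start method, objects handed to a new process are serialized with pickle, so an object holding an unpicklable member fails before the worker starts. The sketch below is an analogy using only the standard library. A threading.Lock stands in for torch._C.Generator, and the class `HoldsHandle` is hypothetical. It does not reproduce RecBole's actual code path.

```python
import pickle
import threading


class HoldsHandle:
    """Hypothetical stand-in for an object that owns an unpicklable resource."""

    def __init__(self):
        # The lock plays the role that torch._C.Generator plays in the error above.
        self.handle = threading.Lock()


def can_pickle(obj) -> bool:
    """Return True if obj survives pickle.dumps, as spawn-style workers require."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


print(can_pickle({"worker": 0}))   # True: plain data pickles fine
print(can_pickle(HoldsHandle()))   # False: the lock cannot be pickled
```

With worker set to 0 no data-loading worker process is started, so under this reading nothing on the data-loading side needs to be pickled.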
Describe the bug
Hello, I tried to use the multi-GPU setting in RecBole, but it doesn't work. It returns an error like below.
To Reproduce
Steps to reproduce the behavior:
Environment Settings
gpu_id: '0,1,2,3'               # (str) The id of GPU device(s).
worker: 4                       # (int) The number of workers processing the data.
use_gpu: True                   # (bool) Whether or not to use GPU.
seed: 2020                      # (int) Random seed.
state: INFO                     # (str) Logging level.
reproducibility: True           # (bool) Whether or not to make results reproducible.
data_path: 'dataset/'           # (str) The path of input dataset.
checkpoint_dir: 'saved'         # (str) The path to save checkpoint file.
show_progress: True             # (bool) Whether or not to show the progress bar of every epoch.
save_dataset: False             # (bool) Whether or not to save filtered dataset.
dataset_save_path: ~            # (str) The path of saved dataset.
save_dataloaders: False         # (bool) Whether or not save split dataloaders.
dataloaders_save_path: ~        # (str) The path of saved dataloaders.
log_wandb: False                # (bool) Whether or not to use Weights & Biases(W&B).
wandb_project: 'recbole'        # (str) The project to conduct experiments in W&B.
shuffle: True                   # (bool) Whether or not to shuffle the training data before each epoch.
Training Settings
epochs: 300                     # (int) The number of training epochs.
train_batch_size: 2048          # (int) The training batch size.
learner: adam                   # (str) The name of used optimizer.
learning_rate: 0.001            # (float) Learning rate.
train_neg_sample_args:          # (dict) Negative sampling configuration for model training.
  distribution: uniform         # (str) The distribution of negative items.
  sample_num: 1                 # (int) The sampled num of negative items.
  alpha: 1.0                    # (float) The power of sampling probability for popularity distribution.
  dynamic: False                # (bool) Whether to use dynamic negative sampling.
  candidate_num: 0              # (int) The number of candidate negative items when dynamic negative sampling.
eval_step: 1                    # (int) The number of training epochs before an evaluation on the valid dataset.
stopping_step: 10               # (int) The threshold for validation-based early stopping.
clip_grad_norm: ~               # (dict) The args of clip_grad_norm_ which will clip gradient norm of model.
weight_decay: 0.0               # (float) The weight decay value (L2 penalty) for optimizers.
loss_decimal_place: 4           # (int) The decimal place of training loss.
require_pow: False              # (bool) Whether or not to perform power operation in EmbLoss.
enable_amp: False               # (bool) Whether or not to use mixed precision training.
enable_scaler: False            # (bool) Whether or not to use GradScaler in mixed precision training.
transform: ~                    # (str) The transform operation for batch data process.
Evaluation Settings
eval_args:                      # (dict) 4 keys: group_by, order, split, and mode
  split: {'RS':[0.8,0.1,0.1]}   # (dict) The splitting strategy ranging in ['RS','LS'].
  group_by: user                # (str) The grouping strategy ranging in ['user', 'none'].
  order: RO                     # (str) The ordering strategy ranging in ['RO', 'TO'].
  mode: full                    # (str) The evaluation mode ranging in ['full','unixxx','popxxx','labeled'].
repeatable: False               # (bool) Whether to evaluate results with a repeatable recommendation scene.
metrics: ["Recall","MRR","NDCG","Hit","Precision"]  # (list or str) Evaluation metrics.
topk: [10]                      # (list or int or None) The value of k for topk evaluation metrics.
valid_metric: MRR@10            # (str) The evaluation metric for early stopping.
valid_metric_bigger: True       # (bool) Whether to take a bigger valid metric value as a better result.
eval_batch_size: 4096           # (int) The evaluation batch size.
metric_decimal_place: 4         # (int) The decimal place of metric scores.
python run_recbole.py --nproc=4
Expected behavior
I wanted to test distributed GPU usage. I have a single node with 4 GPUs. When I ran the command:
python run_recbole.py --nproc=4
it returned the error: can't pickle torch._C.Generator objects
Can you give me a guide on how to run the code with the multi-GPU setting?