RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[Issue] Excessive RAM Usage in EASE Model Training #1974

Open · Escape142 opened this issue 5 months ago

Escape142 commented 5 months ago

Problem description
When training the EASE model on large datasets such as Gowalla and Amazon_TV, we encounter extremely high RAM consumption, exceeding 1.5 terabytes. Is this level of RAM usage expected or normal for such an operation?
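For context, a rough sketch of how such a run can be launched through RecBole's quick-start entry point. The data layout under data/tmp and the trimmed-down config dict are assumptions, not part of the original report, and the pre-split benchmark files from the full configuration below are omitted:

```python
# Rough reproduction sketch, not taken from the original report.
# Assumes the Gowalla atomic files sit under data/tmp/gowalla and omits the
# benchmark_filename pre-split files listed in the full configuration below.
from recbole.quick_start import run_recbole

config_dict = {
    "data_path": "data/tmp",  # RecBole appends the dataset name -> data/tmp/gowalla
    "load_col": {"inter": ["user_id", "item_id", "rating"]},
    "reg_weight": 250.0,
    "metrics": ["Recall", "MRR", "NDCG", "Hit", "Precision"],
    "topk": [10],
    "valid_metric": "MRR@10",
}

# The RAM spike happens while EASE is being fitted, before any evaluation runs.
run_recbole(model="EASE", dataset="gowalla", config_dict=config_dict)
```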

My configuration:

General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = data/tmp/gowalla
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 300
train_batch_size = 1024
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [10]
valid_metric = MRR@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separator =
seq_separator =
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = None
LABEL_FIELD = label
threshold = None
NEG_PREFIX = neg_
load_col = {'inter': ['user_id', 'item_id', 'rating']}
unload_col = None
unused_col = None
additional_feat_suffix = None
rm_dup_inter = None
val_interval = None
filter_inter_by_user_or_item = True
user_inter_num_interval = [0,inf)
item_inter_num_interval = [0,inf)
alias_of_user_id = None
alias_of_item_id = None
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = None
normalize_field = None
normalize_all = None
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id
benchmark_filename = ['train', 'test', 'test']

Other Hyper Parameters:
worker = 0
wandb_project = recbole
shuffle = True
require_pow = False
enable_amp = False
enable_scaler = False
transform = None
reg_weight = 250.0
numerical_features = []
discretization = None
kg_reverse_r = False
entity_kg_num_interval = [0,inf)
relation_kg_num_interval = [0,inf)
MODEL_TYPE = ModelType.TRADITIONAL
neg_sampling = None
MODEL_INPUT_TYPE = InputType.POINTWISE
eval_type = EvaluatorType.RANKING
single_spec = True
local_rank = 0
device = cuda
valid_neg_sample_args = {'distribution': 'uniform', 'sample_num': 'none'}
test_neg_sample_args = {'distribution': 'uniform', 'sample_num': 'none'}
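Given the configuration above, a quick back-of-the-envelope check of the item count can show whether a terabyte-scale footprint is plausible before launching training. A minimal sketch, assuming the atomic file lives at data/tmp/gowalla/gowalla.inter with RecBole's usual tab separator and name:type column headers:

```python
# Back-of-the-envelope estimate, not part of RecBole: count distinct items in
# the .inter file and size one dense item-item matrix in float64.
# File path, tab separator and the `item_id:token` header are assumptions
# based on RecBole's atomic-file conventions.
import pandas as pd

def dense_item_matrix_gib(n_items: int, bytes_per_value: int = 8) -> float:
    """GiB needed by a single dense n_items x n_items float64 matrix."""
    return n_items * n_items * bytes_per_value / 2**30

inter = pd.read_csv("data/tmp/gowalla/gowalla.inter", sep="\t",
                    usecols=["item_id:token"])
n_items = inter["item_id:token"].nunique()
print(f"{n_items} items -> ~{dense_item_matrix_gib(n_items):.0f} GiB per dense item-item matrix")
```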


zhengbw0324 commented 5 months ago

@Escape142 Hello! This may be normal. EASE is not a deep neural network trained by gradient descent; its closed-form solution builds and inverts a dense item-item matrix, so when the data scale is large it can occupy a very large amount of memory.
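For anyone hitting the same issue: EASE (Steck, "Embarrassingly Shallow Autoencoders", 2019) solves for its item-item weights in closed form by inverting a dense Gram matrix, so peak memory grows with the square of the number of items rather than with the number of interactions. A generic sketch of that computation, not RecBole's exact code, to show where the memory goes:

```python
# Generic EASE closed-form fit (Steck, 2019), sketched to show the memory
# profile; this is not RecBole's implementation. X is the user-item matrix.
import numpy as np
import scipy.sparse as sp

def fit_ease(X: sp.csr_matrix, reg_weight: float = 250.0) -> np.ndarray:
    """Closed-form EASE fit; peak memory is a few dense n_items x n_items arrays."""
    n_items = X.shape[1]
    # Dense Gram matrix: n_items^2 * 8 bytes in float64, e.g. 400_000 items
    # already needs ~1.16 TiB for this single array.
    G = np.asarray((X.T @ X).todense(), dtype=np.float64)
    G[np.diag_indices(n_items)] += reg_weight
    P = np.linalg.inv(G)               # a second dense array of the same size
    B = P / (-np.diag(P))              # B[i, j] = -P[i, j] / P[j, j]
    B[np.diag_indices(n_items)] = 0.0  # EASE constrains the diagonal to zero
    return B
```

Note that the reported config leaves user_inter_num_interval and item_inter_num_interval at [0,inf), so no k-core filtering is applied; tightening those intervals shrinks the item vocabulary and, with it, the dense matrices quadratically.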