RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License
3.33k stars · 601 forks

[🐛BUG] Dataset save_path mismatches load_path for sequential dataset #1697

Closed ShadowTinker closed 1 year ago

ShadowTinker commented 1 year ago

Describe the bug I set save_dataset=True in recbole/properties/overall.yaml but find the model re-processes the dataset every time.

To Reproduce Steps to reproduce the behavior:

  1. set save_dataset=True in recbole/properties/overall.yaml
  2. execute python run_recbole.py --model=DIN --dataset=ml-100k
  3. results
    • save_path: path_prefix/ml-100k-dataset.pth
    • load_path: path_prefix/ml-100k-SequentialDataset.pth
  4. fix: I changed the following code https://github.com/RUCAIBox/RecBole/blob/edf2e6cb8190241b6bb16d72a2ad8f2ff004845f/recbole/data/dataset/dataset.py#L1813 to
    file = os.path.join(save_dir, f'{self.config["dataset"]}-{self.__class__.__name__}.pth')

    This works for DIN, AFM, BPR and SASRec.
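To illustrate the mismatch, here is a minimal sketch of the fixed filename construction (the helper function and the standalone call are illustrative, not RecBole's actual method signature):

```python
import os

def dataset_file(save_dir, dataset_name, class_name):
    # Before the fix, the save path used only the dataset name
    # ('ml-100k-dataset.pth'), while the load path looked for
    # '<dataset>-<ClassName>.pth'. Including the class name on the
    # save side makes both paths agree for every Dataset subclass.
    return os.path.join(save_dir, f"{dataset_name}-{class_name}.pth")

# For a sequential model such as DIN, both save and load now resolve to:
print(dataset_file("saved", "ml-100k", "SequentialDataset"))
```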


ShadowTinker commented 1 year ago

Besides, I also found some bugs when loading saved dataloaders with save_dataloaders=True in the config file. In the following code, eval_neg_sample_args no longer exists in the config; only valid_neg_sample_args and test_neg_sample_args do: https://github.com/RUCAIBox/RecBole/blob/edf2e6cb8190241b6bb16d72a2ad8f2ff004845f/recbole/data/dataloader/general_dataloader.py#L140-L144 . This results in a KeyError at line 139 via the following code: https://github.com/RUCAIBox/RecBole/blob/edf2e6cb8190241b6bb16d72a2ad8f2ff004845f/recbole/data/dataloader/abstract_dataloader.py#L132-L142 . I opened pull request #1698 to fix the two problems mentioned above.
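A phase-aware lookup avoids this kind of KeyError. The following is a hypothetical sketch (the `get_neg_sample_args` helper is illustrative; only the config key names come from the issue):

```python
def get_neg_sample_args(config, phase):
    # The config carries 'valid_neg_sample_args' and 'test_neg_sample_args'
    # rather than a single 'eval_neg_sample_args', so the key must be
    # derived from the evaluation phase instead of being hard-coded.
    key = f"{phase}_neg_sample_args"
    if key not in config:
        raise KeyError(f"config is missing '{key}'")
    return config[key]

config = {
    "valid_neg_sample_args": {"distribution": "uniform"},
    "test_neg_sample_args": {"distribution": "uniform"},
}
print(get_neg_sample_args(config, "valid"))
```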

Paitesanshi commented 1 year ago

Hello @ShadowTinker,

Thank you for your attention and contributions to RecBole! You are correct that there are some bugs. To improve the code structure and simplify the changes, I have made some minor modifications. Once you have reviewed the modifications in https://github.com/RUCAIBox/RecBole/pull/1698, the updates can be merged into the main code.

Thank you again for your support to our team!

ShadowTinker commented 1 year ago

Hello @Paitesanshi,

Thank you for your reply! However, I have just found another problem: the saved dataloader is still loaded when train_batch_size or eval_batch_size is modified, so the batch size is not updated accordingly. This is caused by the following lines: https://github.com/RUCAIBox/RecBole/blob/edf2e6cb8190241b6bb16d72a2ad8f2ff004845f/recbole/data/utils.py#L130-L132 where train_batch_size and eval_batch_size are not checked.

But I'm not sure whether this is by design, since these arguments are usually fixed. So I wonder whether the two arguments should be added to the check in the lines above.
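For illustration, a cache-validity check that also compares the batch sizes could look like the following sketch (the `dataloaders_match` function and the exact set of keys are hypothetical; the real check lives in recbole/data/utils.py):

```python
def dataloaders_match(saved_config, current_config,
                      keys=("seed", "repeatable", "eval_args",
                            "train_batch_size", "eval_batch_size")):
    # A cached dataloader is reusable only if every checked key matches;
    # including the batch-size keys forces a rebuild when they change.
    return all(saved_config.get(k) == current_config.get(k) for k in keys)

saved = {"seed": 2020, "train_batch_size": 2048, "eval_batch_size": 4096}
current = dict(saved, train_batch_size=1024)
print(dataloaders_match(saved, current))  # False: batch size changed
```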

Paitesanshi commented 1 year ago

@ShadowTinker We think that loading an existing dataloader implies using a fixed batch_size. We will consider whether to modify this behavior after collecting more user feedback. Thanks again for your suggestion!

ShadowTinker commented 1 year ago

@Paitesanshi Thank you for the explanation. I've reviewed the modifications in #1698, and it will be my honor to contribute to RecBole. Thank you again for the great repo and your contribution to the community.

christopheralex commented 10 months ago

This fails again when run with multiprocessing on 4 GPUs on 1 node. One process writes to disk while another process sees that the file exists and tries to read it. Even if that race doesn't occur, since 4 processes open the pickle file with mode "wb" and write to it concurrently, the file may be corrupt when it is later loaded for inference.
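One common way to avoid this race is to let only one process (rank 0) write, and to make the write atomic via a temp file plus rename. This is a sketch of the pattern, not RecBole's actual save routine (`save_once` and its arguments are hypothetical):

```python
import os
import pickle
import tempfile

def save_once(obj, path, rank):
    # Only rank 0 writes; the other workers skip the save entirely, so
    # no two processes ever hold the same file open with mode "wb".
    if rank != 0:
        return
    # Write to a temp file in the same directory, then rename. os.replace
    # is atomic, so a concurrent reader either sees the old complete file
    # or the new complete file, never a half-written pickle.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(obj, f)
    os.replace(tmp, path)
```

In a torch.distributed setting, a barrier after the rank-0 save would additionally ensure the other ranks do not try to load the file before it exists.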