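[💡SUG] Our team urgently needs two datasets for our research. Could you add preprocessing scripts for these two datasets? We would be very grateful if you could.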

w764567792 commented 2 years ago

hyp1231 commented 2 years ago

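Hello, an atomic file is essentially a tsv-like text file whose column names are annotated with each column's type. I suggest referring to the atomic file format and to the datasets you have already downloaded (they open in any text editor) and trying the conversion yourself; it is quite easy. You are also welcome to submit a PR to the main RecBole project once you have generated the atomic files. Thank you very much for your contribution.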
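
As a rough illustration of the format described above, the sketch below writes a small interaction table as an atomic file using only Python's standard library. The field names (`user_id`, `item_id`, `rating`), the sample rows, and the file name `toy.inter` are made up for illustration; only the `name:type` header convention and the tab separator come from the atomic file format.

```python
import csv
import os
import tempfile

def write_atomic_file(path, fields, rows):
    """Write rows to `path` as a tab-separated atomic file.

    `fields` is a list of (column_name, column_type) pairs,
    for example ("user_id", "token").
    """
    header = [f"{name}:{ftype}" for name, ftype in fields]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)

# Hypothetical example data, not taken from any real dataset.
fields = [("user_id", "token"), ("item_id", "token"), ("rating", "float")]
rows = [(1, 2923, 4.0), (1, 3647, 2.0)]

out_path = os.path.join(tempfile.mkdtemp(), "toy.inter")
write_atomic_file(out_path, fields, rows)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines[0])
```

Opening the resulting file in a text editor next to an already downloaded dataset is a quick way to compare the two headers.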

w764567792 commented 2 years ago

Hello, the Podcast dataset is in SQLite format. Do you have a way to convert it?

hyp1231 commented 2 years ago

Hello, I have not worked with SQLite data either. You could look up how to export a SQLite database to a text file.
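
One possible way to do such an export is with Python's built-in `sqlite3` module. The sketch below is only an assumption about how this could look: the table `reviews`, its columns, and the output header are hypothetical and would need to be replaced with whatever the real Podcast database contains.

```python
import csv
import os
import sqlite3
import tempfile

def export_query_to_tsv(db_path, query, header, out_path):
    """Run `query` on a SQLite database and write the rows as tab-separated text."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(query)
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerow(header)
            writer.writerows(cursor)  # the cursor yields one tuple per row
    finally:
        conn.close()

# Build a tiny throwaway database so the sketch runs on its own.
# The table and column names are hypothetical.
work_dir = tempfile.mkdtemp()
db_path = os.path.join(work_dir, "demo.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE reviews (author_id TEXT, podcast_id TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [("u1", "p1", 5.0), ("u2", "p1", 3.0)],
)
conn.commit()
conn.close()

out_path = os.path.join(work_dir, "demo.inter")
export_query_to_tsv(
    db_path,
    "SELECT author_id, podcast_id, rating FROM reviews ORDER BY author_id",
    ["user_id:token", "item_id:token", "rating:float"],
    out_path,
)

with open(out_path, encoding="utf-8") as f:
    exported = f.read().splitlines()
```

To find out which tables a real database contains, the query `SELECT name FROM sqlite_master WHERE type='table'` can be run first.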

w764567792 commented 2 years ago

OK, I will look into that later. One more thing: when I converted spotify into an atomic .inter file, the dataset information could be read, but I got the following error:

Traceback (most recent call last):
  File "run_recbole_gnn.py", line 15, in <module>
    run_recbole_gnn(model=args.model, dataset=args.dataset, config_file_list=config_file_list)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/recbole_gnn/quick_start.py", line 29, in run_recbole_gnn
    dataset = create_dataset(config)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/recbole_gnn/utils.py", line 56, in create_dataset
    dataset = dataset_class(config)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/recbole_gnn/data/dataset.py", line 16, in __init__
    super().__init__(config)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/RecBole/recbole/data/dataset/dataset.py", line 96, in __init__
    self._from_scratch()
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/RecBole/recbole/data/dataset/dataset.py", line 106, in _from_scratch
    self._load_data(self.dataset_name, self.dataset_path)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/RecBole/recbole/data/dataset/dataset.py", line 247, in _load_data
    self._load_inter_feat(token, dataset_path)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/RecBole/recbole/data/dataset/dataset.py", line 270, in _load_inter_feat
    inter_feat = self._load_feat(inter_feat_path, FeatureSource.INTERACTION)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/RecBole/recbole/data/dataset/dataset.py", line 416, in _load_feat
    field, ftype = field_type.split(':')
ValueError: too many values to unpack (expected 2)

My dataset file looks like this:

user_id:token song_id:token previous_song:float skipped:float
1 2923 -1 4
1 3647 2923 4
1 1941 3647 4
1 2844 1941 4
1 1062 2844 4
1 2708 1062 2
1 649 2708 4
1 4247 649 4
1 1843 4247 2
1 1083 1843 0
1 2729 1083 4

hyp1231 commented 2 years ago

Hello, by default the columns of atomic files are separated by \t. Please refer to atomic-file-format.
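
To make the separator issue concrete, here is a minimal sketch that mimics the header parsing seen in the traceback (`field_type.split(':')`) and shows a simple way to re-join space-separated columns with tabs. The helper `spaces_to_tabs` is hypothetical and not part of RecBole, and it is only safe when no field value itself contains whitespace.

```python
def spaces_to_tabs(text):
    """Re-join whitespace-separated columns with tabs, line by line."""
    return "\n".join(
        "\t".join(line.split()) for line in text.splitlines() if line.strip()
    )

space_separated = (
    "user_id:token song_id:token previous_song:float skipped:float\n"
    "1 2923 -1 4\n"
    "1 3647 2923 4\n"
)

# With spaces, splitting the header on "\t" yields a single column that
# contains several ':' characters, so unpacking into (field, ftype) fails.
bad_columns = space_separated.splitlines()[0].split("\t")
parts = bad_columns[0].split(":")

# After conversion, every header column splits into exactly two parts.
fixed = spaces_to_tabs(space_separated)
good_columns = fixed.splitlines()[0].split("\t")
pairs = [column.split(":") for column in good_columns]

print(len(bad_columns), len(parts), len(good_columns))
```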

w764567792 commented 2 years ago

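Thank you very much. I actually solved it by overwriting an existing dataset with my own; using my dataset directly raises an error. I hope your team can improve this.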

hyp1231 commented 2 years ago

Thanks for the feedback. Could you explain in detail what you did when you "overwrote an existing dataset with your own"?

w764567792 commented 2 years ago

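For example, I take my own dataset and arrange it in the atomic file format your code uses. Then I download a dataset the project has already processed, such as ml-1m, copy my data, and paste it into ml-1m.inter. After that, training works.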

hyp1231 commented 2 years ago

OK, there may be some inconsistent configuration. I just tested a newly processed dataset and have not hit any errors so far.

w764567792 commented 2 years ago

Hello, could you explain how to process a new dataset and use it for training?

hyp1231 commented 2 years ago

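Typically: first convert the dataset into the atomic file format; then put it in the dataset/ folder (keeping the same folder layout as the other datasets); finally check that the column names in the config file, such as USER_ID_FIELD, match those in the atomic files. After these three steps it should generally work.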
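
The third step can be checked with a few lines of Python. The sketch below is an illustration under assumptions: it takes the folder layout to be `dataset/<name>/<name>.inter`, uses a hypothetical dataset called `spotify`, and represents the config as a plain dictionary rather than reading a real RecBole config file. The helpers `read_header_fields` and `missing_fields` are made up for this example.

```python
import os
import tempfile

def read_header_fields(inter_path, sep="\t"):
    """Return {column_name: column_type} parsed from the first line of an atomic file."""
    with open(inter_path, encoding="utf-8") as f:
        header = f.readline().rstrip("\r\n")
    fields = {}
    for column in header.split(sep):
        name, ftype = column.split(":")
        fields[name] = ftype
    return fields

def missing_fields(config, header_fields, keys=("USER_ID_FIELD", "ITEM_ID_FIELD")):
    """Return the config entries whose column name is absent from the file header."""
    return {key: config[key] for key in keys if config[key] not in header_fields}

# Assumed layout: <data_root>/<dataset_name>/<dataset_name>.inter
data_root = os.path.join(tempfile.mkdtemp(), "dataset")
dataset_name = "spotify"
os.makedirs(os.path.join(data_root, dataset_name))
inter_path = os.path.join(data_root, dataset_name, dataset_name + ".inter")
with open(inter_path, "w", encoding="utf-8") as f:
    f.write("user_id:token\tsong_id:token\tskipped:float\n1\t2923\t4\n")

header_fields = read_header_fields(inter_path)

# The item column in the file is called song_id, so a config that still
# says item_id does not match the header.
mismatched = missing_fields(
    {"USER_ID_FIELD": "user_id", "ITEM_ID_FIELD": "item_id"}, header_fields
)
matched = missing_fields(
    {"USER_ID_FIELD": "user_id", "ITEM_ID_FIELD": "song_id"}, header_fields
)
print(mismatched)
```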

w764567792 commented 2 years ago

OK. I also just noticed that training has been running on the CPU the whole time. How do I make it train the model on the GPU?

hyp1231 commented 2 years ago

Please see https://github.com/RUCAIBox/RecBole/issues/1057 and https://github.com/RUCAIBox/RecBole/issues/1263

w764567792 commented 2 years ago

OK. When I use my dataset by adding atomic files the way you described, I get the following error:

16 May 02:12 WARNING In inter_feat, line [3], item_id do not exist, so they will be removed.
Traceback (most recent call last):
  File "run_recbole_gnn.py", line 15, in <module>
    run_recbole_gnn(model=args.model, dataset=args.dataset, config_file_list=config_file_list)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/recbole_gnn/quick_start.py", line 29, in run_recbole_gnn
    dataset = create_dataset(config)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/recbole_gnn/utils.py", line 56, in create_dataset
    dataset = dataset_class(config)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/recbole_gnn/data/dataset.py", line 16, in __init__
    super().__init__(config)
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/RecBole/recbole/data/dataset/dataset.py", line 96, in __init__
    self._from_scratch()
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/RecBole/recbole/data/dataset/dataset.py", line 108, in _from_scratch
    self._data_processing()
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/RecBole/recbole/data/dataset/dataset.py", line 151, in _data_processing
    self._data_filtering()
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/RecBole/recbole/data/dataset/dataset.py", line 178, in _data_filtering
    self._reset_index()
  File "/media/wl/f775d2d2-3a2c-4bc9-a196-74cf23d1726a/推荐系统/RecBole-GNN-main/RecBole/recbole/data/dataset/dataset.py", line 828, in _reset_index
    raise ValueError('Some feat is empty, please check the filtering settings.')
ValueError: Some feat is empty, please check the filtering settings.

hyp1231 commented 2 years ago

First, check that the ITEM_ID_FIELD in effect when the program runs matches the name of the item ID column in your newly processed atomic file.

Next, check whether any rows in the .inter file are missing the item ID; judging from the WARNING, the item_id on line 3 of your file appears to be NaN.

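If both of those are fine, the dataset filtering may be too aggressive, leaving no interaction records. Please check that user_inter_num_interval and item_inter_num_interval are set to reasonable values.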
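
The second check can be automated with a short scan of the file. This is a sketch only: `find_bad_rows` is a hypothetical helper, the sample rows are invented, and the line numbers it reports count lines in the file (header is line 1), which may not correspond to the numbering used in the WARNING message.

```python
def find_bad_rows(lines, item_field, sep="\t"):
    """Return (line_number, reason) for rows with a wrong column count or an empty item ID."""
    header = lines[0].split(sep)
    names = [column.split(":")[0] for column in header]
    item_idx = names.index(item_field)
    problems = []
    for line_no, line in enumerate(lines[1:], start=2):
        cells = line.split(sep)
        if len(cells) != len(header):
            problems.append((line_no, "wrong column count"))
        elif cells[item_idx].strip() == "":
            problems.append((line_no, "empty item id"))
    return problems

# Invented sample: the third line has an empty item ID, the fourth is short.
sample = [
    "user_id:token\tsong_id:token\tskipped:float",
    "1\t2923\t4",
    "1\t\t4",
    "1\t1941",
]
problems = find_bad_rows(sample, "song_id")
print(problems)
```

For a real file, the list of lines can be obtained with `open(path, encoding="utf-8").read().splitlines()`.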

w764567792 commented 2 years ago

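After carefully checking everything as you suggested, I switched to GPU training successfully and the dataset now works as well. It was not easy! I really like your team's project, and thank you for your patient answers!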

hyp1231 commented 2 years ago

Since there has been no update for a long time, I am closing this issue. If you have other questions, feel free to raise them in a new issue.

RUCAIBox / RecBole-GNN

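[💡SUG] Our team urgently needs two datasets for our research. Could you add preprocessing scripts for these two datasets? We would be very grateful if you could. #38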