Miopha opened this issue 2 days ago
[1] and [2] seem to have the same issue, as they share very similar code for data sampling, which may also involve data leakage.

[1] GPFedRec: Graph-Guided Personalization for Federated Recommendation. KDD 2024.
[2] Federated Recommendation with Additive Personalization. ICLR 2024.
Thank you for your interest in our work! I am glad to clarify this.
In your experimental setting, each user trains the recommendation model independently on their personal training data. As the number of training iterations increases, the model observes more and more negative samples. This can lead to overfitting: the observed negative samples are assigned very low scores, so the unseen test items end up being ranked relatively high and achieve abnormally good performance. This issue is more likely to occur when the total number of items is small (e.g., ml-100k) and less likely to arise on larger datasets that are more representative of real-world recommendation scenarios (e.g., lastfm-2k).
I would like to clarify that this overfitting phenomenon does not constitute data leakage. It can be alleviated by incorporating larger, more realistic recommendation datasets.
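To make the ranking effect concrete, here is a minimal, self-contained sketch of the usual leave-one-out metrics (HR@K and NDCG@K, with the single held-out test item ranked against sampled negatives). The scores are made up for illustration; this is not our repository's evaluation code.

```python
import math

def hit_ratio_and_ndcg_at_k(test_score, negative_scores, k=10):
    """Rank the single held-out test item against the sampled negatives."""
    rank = 1 + sum(s > test_score for s in negative_scores)  # 1-based rank of the test item
    hit = 1.0 if rank <= k else 0.0
    ndcg = 1.0 / math.log2(rank + 1) if rank <= k else 0.0
    return hit, ndcg

# Made-up scores after many local epochs: items the client has repeatedly seen as
# training negatives are pushed towards zero, while the held-out test item (never
# sampled as a negative) keeps a comparatively high score.
negative_scores = [0.01 * i for i in range(99)]  # 99 evaluation negatives, all low
test_score = 0.99                                # the unseen test item

print(hit_ratio_and_ndcg_at_k(test_score, negative_scores))  # -> (1.0, 1.0)
```

Once the locally observed negatives are all scored near zero, even a moderately scored unseen test item lands at rank 1, which is why HR@10 and NDCG@10 can look abnormally good without any leakage.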
Additionally, in your proposed setup, the test item can be sampled as a negative during training, which can lead to unstable training because the same item then carries conflicting labels (positive in the test set, negative during training).
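For illustration, here is a small, self-contained sketch of that label conflict, using hypothetical item IDs rather than code from this repository: when negatives are drawn only from items the client did not interact with during training, the held-out test item becomes a legal negative candidate.

```python
import random

item_pool = set(range(20))   # hypothetical catalogue of 20 items
train_items = {1, 4, 7}      # items this client interacted with in the training split
test_item = 11               # the client's held-out positive item

# Negatives drawn only from items not seen in training -> test_item is still a candidate
candidate_negatives = item_pool - train_items
training_negatives = random.sample(sorted(candidate_negatives), 5)

if test_item in training_negatives:
    # the same item now carries label 0 during training and label 1 at test time
    print(f"conflict: item {test_item} was sampled as a training negative")
else:
    print("no conflict in this draw (it occurs with probability 5/17 per draw)")
```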
I hope this explanation helps address your concerns!
Hi, I really appreciate your work and have the following issue: I observed that when I turn off federated learning (i.e., disable downloading the item embeddings from the server), specifically by modifying `engine.py` as follows:

```python
# user_param_dict['embedding_item.weight'] = copy.deepcopy(self.server_model_param['embedding_item.weight'].data).cuda()
# keep the locally trained item embeddings instead of the ones downloaded from the server
user_param_dict['embedding_item.weight'] = user_param_dict['embedding_item.weight'].cuda()
```
I then got wonderful performance (HR@10 = 1.0000, NDCG@10 = 0.9775), which is abnormally better than the proposed method. So I dug deeper into the issue and found that the negative items used during training can never include a client's test items; the relevant code is:
```python
self.negatives = self._sample_negative(self.ratings)

def _sample_negative(self, ratings):
    interact_status = ratings.groupby('userId')['itemId'].apply(set).reset_index().rename(columns={'itemId': 'interacted_items'})
    interact_status['negative_items'] = interact_status['interacted_items'].apply(lambda x: self.item_pool - x)
    interact_status['negative_samples'] = interact_status['negative_items'].apply(lambda x: random.sample(x, 198))
    return interact_status[['userId', 'negative_items', 'negative_samples']]
```
where `self.ratings` contains all of a client's ratings (train, validation, and test). Since a client should not have access to its test (or validation) data during training, doesn't this potentially lead to data leakage? The correct code should be:

```python
self.negatives = self._sample_negative(self.train_ratings)
```
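To make the concern concrete, here is a small, self-contained sketch with hypothetical data (the pandas column names simply follow the snippet above; this is not the project's code). It shows that when the negative pools are built from all ratings, a user's held-out test item can never be sampled as a negative, i.e., negative sampling uses knowledge the client should not have:

```python
import pandas as pd

# Tiny hypothetical dataset; column names follow the quoted _sample_negative code.
all_ratings = pd.DataFrame({
    'userId': [0, 0, 0, 1, 1],
    'itemId': [3, 5, 9, 2, 7],
})
test_ratings = all_ratings.groupby('userId').tail(1)  # leave-one-out: last item per user is the test item
item_pool = set(range(12))

# Negative pools built from ALL ratings, as in the quoted code:
interacted = all_ratings.groupby('userId')['itemId'].apply(set)
negative_pools = interacted.apply(lambda items: item_pool - items)

for user, test_item in zip(test_ratings['userId'], test_ratings['itemId']):
    assert test_item not in negative_pools[user]  # the test item can never be drawn as a negative

print("every user's test item is excluded from that user's negative pool")
```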
I'm not sure whether I have missed something, so I would appreciate any advice.