RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] Implausible metrics? #622

Closed deklanw closed 3 years ago

deklanw commented 3 years ago

Trying out my implementation of SLIM with ElasticNet (https://github.com/RUCAIBox/RecBole/pull/621), I'm noticing some implausible numbers. The dataset is ml-100k with all defaults, using the default hyperparameters of my method defined in its yaml file (not yet well-chosen, because these results are so off): https://github.com/RUCAIBox/RecBole/blob/41a06e59ab26482dbfac641caac99876c167168c/recbole/properties/model/SLIMElastic.yaml

Using this standard copy-pasted code

dataset_name = "ml-100k"

model = SLIMElastic

config = Config(model=model, dataset=dataset_name)
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()

logger.info(config)

# dataset filtering
dataset = create_dataset(config)
logger.info(dataset)

# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

# model loading and initialization
model = model(config, train_data).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

# model evaluation
test_result = trainer.evaluate(test_data)

logger.info('best valid result: {}'.format(best_valid_result))
logger.info('test result: {}'.format(test_result))

Results: INFO test result: {'recall@10': 0.8461, 'mrr@10': 0.5374, 'ndcg@10': 0.7102, 'hit@10': 1.0, 'precision@10': 0.6309}

Also, my HyperOpt log is highly suspicious

alpha:0.316482837679784, hide_item:False, l1_ratio:0.9890017268444972, positive_only:False
Valid result:
recall@10 : 0.8461    mrr@10 : 0.5368    ndcg@10 : 0.7099    hit@10 : 1.0000    precision@10 : 0.6309    
Test result:
recall@10 : 0.8461    mrr@10 : 0.5374    ndcg@10 : 0.7102    hit@10 : 1.0000    precision@10 : 0.6309

...

alpha:0.47984629320482386, hide_item:False, l1_ratio:0.9907136437218732, positive_only:True
Valid result:
recall@10 : 0.8461    mrr@10 : 0.5368    ndcg@10 : 0.7099    hit@10 : 1.0000    precision@10 : 0.6309    
Test result:
recall@10 : 0.8461    mrr@10 : 0.5374    ndcg@10 : 0.7102    hit@10 : 1.0000    precision@10 : 0.6309

...

alpha:0.9530393537754144, hide_item:True, l1_ratio:0.24064058250190196, positive_only:True
Valid result:
recall@10 : 0.6251    mrr@10 : 0.3611    ndcg@10 : 0.4954    hit@10 : 0.9650    precision@10 : 0.4709    
Test result:
recall@10 : 0.6535    mrr@10 : 0.4012    ndcg@10 : 0.5357    hit@10 : 0.9745    precision@10 : 0.5019    

Exact same results with different parameters?

I figure that if there were a mistake in my implementation, it would cause bad performance, not amazingly good performance.

Anyone know what could be causing this?

ShanleiMu commented 3 years ago

I printed the prediction scores. It seems all the scores are 0, which causes the implausible metrics.

To speed up the full-sort prediction, we put the ground-truth items at the beginning of the list to be sorted, and the sort is stable. So if all the candidate items have the same score, the evaluation will report artificially high performance.
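
To make the mechanism concrete, here is a minimal sketch (my own toy numbers, not RecBole's evaluator code): if the ground-truth items sit at the front of each user's candidate list and a stable sort is applied to all-equal scores, the positives never move, so every top-k cutoff is filled with positives.

import numpy as np

n_items = 100
n_pos = 10                                   # hypothetical: 10 ground-truth items for this user
scores = np.zeros(n_items)                   # degenerate model: every item scores 0.0

# after the re-organizing stage, positions 0..n_pos-1 hold the ground-truth items
ranked = np.argsort(-scores, kind="stable")  # a stable sort leaves ties in their original order
top10 = ranked[:10]
print((top10 < n_pos).mean())                # 1.0, i.e. precision@10 = 1.0 for an all-zero model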

I must admit that the current sorting evaluation method is not very good.

deklanw commented 3 years ago

@ShanleiMu Ah! I see. That is tricky.

The output of all 0s is, I believe, caused by the hyperparameter controlling the L1 penalty in the regression being too large. I was hoping to use HyperOpt to determine reasonable values for that coefficient (it's probably problem-specific), but with the evaluation working like this, I don't see how I could make that determination!
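
For reference, a toy illustration (mine, not the PR's code, using scikit-learn's ElasticNet directly) of how a too-large penalty zeroes out every coefficient, which is exactly what produces all-zero prediction scores:

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.random((50, 20))
y = X @ rng.random(20)

for alpha in (0.01, 1.0, 100.0):
    enet = ElasticNet(alpha=alpha, l1_ratio=0.99, positive=True, fit_intercept=False)
    enet.fit(X, y)
    # the number of nonzero coefficients drops to 0 once the penalty dominates the data fit
    print(alpha, np.count_nonzero(enet.coef_))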

Is there a way around this?

tsotfsk commented 3 years ago

Thanks for Shan Lei's quick reply. I'd like to add some supplementary explanation. In fact, none of the top-k metrics we implement can handle items that have the same score (GAUC is an exception, because GAUC uses the average rank as a solution). Frankly speaking, for deep-learning recommendation algorithms, in my understanding, items having the same score is a very low-probability event, and most open-source recommendation codebases, such as recommenders and NeuRec, do not deal with this situation. But for non-deep-learning algorithms, it may be common for some items to have the same score, especially in the early stages of training.
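
As a small illustration of that average-rank idea (my own sketch, not RecBole's GAUC code): tied scores share the mean of the ranks they occupy, instead of letting sort order decide who comes first.

import numpy as np
from scipy.stats import rankdata

scores = np.zeros(4)                        # four tied items
print(rankdata(-scores, method="ordinal"))  # [1 2 3 4]  -> order decided purely by position
print(rankdata(-scores, method="average"))  # [2.5 2.5 2.5 2.5] -> ties share the average rank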

The sort trick we use causes positive items to always appear at the top of the block of items that share the same score, because we have a re-organizing stage to speed up the evaluation. For more information about this trick, see the evaluation documentation.

Maybe we should randomize the positions of the positive items when several items have the same score. We will discuss this problem and try to come up with a feasible solution to alleviate it!

tsotfsk commented 3 years ago

Hi! @deklanw. I think randomly generating numbers in a small range and adding them to the scores may solve this problem, because it prevents items from having exactly the same score. If the model has not learned any information, the result will be very poor; if the model has learned something meaningful, the small random numbers will not have a great impact on the result. This may help you determine a reasonable range for the L1 parameter.
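
A quick check of that reasoning (toy numbers of my own): noise of magnitude 1e-5 turns exact ties into a random strict order but leaves well-separated scores ranked exactly as before.

import torch

torch.manual_seed(0)
ties = torch.zeros(5)                      # degenerate scores, all equal
learned = torch.tensor([0.9, 0.5, 0.1])    # clearly separated scores

noisy_ties = ties + 1e-5 * torch.rand(ties.shape)
noisy_learned = learned + 1e-5 * torch.rand(learned.shape)

print(torch.argsort(noisy_ties, descending=True))     # a random strict order among the ties
print(torch.argsort(noisy_learned, descending=True))  # still tensor([0, 1, 2])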

deklanw commented 3 years ago

@tsotfsk Thanks, that all makes sense.

The adding-random-noise idea worked well. But it can't be just a temporary solution, because I believe the appropriate L1 hyperparameter depends on the problem. I've chosen hyperparameter defaults that work well on ml-100k, but that's as far as I can tell.

def add_noise(t, mag=1e-5):
    # break exact ties with uniform noise that is tiny relative to real score differences
    return t + mag * torch.rand(t.shape)

...

    def predict(self, interaction):
        user = interaction[self.USER_ID].cpu().numpy()
        item = interaction[self.ITEM_ID].cpu().numpy()

        # score of each (user, item) pair: the user's interaction row dotted with
        # the corresponding column of the learned item-item similarity matrix
        r = torch.from_numpy((self.interaction_matrix[user, :].multiply(
            self.item_similarity[:, item].T)).sum(axis=1).getA1())

        return add_noise(r)

    def full_sort_predict(self, interaction):
        user = interaction[self.USER_ID].cpu().numpy()

        # scores for every item: the users' interaction rows times the similarity matrix
        r = self.interaction_matrix[user, :] @ self.item_similarity
        r = torch.from_numpy(r.todense().getA1())

        return add_noise(r)

I'm fine with leaving this in permanently.

Seem fine?

tsotfsk commented 3 years ago

Hi! @deklanw. :blush: Your code looks fine, and I tested the HyperOpt module after adding the noise; it also works well. The implausible metrics disappear, and the effect of the noise on the results is very small and can be ignored.

deklanw commented 3 years ago

Thanks for the help