airKlizz / MsMarco

Re-ranking task using MS MARCO dataset and Hugging Face library
15 stars, 2 forks

Why not use the Hugging Face classification pipeline? #1

pommedeterresautee closed 4 years ago

pommedeterresautee commented 4 years ago

This project implements its own loss, etc., instead of calling the Hugging Face pipeline. Is there a reason for that? Moreover, the loss implemented in the Scorer class is a softmax rather than the cross-entropy used in https://arxiv.org/pdf/1901.04085.pdf. Can you tell me why you made this design choice? Could it partly explain the gap between the MRR you obtained and the one published on the leaderboard?

Thank you for your answers
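For readers unfamiliar with the metric discussed here, the following is a minimal sketch of how MRR@10 is commonly computed for passage re-ranking. It is not taken from this repository: the function name `mrr_at_k`, the dictionary layout, and the toy query and passage ids are all assumptions made for illustration.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank, looking only at the top-k passages per query.

    rankings: dict mapping query id -> list of passage ids, best first (hypothetical layout)
    relevant: dict mapping query id -> set of relevant passage ids (hypothetical layout)
    """
    total = 0.0
    for qid, ranked in rankings.items():
        # Find the first relevant passage within the top k; rank is 1-based.
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank
                break
        # Queries with no relevant passage in the top k contribute 0.
    return total / len(rankings) if rankings else 0.0


# Toy data, invented for this example only.
rankings = {
    "q1": ["p3", "p1", "p7"],  # first relevant passage at rank 2
    "q2": ["p9", "p4", "p2"],  # first relevant passage at rank 1
    "q3": ["p5", "p6"],        # no relevant passage retrieved
}
relevant = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p0"}}

score = mrr_at_k(rankings, relevant)
print(score)
```

Because only the rank of the first relevant passage counts, small changes in how the top few candidates are ordered can move the score noticeably.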

airKlizz commented 4 years ago

Hi, I didn't implement my own loss function; I use CategoricalCrossentropy from TensorFlow. I also didn't use the Hugging Face pipeline or the Keras model fit() function because there is a lot of data (so only one epoch) and I wanted to compute some metrics during the epoch. There are many small differences between the model I implemented and the one from the paper. First, I use all token vectors instead of only the [CLS] one as they did in the paper. Also, I guess they have one final neuron instead of the two I have (but I think it makes no difference because of the softmax). Despite these differences, we use the same loss function, unless I made an implementation mistake. Regarding the MRR obtained, I think it is mostly due to the number of training steps: I used much less training data than they did. In this paper: https://arxiv.org/pdf/2003.06713.pdf (which is really interesting if you are looking at passage ranking) you can see that the MRR is low when training on only thousands of samples.

I hope I answered your questions!
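The remark above that one final neuron versus two should make no difference can be checked numerically. The sketch below uses plain Python rather than TensorFlow, and the logits are made-up values. It relies on the identity that the softmax probability of class 1 over two logits equals the sigmoid of their difference, so categorical cross-entropy on two outputs matches binary cross-entropy on a single output whose logit is that difference. Whether the paper's model actually has one output neuron is the commenter's guess, not something verified here.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def categorical_cross_entropy(one_hot, probs):
    # Only the true class contributes for a one-hot target.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(label, prob):
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


# Hypothetical logits for the classes (not relevant, relevant).
logits = [0.3, 1.7]

# Two output neurons + softmax + categorical cross-entropy, target = relevant.
two_neuron_loss = categorical_cross_entropy([0.0, 1.0], softmax(logits))

# One output neuron whose logit is the difference, + sigmoid + binary cross-entropy.
one_neuron_loss = binary_cross_entropy(1, sigmoid(logits[1] - logits[0]))

print(two_neuron_loss, one_neuron_loss)
```

The two losses agree up to floating-point error, which supports the claim that the number of output neurons alone would not explain a gap in MRR.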

pommedeterresautee commented 4 years ago

Thank you for your answer. Sorry, I am not familiar with TF; I was looking for the same architecture as in the PyTorch model... I get why you avoided the Hugging Face pipelines. I am wondering about your design choice to use all tokens: did you compare performance against using only [CLS], or is there another reason?
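To make the two options in this question concrete, here is a toy sketch contrasting [CLS] pooling with one possible way of using all tokens. The thread does not say how the token vectors are combined in the repository, so masked mean pooling is purely an assumption here, and the function names and numbers are invented for illustration.

```python
def cls_pooling(token_vectors):
    # Use only the vector of the first token, which is [CLS] in BERT-style inputs.
    return list(token_vectors[0])


def mean_pooling(token_vectors, attention_mask):
    # Average the vectors of all non-padding tokens.
    # Assumption: this is one way to "use all tokens", not necessarily the repo's.
    dim = len(token_vectors[0])
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


# Toy encoder output: 4 tokens with 2 dimensions each; the last token is padding.
token_vectors = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

cls_vec = cls_pooling(token_vectors)
mean_vec = mean_pooling(token_vectors, attention_mask)
print(cls_vec, mean_vec)
```

Either pooled vector would then feed the final classification layer; which one ranks passages better is an empirical question that this thread leaves open.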