ArvinZhuang / DSI-transformers

A huggingface transformers implementation of "Transformer Memory as a Differentiable Search Index"
MIT License
155 stars 14 forks

Does this code include the Semantic String Docid method from the paper, which clusters the docids? #2

Closed CheaSim closed 2 years ago

CheaSim commented 2 years ago

Hi, maybe the Semantic String Docid method would help improve the performance of DSI? In data/create_NQ_train_vali.py, random doc ids are used.

ArvinZhuang commented 2 years ago

Hi, yep, it should help according to the original paper. I also find that further randomizing the naive ids, by shuffling all the ids rather than assigning them in ascending corpus order, increases the scores quite a bit.
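The shuffled naive-id assignment described above can be sketched as follows. This is a hedged illustration, not code from the repo; the function name `naive_ids` is made up here:

```python
import random

def naive_ids(num_docs, shuffle=True, seed=42):
    """Assign each document an integer docid.

    With shuffle=True, the mapping between corpus position and docid is
    randomized, breaking the correlation between corpus order and id
    order (reported above to improve scores vs. ascending assignment).
    """
    ids = list(range(num_docs))
    if shuffle:
        # Deterministic shuffle so the docid mapping is reproducible.
        random.Random(seed).shuffle(ids)
    return ids
```

The ids are still a permutation of 0..N-1, so each document keeps a unique identifier; only the ordering relative to the corpus changes.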

CheaSim commented 2 years ago

Thanks for your reply.

ArvinZhuang commented 2 years ago

I don't have semantic ids implemented; I may give it a try next month. If you want to have a go, you are also welcome to open a PR to add this feature!

CheaSim commented 2 years ago

I implemented it for another NLP task, but it didn't work; I may try using semantic ids with your code.
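For reference, the Semantic String Docid method in the paper builds ids by recursively clustering document embeddings, so that documents with similar embeddings share an id prefix. A minimal sketch under the assumption that you already have one embedding vector per document (the function names and the NumPy-only k-means here are illustrative, not from this repo):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def semantic_docids(X, k=10, c=100, prefix=""):
    """Assign a semantic string docid to each row of X.

    Cluster the embeddings into k groups; each document's id begins with
    its cluster index. Clusters larger than c are clustered recursively,
    so the id is the path of cluster indices plus a position in the leaf.
    Returns {row_index: docid_string}.
    """
    n = len(X)
    if n <= c:
        # Leaf: id suffix is just the position within this small cluster.
        return {i: prefix + str(i) for i in range(n)}
    labels = kmeans(X, k)
    if len(np.unique(labels)) == 1:
        # Degenerate clustering (e.g. identical points): stop recursing.
        return {i: prefix + str(i) for i in range(n)}
    ids = {}
    for j in range(k):
        idx = np.where(labels == j)[0]
        sub = semantic_docids(X[idx], k, c, prefix + str(j) + "-")
        for local, docid in sub.items():
            ids[int(idx[local])] = docid
    return ids
```

With `"-"` as the separator, ids like `0-12` and `0-1-2` stay distinct, so every document gets a unique string; the model then generates these strings token by token, and the shared prefixes give semantically related documents nearby ids.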