ArvinZhuang / DSI-transformers

A huggingface transformers implementation of "Transformer Memory as a Differentiable Search Index"
MIT License

[Bug] Low Performance Due to Constraint in Docid Generation (Limited to Integer Docids) #9

Open isHuangXin opened 5 months ago

isHuangXin commented 5 months ago

Three types of docid representations are introduced in the paper "Transformer Memory as a Differentiable Search Index," namely, Unstructured Atomic Identifiers, Naively Structured String Identifiers, and Semantically Structured Identifiers.

In your code, you currently implement only the first type, Unstructured Atomic Identifiers. During the decoding phase, only integer docids are generated. I believe one potential cause of the lower performance compared with the original paper is a suboptimal selection of INT_TOKEN_IDS.

I suggest removing this section and retraining the DSI model.

[image: screenshot of the code section in question]
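For context, here is a minimal sketch of the kind of integer-only vocabulary restriction being discussed. The helper name `restrict_decode_vocab` and the use of the `t5-base` tokenizer are assumptions for illustration, not necessarily the repo's exact code:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

# Collect the ids of all vocabulary tokens that decode to digit runs,
# so beam search can be limited to integer docids plus end-of-sequence.
SPIECE_UNDERLINE = "▁"  # sentencepiece word-boundary marker
INT_TOKEN_IDS = []
for token, token_id in tokenizer.get_vocab().items():
    # keep tokens like "▁123" (digits at a word boundary) or "123"
    if token.startswith(SPIECE_UNDERLINE) and token[1:].isdigit():
        INT_TOKEN_IDS.append(token_id)
    elif token.isdigit():
        INT_TOKEN_IDS.append(token_id)
INT_TOKEN_IDS.append(tokenizer.eos_token_id)

def restrict_decode_vocab(batch_idx, prefix_beam):
    # Same allowed set at every decoding step: digit tokens plus EOS.
    return INT_TOKEN_IDS
```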

isHuangXin commented 5 months ago

May I add you on WeChat for further communication? My WeChat ID is: is_HuangXin.

ArvinZhuang commented 5 months ago

Hi @isHuangXin, thanks for the comments! Yes, in my demo code I only use Unstructured Atomic Identifiers and consider only integer document IDs. I constrained the generated docids to be integers because all the actual docids are random integers, so there is no need to generate any non-integer tokens during the beam search phase. I believe this constraint actually helps the model's generation.
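For anyone following along, a hedged sketch of how such a restriction is typically wired into HuggingFace beam search via `prefix_allowed_tokens_fn` (reusing `tokenizer` and `restrict_decode_vocab` from the sketch above; the checkpoint name, query, and beam width are placeholder assumptions):

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")

query = "an example query"  # placeholder query
input_ids = tokenizer(query, return_tensors="pt").input_ids

# prefix_allowed_tokens_fn is called at every decoding step for every beam
# and must return the token ids the beam may be extended with; passing
# restrict_decode_vocab confines beam search to integer docids.
outputs = model.generate(
    input_ids,
    max_length=20,
    num_beams=10,
    num_return_sequences=10,
    prefix_allowed_tokens_fn=restrict_decode_vocab,
    early_stopping=True,
)
docids = [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]
```

Because the allowed set here is the same at every step, the function ignores the prefix; a more selective variant could use the decoded prefix to prune to only docids that actually exist in the corpus.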

Unstructured Atomic Identifiers are indeed suboptimal; that said, even with them my code cannot match the performance reported for Unstructured Atomic Identifiers in the original paper.

isHuangXin commented 5 months ago

I encountered the same issue, and it has been quite confusing for me.

Currently, I am attempting to retrain the model from your other work, DSI-QG.