Open isHuangXin opened 5 months ago
May I add you on WeChat for further communication? My WeChat ID is: is_HuangXin.
Hi @isHuangXin, thanks for the comments! Yes, in my demo code, I only use Unstructured Atomic Identifiers and consider only integer document IDs. The reason I constrained generated docids to be integers is that all actual docids are random integers. Thus, there is no need to generate tokens other than integers during the beam search phase. I believe this constraint actually helps the model's generation.
Unstructured Atomic Identifiers are indeed suboptimal; however, my code still cannot match the performance that the original paper reports for Unstructured Atomic Identifiers.
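For readers following the thread, the integer-only constraint described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the toy vocabulary, the helper `integer_token_ids`, and the callback name `restrict_decode_vocab` are all assumptions made up for the example. With Hugging Face Transformers, a callback of this shape could be passed to `generate` through its `prefix_allowed_tokens_fn` argument, with the vocabulary taken from the tokenizer instead of a hand-written dict.

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker used by T5-style tokenizers


def integer_token_ids(vocab, eos_token_id):
    """Return sorted ids of tokens made only of digits (optionally prefixed
    by the word-boundary marker), plus the bare marker and the EOS token."""
    allowed = set()
    for token, token_id in vocab.items():
        # Drop the leading marker, if any, before testing for digits.
        stripped = token[1:] if token.startswith(SPIECE_UNDERLINE) else token
        if stripped.isdigit() or token == SPIECE_UNDERLINE:
            allowed.add(token_id)
    allowed.add(eos_token_id)  # the decoder must be able to stop
    return sorted(allowed)


# Toy vocabulary, made up for illustration only.
TOY_VOCAB = {
    "</s>": 1,
    SPIECE_UNDERLINE: 3,
    SPIECE_UNDERLINE + "the": 5,
    SPIECE_UNDERLINE + "12": 7,
    "3": 9,
    "45": 11,
    SPIECE_UNDERLINE + "doc": 13,
    "4a": 15,
}

INT_TOKEN_IDS = integer_token_ids(TOY_VOCAB, eos_token_id=1)


def restrict_decode_vocab(batch_idx, prefix_ids):
    """Hypothetical beam-search callback: the same integer-only token set is
    allowed at every step, regardless of what has been generated so far."""
    return INT_TOKEN_IDS


print(INT_TOKEN_IDS)
```

Because the callback ignores the prefix, it only rules out non-numeric tokens; it does not guarantee that the generated digit sequence corresponds to an existing docid.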
I encountered the same issue, and it has been quite confusing for me.
Currently, I am attempting to retrain your other work, the DSI-QG model.
Three types of docid representations are introduced in the paper "Transformer Memory as a Differentiable Search Index," namely Unstructured Atomic Identifiers, Naively Structured String Identifiers, and Semantically Structured Identifiers. In your code, you currently implement only the first type, Unstructured Atomic Identifiers. In the decoding phase, only integer docids are generated. I believe that a potential cause of the lower performance compared to the original paper might be the suboptimal selection of INT_TOKEN_IDS. I suggest removing this section and retraining the DSI model.