Open dineshkh opened 3 years ago
@dineshkh: Yes, unfortunately it seems the zeshel data hasn't been properly formatted: the title is duplicated at the beginning of the text field:
{
"title": "Steve Nelson ( American football )",
"text": "Steve Nelson ( American football ) Steven Lee Nelson ( born April 26 , 1951 in Farmington , Minnesota ) is a former professional American football linebacker who played for the New England Patriots from 1974 to 1987 . Nelson was a three sport athlete at Anoka High School earning letters in football , basketball and baseball . As a senior , Nelson was selected as captain , team MVP and to the all - state team in football . Nelson then went on to college at North Dakota State University and graduated from NDSU in 1974 after being named a two time All - American , team captain and MVP in football . He was selected by the Patriots in the 2nd round of the 1974 NFL Draft and missed only three games during his 14 - year NFL career in which he was named team MVP twice . He was selected to the Pro Bowl three times in 1980 , 1984 , and 1985 and his # 57 jersey was retired by the Patriots . He is credited with helping the Patriots reach Super Bowl XX versus the Chicago Bears . After his football retirement , Nelson was the athletic director and head coach at Curry College from 1998 - 2006 ( football coach through 2005 season ) . He currently works as a business development executive for Lighthouse Computer Services , Inc . , a Lincoln , RI - based technology company . In September , 2011 , Nelson was named to the inaugural class of the Anoka High School Hall of Fame . Nelson and his wife Angela reside in Middleboro , MA and he is the father of five daughters ; Cameron , Casey , Caitlin , Kelli and Grace .",
"document_id": "000AD03A11171AA2"
}
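If the duplicated title prefix is a concern, one workaround is to strip it before building the entity representation. A minimal sketch (the helper name is mine, not part of the BLINK codebase):

```python
def strip_title_prefix(title: str, text: str) -> str:
    """Remove the title if it is duplicated at the start of the text."""
    if text.startswith(title):
        return text[len(title):].lstrip()
    return text

entry = {
    "title": "Steve Nelson ( American football )",
    "text": "Steve Nelson ( American football ) Steven Lee Nelson ( born April 26 , 1951 ...",
}
cleaned = strip_title_prefix(entry["title"], entry["text"])
print(cleaned)  # begins with "Steven Lee Nelson ..."
```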
As for using only the first 256 characters: the maximum candidate token length the biencoder takes is 128 by default, so it is probably safe to assume that 256 characters yield at most 128 BERT wordpiece tokens. You could take 512 characters to be on the safer side.
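In other words, the cut happens at the character level before wordpiece tokenization. A minimal sketch of that truncation (the function and the `max_chars` parameter are illustrative, not the repo's names):

```python
def truncate_description(text: str, max_chars: int = 256) -> str:
    # Character-level truncation applied before wordpiece tokenization.
    # Since a wordpiece token rarely spans fewer than 2 characters,
    # 256 characters almost always yield at most 128 tokens.
    return text[:max_chars]

desc = "Steven Lee Nelson ( born April 26 , 1951 in Farmington , Minnesota ) is a former professional American football linebacker " * 5
print(len(truncate_description(desc)))  # 256
```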
@ledw I was training the bi-encoder on the zero-shot EL dataset. I found that the "load_entity_dict_zeshel" function in "zeshel_utils.py" uses only the first 256 characters of the entity description to create the entity representation, while the paper says the input to the entity representation model is the entity title plus the first ten sentences of the description. What is the reason for this difference?
Also, I have trained the bi-encoder and cross-encoder following the instructions in the repo. I am getting 55.01% unnormalized accuracy with BERT-base, while the paper reports 61.34%. I think one of the main reasons for the difference is negative sampling, which is currently not implemented in the code.
Is there any plan to release the implementation of negative sampling?
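For reference, the hard-negative mining described in the paper can be sketched as follows (names and structure are illustrative, not the repo's implementation): score all candidate entities with the trained bi-encoder, then take the top-scoring non-gold candidates as negatives for further training:

```python
def mine_hard_negatives(scores, gold_idx, k=10):
    """Return indices of the k highest-scoring candidates that are not the gold entity.

    scores:   list of bi-encoder scores, one per candidate entity.
    gold_idx: index of the correct entity.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i != gold_idx][:k]

# Example: gold entity is index 2; candidates 4 and 0 score highest among the rest.
scores = [0.9, 0.1, 0.95, 0.3, 0.92]
print(mine_hard_negatives(scores, gold_idx=2, k=2))  # [4, 0]
```

These mined negatives would then be mixed with in-batch random negatives when training the next round of the bi-encoder, or used as the candidate set for the cross-encoder.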