Closed Punchwes closed 3 years ago
Thanks for reaching out, and sorry for the confusion. Please use [SEP] between the title and abstract, as done in training. I updated the example usage in the README accordingly: https://github.com/allenai/specter/pull/23
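Concretely, the training-style input can be sketched like this (a minimal plain-Python sketch; the `build_input` helper name and exact spacing around `[SEP]` are my assumptions, not the specter codebase's API):

```python
# Illustrative sketch: join title and abstract with [SEP], as done in training.
# The helper name and the spacing around SEP are assumptions, not the specter API.
SEP = '[SEP]'

def build_input(paper: dict) -> str:
    """Concatenate title and abstract with the [SEP] separator token."""
    return paper['title'] + ' ' + SEP + ' ' + paper['abstract']

paper = {'title': 'BERT',
         'abstract': 'We introduce a new language representation model called BERT'}
print(build_input(paper))
# BERT [SEP] We introduce a new language representation model called BERT
```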
Thanks very much.
Hi, thanks very much for your code; this is very interesting work. I am a little confused about a few points in it.
In your PyTorch training file, it is clearly stated that you concatenate the title and abstract, separated by [SEP]:
So I expected each input sequence to follow this strategy. However, in your example usage with the Hugging Face toolkit, you use a space ' ' to concatenate the title and abstract:
This confuses me. Why do you switch from '[SEP]' to ' ' at inference time, and how is that compatible with your training scenario, given that you use [SEP] throughout training?
I tried the tokenizer model (AutoTokenizer.from_pretrained('allenai/specter')) to check the tokenized results on your given example: {'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'} ->
'BERT We introduce a new language representation model called BERT' ->
['[CLS]', 'ber', '##t', 'we', 'introduce', 'a', 'new', 'language', 'representation', 'model', 'called', 'ber', '##t', '[SEP]', '[PAD]', '[PAD]']
No [SEP] seems to have been inserted between the title and the abstract.
Many thanks for any reply on this.
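To make the mismatch concrete without downloading the model, a plain-string sketch (variable names are mine) shows that the space-joined input contains no separator token at all, whereas a training-style concatenation does:

```python
SEP = '[SEP]'
title = 'BERT'
abstract = 'We introduce a new language representation model called BERT'

space_joined = title + ' ' + abstract        # as in the README example usage
sep_joined = title + ' ' + SEP + ' ' + abstract  # training-style concatenation

print(SEP in space_joined)  # False: the tokenizer sees one continuous run of text
print(SEP in sep_joined)    # True: an explicit separator between the two fields
```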