allenai / specter

SPECTER: Document-level Representation Learning using Citation-informed Transformers
Apache License 2.0

Confused about the Huggingface usage (using ' ' instead of '[SEP]' for concatenation) #22

Closed Punchwes closed 3 years ago

Punchwes commented 3 years ago

Hi, thanks very much for your code; very interesting work. I am a little confused about a few points in it.

In your PyTorch training file, you clearly concatenate the title and abstract and separate them with [SEP]:

title_field = instance.fields.get(f'{paper_type}_title')
abst_field = instance.fields.get(f'{paper_type}_abstract')
if title_field:
    tokens.extend(title_field.tokens)
if tokens:
    tokens.extend([Token('[SEP]')])
if abst_field:
    tokens.extend(abst_field.tokens)

So I expected each input sequence to follow this strategy. However, in your example usage with the Hugging Face toolkit, you use a space ' ' to concatenate the title and abstract:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/specter')
model = AutoModel.from_pretrained('allenai/specter')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract
title_abs = [d['title'] + ' ' + (d.get('abstract') or '') for d in papers]
# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = model(**inputs)
# take the first token ([CLS]) of each sequence as the embedding
embeddings = result.last_hidden_state[:, 0, :]

This confuses me. Why do you switch from '[SEP]' to ' ' at inference time, and how is that compatible with your training scenario (since you use [SEP] throughout training)?

I tried the tokenizer (AutoTokenizer.from_pretrained('allenai/specter')) to check the tokenized result on your given example: {'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'} ->

'BERT We introduce a new language representation model called BERT' ->

['[CLS]', 'ber', '##t', 'we', 'introduce', 'a', 'new', 'language', 'representation', 'model', 'called', 'ber', '##t', '[SEP]', '[PAD]', '[PAD]']

No [SEP] seems to have been inserted between the title and abstract.

Many thanks for any reply on this.

armancohan commented 3 years ago

Thanks for reaching out, and sorry for the confusion. Please use [SEP] between the title and abstract, as done in training. I have updated the example usage in the README accordingly: https://github.com/allenai/specter/pull/23
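The fix amounts to joining the title and abstract with the model's separator token instead of a space. A minimal sketch of the corrected concatenation (hardcoding '[SEP]' here for illustration; with a loaded SPECTER tokenizer one would typically use tokenizer.sep_token, which is '[SEP]' for this BERT-style vocabulary):

```python
SEP = '[SEP]'  # SPECTER's BERT-style separator token (tokenizer.sep_token)

papers = [{'title': 'BERT',
           'abstract': 'We introduce a new language representation model called BERT'}]

# Join title and abstract with [SEP], matching the training-time input format
title_abs = [d['title'] + SEP + (d.get('abstract') or '') for d in papers]

print(title_abs[0])
# BERT[SEP]We introduce a new language representation model called BERT
```

The resulting strings are then passed to the tokenizer exactly as in the original example; the only change is the separator between the two fields.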

Punchwes commented 3 years ago

Thanks very much.