kieron15 opened this issue 4 years ago

The `embed.py` script calculates embeddings of the inputs in `sample-metadata.json` and stores them in the output file `output.jsonl`. Is there any way to call the function directly from a Python script and get the output embeddings in a variable? Something like what the web API does, but locally and offline. I do not want to load the model for each evaluation call.
We are limited by AllenNLP's predict interface, which only supports reading input from a file and writing the corresponding output by running a predict command. If you need this feature, you will need to modify AllenNLP's behavior. A starting point is investigating the predict command file.
@armancohan I am not familiar with AllenNLP. I tried loading the model using the `transformers` library by defining the following class and using the config file from the SciBERT model:
```python
from transformers import BertPreTrainedModel, BertModel


class GetSpecterEmbeddings(BertPreTrainedModel):
    def __init__(self, config):
        super(GetSpecterEmbeddings, self).__init__(config)
        self.bert = BertModel(config)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, labels=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask)
        sequence_output = outputs[0]  # per-token hidden states of the last layer
        pooled_output = outputs[1]    # pooled [CLS] representation
        return sequence_output, pooled_output
```
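A toy forward pass with this class might look as follows; note the model is randomly initialized here (weight loading is sketched below) and the input ids are placeholders, not real SciBERT vocab ids:

```python
import torch
from transformers import BertConfig

# Path to the SciBERT config is an assumption; adjust to your local layout
config = BertConfig.from_json_file('scibert_scivocab_uncased/config.json')
model = GetSpecterEmbeddings(config)
model.eval()

input_ids = torch.tensor([[101, 2023, 2003, 102]])  # placeholder token ids
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    sequence_output, pooled_output = model(input_ids=input_ids,
                                           attention_mask=attention_mask)
print(sequence_output.shape)  # (1, seq_len, 768)
```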
The following extra keys were found in the SPECTER model compared to the SciBERT model. I removed those keys and successfully loaded the remaining model weights:

```
text_field_embedder.token_embedder_bert._scalar_mix.gamma
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.0
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.1
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.2
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.3
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.4
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.5
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.6
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.7
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.8
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.9
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.10
text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.11
venue_field_embedder.token_embedder_tokens.weight
feedforward._linear_layers.0.weight
feedforward._linear_layers.0.bias
layer_norm.gamma
layer_norm.beta
layer_norm_word_embedding.gamma
layer_norm_word_embedding.beta
layer_norm_word_embedding_venue.gamma
layer_norm_word_embedding_venue.beta
```
I used the `vocab.txt` from the `scibert_scivocab_uncased` model for the tokenizer and encoded the article as:

```
input_ids:      [CLS] Title_tokens [SEP] Abstract_tokens [SEP] ...padding till length 512
token_type_ids: 000000000000000000000001111111111111111111100000000000000000000000000
attention_mask: 111111111111111111111111111111111111111111100000000000000000000000000
```
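For reference, one way to produce this encoding with a `transformers` tokenizer might be the following; the padding flag is version-dependent, and the 200-token truncation discussed later in the thread is not applied here:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('scibert_scivocab_uncased')  # local path assumed

title = 'representation learning of scientific documents'  # example text
abstract = 'we propose a new model for representing abstracts'

enc = tokenizer.encode_plus(
    title, abstract,            # -> [CLS] title [SEP] abstract [SEP]
    max_length=512,
    pad_to_max_length=True,     # padding flag in older transformers versions
)
input_ids = enc['input_ids']
token_type_ids = enc['token_type_ids']
attention_mask = enc['attention_mask']
```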
The embeddings I got using these inputs are different from the ones produced by the code in this repo. Below I have listed potential sources of this issue. Please help me identify the problem.

1. Is the `GetSpecterEmbeddings` class above correct? Are any of the removed keys used for getting embeddings?
2. Is the vocab file correct?
3. Is the encoding of the title and abstract correct?

This is clever, thanks @kieron15 for trying it.
> Are any of the keys ... used for getting embeddings?

Yes, you need the `_scalar_mix` parameters but not the rest. AllenNLP embeds the input as a linear combination of all the BERT layers instead of using the embeddings from the last one. You will need this line in your `__init__` function and this line in your `forward` function. You will also need to configure your `BertModel` to return the embeddings of all layers.
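As a rough sketch (not SPECTER's exact code), such a scalar mix over the hidden layers could look like this; using `output_hidden_states` to get all layers, the choice of 12 layers to match the `scalar_parameters.0`-`11` above, and which hidden states enter the mix are all assumptions:

```python
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    """Learned softmax-weighted sum of per-layer representations, scaled by gamma."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalar_parameters = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layers):  # layers: sequence of (batch, seq, dim) tensors
        weights = torch.softmax(self.scalar_parameters, dim=0)
        mixed = sum(w * t for w, t in zip(weights, layers))
        return self.gamma * mixed


# In GetSpecterEmbeddings.__init__ (12 matches the checkpoint keys above):
#     self.scalar_mix = ScalarMix(num_layers=12)
# In forward, with config.output_hidden_states = True, the hidden states are the
# third element of the BertModel outputs; taking the 12 encoder layers:
#     mixed = self.scalar_mix(outputs[2][-12:])
```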
> Is the vocab file correct?

Yes.

> Is the encoding of title and abstract correct?

- `input_ids`: you might be missing this if statement, which limits the length of the title and abstract; `max_sequence_length` is set to 200.
- `token_type_ids`: it is probably correct, but it would be good to double-check. Please try to put a breakpoint here and check that you got `input_ids`, `token_type_ids`, and `attention_mask` right.

A fourth reason could be dropout and layer norm. Are you using `model.eval()`?
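That is, a minimal pattern for deterministic inference (variable names follow the earlier snippets):

```python
import torch

model.eval()           # disable dropout so embeddings are deterministic
with torch.no_grad():  # no gradients needed for embedding extraction
    sequence_output, pooled_output = model(input_ids=input_ids,
                                           attention_mask=attention_mask)
```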
We would love to have this in our repo. Please feel free to open a PR with your model and a sample script to call it, and we will be happy to merge it.
@ibeltagy Thank you very much for the specific instructions.

I have modified the `__init__` and `forward` functions in my class above as per your instructions. By putting breakpoints in `bert_token_embedder.py`, I have verified that the outputs of the BERT model and the `mix` variable match.

But the output of the `BertEmbedder` class is not the final SPECTER embedding. It outputs a tensor of size `torch.Size([1, 200, 768])` here. The `max_sequence_length` of 200 seems to apply to word tokens and not to word-piece tokens.

Please tell me how to get the final SPECTER embedding from this tensor. Also, can you point me towards the way the word-token `offsets` used in this function are obtained?
As the config indicates,

```json
"title_encoder": {
    "type": "boe",
    "embedding_dim": 768
},
```

we go from a sequence of word embeddings to a document embedding using a simple bag-of-embeddings approach that just sums up the embeddings, as in here.
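A minimal sketch of that pooling step; whether padding is masked out exactly like this is an assumption:

```python
import torch

def boe_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Bag-of-embeddings: sum token vectors over the sequence, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).type_as(token_embeddings)  # (batch, seq, 1)
    return (token_embeddings * mask).sum(dim=1)                    # (batch, dim)
```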
To compute the offsets, first tokenize the sentence into tokens using this, then compute the offsets using this.
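The linked helpers are the authoritative version; purely as an illustration, end-style offsets could be computed like this (the 1-based start after `[CLS]` and the end-offset convention are assumptions):

```python
def wordpiece_end_offsets(word_tokens, wordpiece_tokenizer):
    """For each word token, the index of its last word piece in the final sequence."""
    offsets, cursor = [], 1                # position 0 is [CLS]
    for word in word_tokens:
        n_pieces = len(wordpiece_tokenizer.tokenize(word))
        cursor += n_pieces
        offsets.append(cursor - 1)         # last piece of this word
    return offsets
```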
Thanks @ibeltagy for all the help. I am almost there.

I have verified that I am getting the word-token list correctly (I had to use `BertBasicWordSplitter` instead of `SimpleWordSplitter`).

To get the word-piece tokens and offsets from this function, I need to pass a correct `Vocabulary` object as input. I tried passing an empty `Vocabulary` object, but it outputs `offsets` that are slightly off and a word-piece token list without the `[CLS]` and final `[SEP]` tokens.
I found the model's `Vocabulary` to have the following contents:

```
Vocabulary with namespaces:
    Non Padded Namespaces: {'*labels', '*tags'}
    Namespace: venue, Size: 9017
    Namespace: tokens, Size: 274941
    Namespace: author, Size: 185030
    Namespace: author_positions, Size: 10
    Namespace: bert, Size: 31092
```
I do have the `author.txt`, `author_positions.txt`, `non_padded_namespaces.txt`, `tokens.txt`, and `venue.txt` files from `model.tar.gz`, but how do I add them to the `Vocabulary` object? Also, how do I build the `vocab` dict needed here from the SciBERT vocab file?

Sorry I was late.
```python
from allennlp.data.vocabulary import Vocabulary

vocab = Vocabulary.from_files('path_to_the_vocabulary_dir')
```
Also, you don't need to construct a `WordpieceIndexer` yourself. Instead, try instantiating `PretrainedBertIndexer` with `pretrained_model='scibert_scivocab_uncased'`.
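For example (the `do_lowercase` flag is an assumption about the AllenNLP version in use):

```python
from allennlp.data.token_indexers import PretrainedBertIndexer

# 'scibert_scivocab_uncased' can be a model name or a local path to the vocab
indexer = PretrainedBertIndexer(pretrained_model='scibert_scivocab_uncased',
                                do_lowercase=True)
```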
@kieron15, I am curious if you have any updates on this.
@ibeltagy Thanks for the input regarding `PretrainedBertIndexer`.

Instantiating `PretrainedBertIndexer` does work, but the `token_type_ids` and `attention_mask` it returns still need to be modified before calling the SPECTER model. To verify that all the calculation steps are right, the `transformers`-based implementation will need to be tested on a number of examples.

Anyway, I just wanted the behavior already implemented in the web API of this model. I stumbled across this method of `SpecterPredictor`. Instantiating this class and using that method should give the embedding output in a variable. I am working on this.
I had the same issue and found the following code to work:

```python
from allennlp.models import load_archive
from specter.predict_command import predictor_from_archive

# load to register
from specter.model import Model
from specter.data import DataReader, DataReaderFromPickled
from specter.predictor import SpecterPredictor

archive_path = './model.tar.gz'
metadata_path = './metadata.json'
included_text_fields = 'abstract title'
vocab_dir = 'data/vocab/'

overrides = f"{{'model':{{'predict_mode':'true','include_venue':'false'}},'dataset_reader':{{'type':'specter_data_reader','predict_mode':'true','paper_features_path':'{metadata_path}','included_text_fields': '{included_text_fields}'}},'vocabulary':{{'directory_path':'{vocab_dir}'}}}}"

archive = load_archive(archive_path, overrides=overrides)
predictor = predictor_from_archive(archive, predictor_name='specter_predictor', paper_features_path=metadata_path)

embed_papers = predictor.predict_json(dict(
    paper_id='your paper id',
    title='representation learning of scientific documents',
    abstract='we propose a new model for representing abstracts'
))
```
For some reason, `metadata.json` needs to exist for `predictor_from_archive` not to fail.
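If you have no real metadata, a stub file may be enough to get past that check; whether an empty JSON object satisfies the dataset reader is an assumption worth verifying:

```python
import json
import os

if not os.path.exists('metadata.json'):
    with open('metadata.json', 'w') as f:
        json.dump({}, f)  # empty stub so predictor_from_archive can load it
```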
@armancohan @ibeltagy

I'd like to use SPECTER as part of a larger PyTorch module, so using the `Predictor` is not feasible (also for training). The following code seems to work; however, the resulting embedding does not match the embedding from the `Predictor`. Any ideas why this could be?
```python
import torch
from allennlp.data import Token
from allennlp.data.token_indexers import PretrainedBertIndexer
from allennlp.models import load_archive
from allennlp.data.vocabulary import Vocabulary
from transformers import BertTokenizerFast

# load to register
from specter.model import Model

archive_path = './model.tar.gz'
metadata_path = './metadata.json'
included_text_fields = 'abstract title'
vocab_dir = 'data/vocab/'

overrides = f"{{'model':{{'predict_mode':'true','include_venue':'false'}},'dataset_reader':{{'type':'specter_data_reader','predict_mode':'true','paper_features_path':'{metadata_path}','included_text_fields': '{included_text_fields}'}},'vocabulary':{{'directory_path':'{vocab_dir}'}}}}"

# load from archive file
archive = load_archive(archive_path, overrides=overrides)
model = archive.model
vocab = Vocabulary.from_files(vocab_dir)

tokenizer_path = '/Volumes/data/repo/data/bert/scibert-scivocab-uncased'
indexer = PretrainedBertIndexer(tokenizer_path)
tokenizer = BertTokenizerFast.from_pretrained(tokenizer_path)

tokens = [Token(t) for t in tokenizer.tokenize('representation learning of scientific documents')]
instance = indexer.tokens_to_indices(tokens, vocab, 'bert')
tensor_instance = {k: torch.tensor([v]) for k, v in instance.items()}

model_out = model(source_title=tensor_instance)
print(model_out['embedding'])
```
I managed to run it with the following code:

```python
import os

from allennlp.models import load_archive
from specter.predict_command import predictor_from_archive

# load to register
from specter.model import Model
from specter.data import DataReader, DataReaderFromPickled
from specter.predictor import SpecterPredictor

# load the model files
archive_path = os.path.dirname(os.path.abspath(__file__)) + '/model.tar.gz'
metadata_path = os.path.dirname(os.path.abspath(__file__)) + '/metadata.json'
included_text_fields = 'abstract title'
vocab_dir = os.path.dirname(os.path.abspath(__file__)) + '/data/vocab/'

overrides = f"{{'model':{{'predict_mode':'true','include_venue':'false'}},'dataset_reader':{{'type':'specter_data_reader','predict_mode':'true','paper_features_path':'{metadata_path}','included_text_fields': '{included_text_fields}'}},'vocabulary':{{'directory_path':'{vocab_dir}'}}}}"

archive = load_archive(archive_path, overrides=overrides)
predictor = predictor_from_archive(archive, predictor_name='specter_predictor', paper_features_path=metadata_path)

# disable paper cache
predictor._dataset_reader.use_paper_feature_cache = False


def encode(data: dict):
    """Encodes the data with the SPECTER encoder."""
    # ignore the paper id (overwrite it so it never influences the result)
    data["paper_id"] = " "
    result = predictor.predict_json(data)
    if "paper_id" in data:
        del data["paper_id"]
    return result


if __name__ == '__main__':
    a = encode(dict(
        paper_id='1',
        title='This is a random title',
        abstract='This is a random abstract'
    ))
    b = encode(dict(
        paper_id='2',
        title='This is a random title',
        abstract='This is a random abstract'
    ))
    c = encode(dict(
        paper_id='1',
        title='This is another random title but the document has the same id',
        abstract='This is another random abstract'
    ))

    # identical text must give identical embeddings, regardless of paper_id
    assert a["embedding"] == b["embedding"]
    # different text must give different embeddings, even with the same paper_id
    assert a["embedding"] != c["embedding"]
```
You still need to have the `metadata.json` file. But it is necessary to have the following line so that the results make sense:

```python
predictor._dataset_reader.use_paper_feature_cache = False
```

Otherwise, the script tries to use the cache and only returns "random" results based on the `paper_id` instead of the actual text.