allenai / specter

SPECTER: Document-level Representation Learning using Citation-informed Transformers
Apache License 2.0

Can we get calculated embeddings directly into a variable? #10

Open kieron15 opened 4 years ago

kieron15 commented 4 years ago

The embed.py script calculates embeddings for the inputs in sample-metadata.json and stores them in the output file output.jsonl:

python scripts/embed.py \
--ids data/sample.ids --metadata data/sample-metadata.json \
--model ./model.tar.gz \
--output-file output.jsonl \
--vocab-dir data/vocab/ \
--batch-size 16 \
--cuda-device -1

Is there any way to call the function directly from a Python script and get the output embeddings in a variable? Something like what the web API does, but locally and offline. I do not want to load the model for each evaluation call.

armancohan commented 4 years ago

We are limited by AllenNLP's predict interface, which only supports reading input from a file and writing the corresponding output by running a predict command. If you need this feature, you will need to modify AllenNLP's behavior. A starting point is investigating the predict command file.

kieron15 commented 4 years ago

@armancohan I am not familiar with AllenNLP. I tried loading the model with the transformers library by defining the following class and using the config file from the SciBERT model:

from transformers import BertPreTrainedModel, BertModel

class GetSpecterEmbeddings(BertPreTrainedModel):

    def __init__(self, config):
        super(GetSpecterEmbeddings, self).__init__(config)
        self.bert = BertModel(config)

        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, labels=None):

        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask)

        sequence_output = outputs[0]  # per-token embeddings from the last layer
        pooled_output = outputs[1]    # pooled [CLS] representation
        return sequence_output, pooled_output

The following extra keys were found in the SPECTER model compared to the SciBERT model. I removed those keys and successfully loaded the remaining model weights (see the loading sketch after the list):

'text_field_embedder.token_embedder_bert._scalar_mix.gamma',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.0',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.1',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.2',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.3',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.4',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.5',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.6',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.7',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.8',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.9',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.10',
'text_field_embedder.token_embedder_bert._scalar_mix.scalar_parameters.11',
'venue_field_embedder.token_embedder_tokens.weight',
'feedforward._linear_layers.0.weight',
'feedforward._linear_layers.0.bias',
'layer_norm.gamma',
'layer_norm.beta',
'layer_norm_word_embedding.gamma',
'layer_norm_word_embedding.beta',
'layer_norm_word_embedding_venue.gamma',
'layer_norm_word_embedding_venue.beta'
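
For illustration, the loading step might look roughly like the sketch below, reusing the GetSpecterEmbeddings class above. The archive layout (a weights.th file inside model.tar.gz) and the key prefix text_field_embedder.token_embedder_bert.bert_model. are assumptions about how the AllenNLP checkpoint names the wrapped BERT weights; adjust them to whatever torch.load actually shows.

import torch
from transformers import BertConfig

# weights file assumed to be extracted from model.tar.gz
state_dict = torch.load('weights.th', map_location='cpu')

# assumed prefix of the underlying BERT weights inside the SPECTER checkpoint
prefix = 'text_field_embedder.token_embedder_bert.bert_model.'
bert_state_dict = {
    'bert.' + k[len(prefix):]: v
    for k, v in state_dict.items()
    if k.startswith(prefix)
}

# SciBERT config, as described above; the path is a placeholder
config = BertConfig.from_pretrained('path/to/scibert_scivocab_uncased')
model = GetSpecterEmbeddings(config)
# strict=False skips anything that does not line up instead of raising
missing, unexpected = model.load_state_dict(bert_state_dict, strict=False)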

I used vocab.txt from the scibert_scivocab_uncased model for the tokenizer and encoded the article as:

input_ids     : [CLS]Title_tokens.[SEP]Abstract_tokens[SEP]...padding till length 512
token_type_ids: 000000000000000000000001111111111111111111100000000000000000000000000
attention_mask: 111111111111111111111111111111111111111111100000000000000000000000000
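
For reference, the same encoding can be sketched with the transformers tokenizer as below; title and abstract are placeholder strings, and the padding/truncation arguments follow newer transformers versions.

from transformers import BertTokenizer

# vocab.txt from scibert_scivocab_uncased, as above; the path is a placeholder
tokenizer = BertTokenizer('path/to/scibert_scivocab_uncased/vocab.txt', do_lower_case=True)

title = 'representation learning of scientific documents'
abstract = 'we propose a new model for representing abstracts'

enc = tokenizer.encode_plus(
    title, abstract,               # title as segment A, abstract as segment B
    max_length=512,
    padding='max_length',          # pad with zeros up to length 512
    truncation=True,
    return_token_type_ids=True,
    return_attention_mask=True,
)
# enc['input_ids'], enc['token_type_ids'], enc['attention_mask'] follow the layout above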

The embeddings I got from the model using these inputs are different from those produced by the code in this repo. Below I have listed potential sources of this issue. Please help me identify the problem.

  1. Are any of the keys I omitted above while loading the SPECTER model weights used for getting embeddings? If yes, how do I modify the GetSpecterEmbeddings class above?
  2. Is the vocab file correct?
  3. Is the encoding of title and abstract correct?

ibeltagy commented 4 years ago

This is clever, thanks @kieron15 for trying it.

Are any of the keys ... used for getting embeddings?

Yes, you need the _scalar_mix parameters but not the rest. AllenNLP embeds the input as a linear combination of all the BERT layers instead of using the embeddings from the last one.

You will need this line in your __init__ function and this line in your forward function. You will also need to configure your BertModel to return embeddings of all layers.
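
For orientation, here is an approximation of that scalar mix, with parameter names mirroring the _scalar_mix.gamma and _scalar_mix.scalar_parameters.* keys listed earlier (the exact AllenNLP implementation may differ slightly):

import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learned weighted sum over the outputs of all 12 BERT encoder layers."""

    def __init__(self, num_layers=12):
        super().__init__()
        # one learnable scalar per layer plus a global gamma,
        # matching the _scalar_mix.* keys in the SPECTER checkpoint
        self.scalar_parameters = nn.ParameterList(
            [nn.Parameter(torch.zeros(1)) for _ in range(num_layers)]
        )
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs):
        # layer_outputs: list of [batch, seq_len, hidden] tensors, one per encoder layer
        weights = torch.softmax(torch.cat(list(self.scalar_parameters)), dim=0)
        mixed = sum(w * t for w, t in zip(weights, layer_outputs))
        return self.gamma * mixed

Note that when you configure BertModel to return all hidden states (e.g. config.output_hidden_states = True in transformers), the returned hidden_states tuple also includes the embedding-layer output, while the checkpoint has only 12 scalar parameters, so you would presumably mix the 12 encoder-layer outputs only.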

Is the vocab file correct?

Yes

Is the encoding of title and abstract correct?

input_ids: You might be missing this if statement, which limits the length of the title and abstract. max_sequence_length is set to 200.

token_type_ids: it is probably correct, but it would be good to double-check. Please try putting a breakpoint here and check that you got input_ids, token_type_ids, and attention_mask right.

A fourth reason could be dropout and layer norm. Are you using model.eval()?

We would love to merge this into our repo. Please feel free to open a PR with your model and a sample script to call it, and we will be happy to merge it.

kieron15 commented 4 years ago

@ibeltagy Thank you very much for the specific instructions.

I have modified the __init__ and forward functions in my class above as per your instructions.

By putting breakpoints in bert_token_embedder.py, I have verified that the outputs of the BERT model and the mix variable match.

But the output of the BertEmbedder class is not the final SPECTER embedding: it outputs a tensor of size torch.Size([1, 200, 768]) here. The max_sequence_length of 200 seems to apply to word tokens rather than word-piece tokens.

Please tell me how to get the final SPECTER embedding from this tensor. Also, can you point me to a way of obtaining the word-token offsets used in this function?

ibeltagy commented 4 years ago

As the config indicates,

        "title_encoder": {
            "type": "boe",
            "embedding_dim": 768
        },

we go from a sequence of word embeddings to a document embedding using a simple bag-of-embeddings approach that just sums up the embeddings as in here.
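
As a rough sketch of that pooling step (assuming a padding mask; AllenNLP's boe encoder can also average, but a plain sum is what is described above):

import torch

def bag_of_embeddings(token_embeddings, mask):
    # token_embeddings: [batch, seq_len, hidden]; mask: [batch, seq_len] with 1 for real tokens
    # zero out padded positions, then sum over the sequence dimension
    return (token_embeddings * mask.unsqueeze(-1)).sum(dim=1)

# e.g. the [1, 200, 768] tensor mentioned above becomes a [1, 768] document embedding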

To compute the offsets, first tokenize the sentence into tokens using this, then compute the offsets using this.
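
Roughly, those offsets map each original word token to a single word-piece position. A sketch, under the assumptions that the indexer prepends a [CLS] piece and uses end-of-word offsets (AllenNLP's default use_starting_offsets=False):

def compute_offsets(word_tokens, wordpiece_tokenizer):
    # wordpiece_tokenizer.tokenize(word) splits one word into its word pieces
    offsets = []
    cursor = 1  # position 0 is assumed to be [CLS]
    for word in word_tokens:
        cursor += len(wordpiece_tokenizer.tokenize(word))
        offsets.append(cursor - 1)  # index of the last word piece of this word
    return offsets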

kieron15 commented 4 years ago

Thanks @ibeltagy for all the help. I am almost there.

I have verified that I am getting the word-token list correctly (I had to use BertBasicWordSplitter instead of SimpleWordSplitter).

To get the word-piece tokens and offsets from this function, I need to pass the correct Vocabulary object as input. I tried passing an empty Vocabulary object, but it outputs offsets that are slightly off and a word-piece token list without the [CLS] and final [SEP] tokens.

  1. Please tell me how to initialize a Vocabulary to have the following contents:
Vocabulary with namespaces:
    Non Padded Namespaces: {'*labels', '*tags'}
    Namespace: venue, Size: 9017 
    Namespace: tokens, Size: 274941 
    Namespace: author, Size: 185030 
    Namespace: author_positions, Size: 10 
    Namespace: bert, Size: 31092 

I do have the author.txt, author_positions.txt, non_padded_namespaces.txt, tokens.txt, and venue.txt files from model.tar.gz, but how do I add them to a Vocabulary object?

  2. Also, how do I generate the vocab dict needed here from the SciBERT vocab file?
ibeltagy commented 4 years ago

Sorry for the late reply.

  1. Can you try:
from allennlp.data.vocabulary import Vocabulary
vocab = Vocabulary.from_files('path_to_the_vocabulary_dir')
  2. I think you don't need to instantiate WordpieceIndexer. Instead, try instantiating PretrainedBertIndexer with pretrained_model='scibert_scivocab_uncased'.
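
Putting the two suggestions together, a rough sketch (the vocabulary path is a placeholder, and the pretrained_model value follows the suggestion above; a local SciBERT directory, as in the later comments, should also work):

from allennlp.data import Token
from allennlp.data.token_indexers import PretrainedBertIndexer
from allennlp.data.vocabulary import Vocabulary

# vocabulary/ directory extracted from model.tar.gz
vocab = Vocabulary.from_files('path_to_the_vocabulary_dir')
indexer = PretrainedBertIndexer(pretrained_model='scibert_scivocab_uncased')

title_tokens = [Token(t) for t in 'representation learning of scientific documents'.split()]
indexed = indexer.tokens_to_indices(title_tokens, vocab, 'bert')
# indexed is expected to hold word-piece ids, offsets, and type ids under keys
# such as 'bert', 'bert-offsets', and 'bert-type-ids'
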
ibeltagy commented 4 years ago

@kieron15, I am curious if you have updates about this.

kieron15 commented 4 years ago

@ibeltagy Thanks for the input regarding PretrainedBertIndexer. Instantiating PretrainedBertIndexer does work, but the token_type_ids and attention_mask it returns still need to be modified before calling the SPECTER model. To verify that all the calculation steps are right, the transformers-based implementation will need to be tested on a number of examples.

Anyway, I just wanted the behavior already implemented in the web API of this model. I stumbled across this method of SpecterPredictor. Instantiating this class and using that method should give the embedding output in a variable. I am working on this.

malteos commented 3 years ago

I had the same issue and found the following code to work:

from allennlp.models import load_archive
from specter.predict_command import predictor_from_archive

# load to register
from specter.model import Model
from specter.data import DataReader, DataReaderFromPickled
from specter.predictor import SpecterPredictor

archive_path = './model.tar.gz'
metadata_path = './metadata.json'
included_text_fields = 'abstract title'
vocab_dir = 'data/vocab/'

overrides = f"{{'model':{{'predict_mode':'true','include_venue':'false'}},'dataset_reader':{{'type':'specter_data_reader','predict_mode':'true','paper_features_path':'{metadata_path}','included_text_fields': '{included_text_fields}'}},'vocabulary':{{'directory_path':'{vocab_dir}'}}}}"

archive = load_archive(archive_path, overrides=overrides)
predictor = predictor_from_archive(archive, predictor_name='specter_predictor', paper_features_path=metadata_path)

embed_papers = predictor.predict_json(dict(
    paper_id='your paper id',
    title='representation learning of scientific documents',
    abstract='we propose a new model for representing abstracts'
))

For some reason, the metadata.json file needs to exist for predictor_from_archive not to fail.
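
Judging by the later comments in this thread, the returned dict holds the vector under the "embedding" key, so the embedding ends up directly in a variable:

embedding = embed_papers['embedding']  # presumably a plain list of floats after AllenNLP's JSON sanitization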

malteos commented 3 years ago

@armancohan @ibeltagy

I'd like to use SPECTER as part of a larger PyTorch module, so using the Predictor is not feasible (also for training). The following code seems to run; however, the resulting embedding does not match the embedding from the Predictor.

Any ideas why this could be?

import torch
from allennlp.data import Token
from allennlp.data.token_indexers import PretrainedBertIndexer
from allennlp.models import load_archive

from allennlp.data.vocabulary import Vocabulary
from transformers import BertTokenizerFast

# load to register
from specter.model import Model

archive_path = './model.tar.gz'
metadata_path = './metadata.json'
included_text_fields = 'abstract title'
vocab_dir = 'data/vocab/'

overrides = f"{{'model':{{'predict_mode':'true','include_venue':'false'}},'dataset_reader':{{'type':'specter_data_reader','predict_mode':'true','paper_features_path':'{metadata_path}','included_text_fields': '{included_text_fields}'}},'vocabulary':{{'directory_path':'{vocab_dir}'}}}}"

# load from archive file
archive = load_archive(archive_path, overrides=overrides)

model = archive.model
vocab = Vocabulary.from_files(vocab_dir)

tokenizer_path = '/Volumes/data/repo/data/bert/scibert-scivocab-uncased'

indexer = PretrainedBertIndexer(tokenizer_path)
tokenizer = BertTokenizerFast.from_pretrained(tokenizer_path)

# tokenize the title into word pieces and index them under the 'bert' index name
tokens = [Token(t) for t in tokenizer.tokenize('representation learning of scientific documents')]
instance = indexer.tokens_to_indices(tokens, vocab, 'bert')
tensor_instance = {k: torch.tensor([v]) for k, v in instance.items()}
# the SPECTER model's forward takes the indexed title as source_title
model_out = model(source_title=tensor_instance)

print(model_out['embedding'])

SbstnErhrdt commented 3 years ago

I managed to run it with the following code:

import os

from allennlp.models import load_archive
from specter.predict_command import predictor_from_archive

# load to register
from specter.model import Model
from specter.data import DataReader, DataReaderFromPickled
from specter.predictor import SpecterPredictor

# load the model files
archive_path = os.path.dirname(os.path.abspath(__file__)) + '/model.tar.gz'
metadata_path = os.path.dirname(os.path.abspath(__file__)) + '/metadata.json'
included_text_fields = 'abstract title'
vocab_dir = os.path.dirname(os.path.abspath(__file__)) + '/data/vocab/'

overrides = f"{{'model':{{'predict_mode':'true','include_venue':'false'}},'dataset_reader':{{'type':'specter_data_reader','predict_mode':'true','paper_features_path':'{metadata_path}','included_text_fields': '{included_text_fields}'}},'vocabulary':{{'directory_path':'{vocab_dir}'}}}}"

archive = load_archive(archive_path, overrides=overrides)
predictor = predictor_from_archive(archive, predictor_name='specter_predictor', paper_features_path=metadata_path)
# disable paper cache
predictor._dataset_reader.use_paper_feature_cache = False

def encode(data: dict):
    """
    Encodes the data with the SPECTER encoder.
    :param data: dict with 'title' and 'abstract' (any 'paper_id' is overwritten)
    :return: the predictor output, including the 'embedding' vector
    """
    # ignore the paper id
    data["paper_id"] = " "
    result = predictor.predict_json(data)
    if "paper_id" in data:
        del data["paper_id"]
    return result

if __name__ == '__main__':
    a = encode(dict(
        paper_id='1',
        title='This is a random title',
        abstract='This is a random abstract'
    ))
    b = encode(dict(
        paper_id='2',
        title='This is a random title',
        abstract='This is a random abstract'
    ))
    c = encode(dict(
        paper_id='1',
        title='This is a another random title but the document has the same id',
        abstract='This is another random abstract'
    ))
    assert a["embedding"] == b["embedding"]
    assert a["embedding"] != c["embedding"]

You still need to have the metadata.json file.

But it is necessary to include the following line so that the results make sense:

predictor._dataset_reader.use_paper_feature_cache = False

Otherwise, the script uses the paper-feature cache and the result depends only on the "paper_id", not on the actual title and abstract.