UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Long sentence embedding #364

Open leleyi opened 4 years ago

leleyi commented 4 years ago

Is there a limit on sentence length? I get the same result when I use a very long sentence. How should I handle this? Thanks

nreimers commented 4 years ago

The pre-trained models set the max sequence length to 128 word pieces, but this can be increased if needed. BERT in general has a limit of 510 word pieces.

Inputs longer than this will be truncated.
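
As a minimal sketch of what this means in code (not from the thread; it assumes the xlm-r model mentioned later in this discussion and a recent sentence-transformers build):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('xlm-r-100langs-bert-base-nli-stsb-mean-tokens')
transformer = model._first_module()               # the wrapped models.Transformer module
print(transformer.max_seq_length)                 # typically 128 for the pre-trained models

text = "a fairly long sentence " * 50
print(len(transformer.tokenizer.tokenize(text)))  # word pieces beyond the limit are cut off by encode()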

leleyi commented 4 years ago

Thank you. And how can I increase the max sequence length?

nreimers commented 4 years ago

Not sure if it is in the latest release already, but you can try:

model.max_seq_length = 510

Otherwise, the following works:

model._first_module().max_seq_length = 510
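
A usage sketch putting both options together (which one is available depends on the installed version; the model name is just an example from this thread):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('xlm-r-100langs-bert-base-nli-stsb-mean-tokens')

model.max_seq_length = 510                    # option 1: newer builds expose the limit on the model itself
model._first_module().max_seq_length = 510    # option 2: set it on the first (Transformer) module directly

embedding = model.encode("word " * 1000)      # now truncated at 510 word pieces instead of 128
print(embedding.shape)
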
leleyi commented 4 years ago

Thank you very much; it helps me a lot.

thesby commented 3 years ago

@nreimers After setting model.max_seq_length = 510, when I try to encode a text of about 2000 words, I get an exception:

   1812         # remove once script supports set_grad_enabled
   1813         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1814     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1815
   1816

IndexError: index out of range in self

But if I don't change model.max_seq_length at all, there is no exception with the same long text.

nreimers commented 3 years ago

Try a smaller value like model.max_seq_length = 500

Some models might add more than 2 special tokens.
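
A hedged sketch for deriving a safe value from the model itself instead of guessing (assuming model is the SentenceTransformer in question; the attribute names follow sentence-transformers' internals and are not from this thread):

transformer = model._first_module()                                      # the models.Transformer module
num_special = transformer.tokenizer.num_special_tokens_to_add()         # 2 for BERT and XLM-R
max_positions = transformer.auto_model.config.max_position_embeddings   # 512 for BERT, 514 for XLM-R

# subtract the special tokens plus 2 more for RoBERTa-style models that reserve position ids 0 and 1
# (slightly conservative for BERT, exact for XLM-R)
model.max_seq_length = max_positions - num_special - 2
print(model.max_seq_length)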

thesby commented 3 years ago

@nreimers Great, thank you

PhilipMay commented 3 years ago

@nreimers when I train a sentence embedding model as in this example: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py

Do you think it makes a difference if I decrease the max sequence length of the language model (to 128) that I use for training?

nreimers commented 3 years ago

Hi @PhilipMay It depends on your training dataset.

How many sentences are longer than 128 word pieces? If this is only a small fraction, increasing or decreasing the limit will not change anything.

If most sentences are longer than 128, then changing the value can have an impact. The model then just trains on, e.g., the first 128 word pieces of the respective sentences.
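
One way to check this, as a sketch (assuming model is the SentenceTransformer being trained and train_sentences is a hypothetical list of training strings):

tokenizer = model._first_module().tokenizer
limit = 128

n_long = sum(len(tokenizer.tokenize(s)) > limit - 2 for s in train_sentences)  # -2 for the special tokens
print(f"{n_long} of {len(train_sentences)} sentences exceed {limit} word pieces")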

thesby commented 3 years ago

@nreimers I use the model xlm-r-100langs-bert-base-nli-stsb-mean-tokens, which supports 512 tokens. Can I raise the maximum to 1024 by setting model.max_seq_length = 1024?

An exception occurs when I encode a long text of about 2000 words with model.max_seq_length = 1024:

   1812         # remove once script supports set_grad_enabled
   1813         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1814     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1815
   1816

IndexError: index out of range in self

nreimers commented 3 years ago

Hi @thesby BERT is limited to 512 tokens, some of which are reserved for special tokens like [CLS] and [SEP]. The same holds for XLM-R. Setting max_seq_length to values larger than 509 / 510 will not work.

thesby commented 3 years ago

I got it, thank you

lefnire commented 3 years ago

(Correct me if I'm wrong UKPLab) you could also use a transformers model that handles larger sequences, like Longformer:

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('allenai/longformer-base-4096')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

The limit is 4096 in that case. Or you could do some smart batching, where you embed larger chunks (e.g. paragraphs, possibly with Longformer) and average them, e.g. if you want the embedding of a document or of multiple documents. See gpu/nlp.py for batching over multiple paragraphs/documents, then just np.mean(embeddings, axis=0); a sketch of the averaging step follows.
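
A sketch of that averaging idea (not the actual gpu/nlp.py code; it assumes model is a SentenceTransformer such as the Longformer-based one built above, and a hypothetical local text file):

import numpy as np

document = open("some_document.txt").read()                   # hypothetical long document
paragraphs = [p for p in document.split("\n\n") if p.strip()]

embeddings = model.encode(paragraphs)                         # one embedding per paragraph
doc_embedding = np.mean(embeddings, axis=0)                   # single vector for the whole document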

jtank38 commented 3 years ago

I too was looking for something like Longformer. I basically want document embeddings. I have tried averaging sentence embeddings (using sentence-transformers), but it seems to be a very naive approach.

lefnire commented 3 years ago

@jtank38 I think embeddings.mean() isn't naive - it's used in the UKPLab examples. But doing it over individual sentences will probably dilute the signal a lot, IMO; it seems better to average longer chunks like paragraphs?

thesby commented 3 years ago

@lefnire I am using this model too, but the output embeddings are all very similar. So I want to convert xlm-r-bert-base-nli-stsb-mean-tokens into a Longformer model and then load the Longformer with sentence_transformers.

But I am stuck at the first step: how do I convert the model into a Longformer model? Any suggestions?

nreimers commented 3 years ago

@thesby Not sure how to do that. You would need to create a Longformer structure similar to XLM-R, and then change the attention mechanism so that it does not use full attention but the attention from Longformer.

This does not sound simple to do.

thesby commented 3 years ago

@nreimers I tried the tutorial https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb, but got an error asking me to make sure that '/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/' is the correct path to a directory containing a config.json file.

But the config.json exists. I found that the format of the config.json from sentence-transformers is very different from the original transformers one, so it's difficult to convert this model.

nreimers commented 3 years ago

Check the 0_Transformer folder; it contains the XLM-R model.

The config.json in the top-level folder is for SentenceTransformer and stores the information about which modules are included in the model (transformer model, pooling layer, etc.).
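
A sketch for verifying this (the path is the local cache directory mentioned earlier in this thread; note that '~' has to be expanded explicitly):

import os
from transformers import AutoConfig, AutoModel, AutoTokenizer

path = os.path.expanduser(
    "~/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/0_Transformer"
)
config = AutoConfig.from_pretrained(path)        # the standard transformers config.json lives here
model = AutoModel.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)
print(config.model_type, config.max_position_embeddings)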

thesby commented 3 years ago

Using 0_Transformer doesn't work. The same error occurs.

thesby commented 3 years ago

@nreimers Yes, you are right. I got the error because Jupyter does not expand the path "~/.cache/xxx". When I use an absolute path, there is no problem. Here is the conversion script I adapted:

import logging
import os
import math
from dataclasses import dataclass, field
from transformers import AutoTokenizer, AutoModelForMaskedLM, RobertaForMaskedLM, RobertaTokenizerFast, TextDataset, DataCollatorForLanguageModeling, Trainer
from transformers import TrainingArguments, HfArgumentParser
from transformers.modeling_longformer import LongformerSelfAttention
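# note: in transformers >= 4.0 this import path is transformers.models.longformer.modeling_longformer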
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
import torch
import numpy as np

class RobertaLongSelfAttention(LongformerSelfAttention):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)

class RobertaLongForMaskedLM(RobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)

model_base_name = "xlm-r-100langs-bert-base-nli-stsb-mean-tokens"
def create_long_model(save_model_to, attention_window, max_pos):
    model = AutoModelForMaskedLM.from_pretrained("/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/0_Transformer")
    tokenizer = AutoTokenizer.from_pretrained("/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/0_Transformer", model_max_length=max_pos)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos-2
    print("k", k, "step", step, "weight.shape", model.roberta.embeddings.position_embeddings.weight.shape)
    while k < max_pos - 1:
        print("k", k, new_pos_embed.shape)
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed
    model.roberta.embeddings.position_ids = torch.from_numpy(np.arange(new_pos_embed.shape[0], dtype=np.int64)[np.newaxis, :])  # position ids must be int64 (Long) to index the embedding

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer

@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=4096, metadata={"help": "Maximum position"})

parser = HfArgumentParser((TrainingArguments, ModelArgs,))

training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    '--output_dir', 'tmp',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '3000',
    '--logging_steps', '500',
    '--save_steps', '500',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '8',
    '--per_gpu_train_batch_size', '2',  # 32GB gpu with fp32
    '--gradient_accumulation_steps', '32',
    '--evaluate_during_training',
    '--do_train',
    '--do_eval',
])
training_args.val_datapath = 'wikitext-103-raw/wiki.valid.raw'
training_args.train_datapath = 'wikitext-103-raw/wiki.train.raw'

# Choose GPU
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_path = f'{training_args.output_dir}/{model_base_name}-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)

logger.info(f'Converting roberta-base into {model_base_name}-{model_args.max_pos}')
create_long_model(save_model_to=model_path, attention_window=model_args.attention_window, 
                  max_pos=model_args.max_pos)
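
A hedged follow-up, mirroring the pattern in the referenced convert_model_to_long.ipynb (not verified here): the converted checkpoint has to be reloaded with the custom class defined above rather than with AutoModel, so that the LongformerSelfAttention layers (including the *_global projections) are reconstructed. XLM-R shares the RoBERTa architecture, so the Roberta* classes apply.

long_model = RobertaLongForMaskedLM.from_pretrained(model_path)
long_tokenizer = AutoTokenizer.from_pretrained(model_path)
# the notebook then continues with MLM pre-training (see the wikitext-103 paths above) before the long model is used
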
sadakmed commented 3 years ago

Hi @nreimers,

which dataset do you think would be good for fine-tuning, either a base model at the full length of 512 or a large model (1024)? In my case, increasing model.max_seq_length to cover long texts resulted in low performance, so I ended up averaging chunk embeddings with max_seq_length=128.

nreimers commented 3 years ago

I am sadly not aware of any good datasets. Maybe some summarization datasets could work?