flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Inconsistency between Sentence length and BERT tokens #831

Closed AndreFCruz closed 5 years ago

AndreFCruz commented 5 years ago

When trying to embed a long sentence, I get an error because BERT is limited to sequences of 512 tokens (this is expected).

Is there a way I can guard against this?

sent.tokens = sent.tokens[:512]

I'd think that this would be enough, but the number of tokens in the sentence is inconsistent with the number of tokens that the BERT encoder sees.

Is there a way to know the BERT-tokenized number of tokens before trying to embed the sequence?

For further explanation, when running the following lines, I get the following error:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

embs = BertEmbeddings('bert-base-multilingual-cased')
sent = Sentence('<very long sentence>...')
sent.tokens = sent.tokens[:512]  # keep only the first 512 flair tokens
len(sent)  # == 512
embs.embed(sent)  # raises ValueError

ValueError: Token indices sequence length is longer than the specified maximum  sequence length for this BERT model (889 > 512). Running this sequence through BERT will result in indexing errors

stefan-it commented 5 years ago

Hi @AndreFCruz,

you could try to use the tokenizer from the pytorch_pretrained_bert library:

from pytorch_pretrained_bert import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = """
Munich (/ˈmjuːnɪk/; German: München [ˈmʏnçn̩] (About this soundlisten);[3] Austro-Bavarian: Minga [ˈmɪŋ(ː)ɐ] or more common Minna [ˈmɪna]; Latin: Monachium) is the capital and most populous city of Bavaria, the second most populous German federal state. With a population of around 1.5 million,[4] it is the third-largest city in Germany, after Berlin and Hamburg, as well as the 12th-largest city in the European Union. The city's metropolitan region is home to 6 million people.[5] Straddling the banks of the River Isar (a tributary of the Danube) north of the Bavarian Alps, it is the seat of the Bavarian administrative region of Upper Bavaria, while being the most densely populated municipality in Germany (4,500 people per km²). Munich is the second-largest city in the Bavarian dialect area, after the Austrian capital of Vienna. The city is a global centre of art, science, technology, finance, publishing, culture, innovation, education, business, and tourism and enjoys a very high standard and quality of living, reaching first in Germany and third worldwide according to the 2018 Mercer survey,[6] and being rated the world's most liveable city by the Monocle's Quality of Life Survey 2018.[7] According to the Globalization and World Rankings Research Institute Munich is considered an alpha-world city, as of 2015.[8] Munich is a major international center of engineering, science, innovation, and research, exemplified by the presence of two research universities, a multitude of scientific institutions in the city and its surroundings, and world class technology and science museums like the Deutsches Museum and BMW Museum.[9] Munich houses many multinational companies and its economy is based on high tech, automobiles, the service sector and creative industries, as well as IT, biotechnology, engineering and electronics among many others. The name of the city is derived from the Old/Middle High German term Munichen, meaning "by the monks". 
It derives from the monks of the Benedictine order, who ran a monastery at the place that was later to become the Old Town of Munich; hence the monk depicted on the city's coat of arms. Munich was first mentioned in 1158. Catholic Munich strongly resisted the Reformation and was a political point of divergence during the resulting Thirty Years' War, but remained physically untouched despite an occupation by the Protestant Swedes.[10][citation needed] Once Bavaria was established as a sovereign kingdom in 1806, it became a major European centre of arts, architecture, culture and science. In 1918, during the German Revolution, the ruling house of Wittelsbach, which had governed Bavaria since 1180, was forced to abdicate in Munich and a short-lived socialist republic was declared. 
"""
tokenized_text = tokenizer.tokenize(text)

print(len(tokenized_text))

No error is thrown at this point, and the output is the number of subtokens: 595 for this example text.

The error only appears when you execute:

indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

At that point you should see the error message, because the sequence is longer than 512 subtokens.

So I think you can just use the tokenizer.tokenize function as a sanity check!
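
A minimal sketch of such a guard, not something taken from flair itself: it keeps the longest prefix of words whose subtoken count fits a budget. The helper name truncate_to_subtoken_budget, the toy_tokenize stand-in, and the default budget of 510 (512 minus the [CLS] and [SEP] positions) are assumptions made for illustration. It also assumes subtokens are counted word by word, so the total may differ slightly from tokenizing the whole text at once.

```python
from typing import Callable, List


def truncate_to_subtoken_budget(
    words: List[str],
    tokenize: Callable[[str], List[str]],
    max_subtokens: int = 510,  # assumption: 512 minus [CLS] and [SEP]
) -> List[str]:
    """Return the longest prefix of `words` whose subtokens fit the budget."""
    kept: List[str] = []
    used = 0
    for word in words:
        n_subtokens = len(tokenize(word))
        if used + n_subtokens > max_subtokens:
            break  # adding this word would exceed the budget
        kept.append(word)
        used += n_subtokens
    return kept


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in tokenizer: chunks of up to 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


words = "Munich is the capital of Bavaria".split()
print(truncate_to_subtoken_budget(words, toy_tokenize, max_subtokens=4))
```

With the real tokenizer you would pass tokenizer.tokenize as the tokenize argument and the token texts of the sentence (for example [t.text for t in sent.tokens]) as words, then cut sent.tokens to len(kept) instead of a fixed 512.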

AndreFCruz commented 5 years ago

This does work to figure out the number of BERT-tokenized tokens, thanks for the answer!