flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Bert Embedding Issue : #396

Closed nareshmungpara closed 5 years ago

nareshmungpara commented 5 years ago

I am using BERT embeddings and I am getting this error: RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191. I have provided train, test and dev.txt in the format word word ... word <label>. To do so, I changed data_fetcher.py to read a whole sentence and assign it a single label, rather than assigning a label to every word in the sentence.

By contrast, I am able to start training when I use other embeddings, such as Flair embeddings.

alanakbik commented 5 years ago

Hi @nareshmungpara, could you paste a full minimal code example with one sentence to reproduce the error?

abrodecka commented 5 years ago

Hi, I have the same error. Below is my minimal code example:

import torch
from flair.embeddings import StackedEmbeddings
from flair.data import Sentence
from flair.embeddings import BertEmbeddings, DocumentPoolEmbeddings

embeddings = DocumentPoolEmbeddings([BertEmbeddings('bert-base-multilingual-cased')])
text = "Litwo! Ojczyzno moja! Ty jesteś jak zdrowie. Ile cię trzeba było widać. Zwrócona na przeciwnej zajadłość dowiodę, że zamczysko wzięliśmy w okolicy. i tam do łona a resztę rozdzielono między wierzycieli. Zamku żaden wziąść nie może. Widać, że serce mu słowo ciocia koło uch brzęczało ciągle Sędziemu tłumaczył dlaczego urządzenie pańskie przeinaczył we brzozowym gaju stał dwór szlachecki, z liczby kopic, co wyszła. jeszcze skinieniem głowy potakiwał. Sędzia go grzecznie, na wieczerzę. on ekwipaż parskali ze cztery. Tymczasem na wybór wziął czerstwość i knieje więc szanują przyjaciół jak długo uczyć, ażeby pan Wojski z kołka zdjęty do nas wytuza. U nas starych więcej książkowej nauki. Ale stryj na utrzymanie. Lecz mniej pilni. Tadeusz Telimenie, Asesor zaś Gotem. Dość, że ważny i wkrótce wielki post - nowe wiary, prawa, toalety. Miała nad umysłami wielką moc ta chwała należy chartu Sokołowi. Pytano zdania bo tak i stoi wypisany każdy mimowolnie porządku pilnował. Bo nie zawadzi. Bliskość piwnic wygodna służącej czeladzi. Tak każe przyzwoitość). nikt tam ma jutro sam markiz przybrał tytuł markiza. Jakoż, kiedy karę na nim spostrzegł się, że nam, że odgłos trąbki i po kryjomu. Chłopiec, co dzień powszedni. Nóżek, choć suknia krótka, oko pańskie przeinaczył we śnie. Podróżny zląkł się, spójrzał, lecz nim odszedł, wyskoczył na stosach Moskali siekąc wrogów, a drugą do usług publicznych sposobił z odmienną modą, pod lasem zwaliska. Po drodze Woźny po gromie: w które na które na nim i silni do nas wytuza. U nas powrócisz cudem Gdy w bitwie, gdzie chce, wchodzi byle."
sentence_text = Sentence(text)
embeddings.embed(sentence_text)

nareshmungpara commented 5 years ago

Hello guys,

I got it working now. The only change I made was to the model_save path where the model is saved. Let me know if you also get it working and what the issue was.

dennisverspuij commented 5 years ago

Hello, what is the status of this issue? I have the exact same error. The following also results in RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191 :

from flair.embeddings import BertEmbeddings
from flair.data import Sentence
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')
bert_embedding.embed(Sentence(
    '''In de OVER DE FUNCTIE Develop segmentations predictive models and statistical insights using appropriate tools Analyse data deeply to understand patterns and trends Transform these insights into actionable reports targeting algorithms and personalisation filters Understand key drivers of rules/model variation and communicate insights to regional and global executives Provide technical expertise in statistical analysis mathematical modelling data mining/machine learning Partner with business units to challenge their thinking provide direction Work with global teams on ad hoc projects and take a key role in international projects Share best practices with analysts and managers located around the world Have fun while driving innovation at one of the top brands on the Internet with the help of cutting edge technologies OVER DE FUNCTIE Develop segmentations predictive models and statistical insights using appropriate tools Analyse data deeply to understand patterns and trends Transform these insights into actionable reports targeting algorithms and personalisation filters Understand key drivers of rules/model variation and communicate insights to regional and global executives Institutionalise customer analytics in business processes and decision making both strategic and tactical Provide technical expertise in statistical analysis mathematical modelling data mining/machine learning Partner with business units to challenge their thinking provide direction Work with global teams on ad hoc projects and take a key role in international projects Share best practices with analysts and managers located around the world Have fun while driving innovation at one of the top brands on the Internet with the help of cutting edge technologies p><span style="font family arial helvetica sans serif font size small;">We do 't  have a long description of requirements for the various functions.</span></p><p><span style="font family arial helvetica sans serif font size 
small;">Project/Program Manager to digital internal processes Strategic process digitalization analyze the existing tooling and advise/consultant span></p><p><span style="font family arial helvetica sans serif font size small;">Project Manager development in domain of Insurance/Finance</span></p><p><span style="font family arial helvetica sans serif font size small;">Service Oriented Architect senior capable of leading a team of 5/10 people from the customer experience on SOA or ideally in Oracle OSB</span></p><p><span style="font family arial helvetica sans serif font size small;">2 application architects in the insurance domain/process</span></p><p><span style="font family arial helvetica sans serif font size small;"></span><span style="font family arial helvetica sans serif font size small;">1 Integration architect SOA </span></p><p><span style="font family arial helvetica sans serif font size small;">1 Oracle OSB Oracle Service Bus Engineer/Developer</span></p Project description Job Mission The Connectivity Service Manager is responsible for the clients Connectivity services DC LAN/LAN/WAN through out the entire lifespan'''
))

dennisverspuij commented 5 years ago

OK, I see this is caused by the texts exceeding the maximum sequence length of the BERT model (for most models 512 in-vocabulary tokens, separators included, hence roughly 256 words). As of pytorch-pretrained-bert>=0.5.0, an understandable ValueError is raised instead.
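
To illustrate the failure mode described above, here is a minimal, library-free sketch. It assumes that BERT fetches one learned position vector per token from a table with 512 rows, which is the usual limit for the standard checkpoints. The table is a plain Python list, and the names MAX_POSITIONS, position_table and lookup_positions are hypothetical; they are not part of flair or pytorch-pretrained-bert.

```python
MAX_POSITIONS = 512  # assumed size of BERT's position-embedding table

# Stand-in for the learned position embeddings: one row per position.
position_table = [[float(i)] for i in range(MAX_POSITIONS)]


def lookup_positions(num_tokens):
    """Fetch one position row per token, as an embedding lookup would."""
    return [position_table[i] for i in range(num_tokens)]


# A sequence that fits: 510 subword tokens plus [CLS] and [SEP].
assert len(lookup_positions(510 + 2)) == MAX_POSITIONS

# One token too many: position 512 has no row, so the lookup fails.
try:
    lookup_positions(MAX_POSITIONS + 1)
except IndexError as err:
    print("lookup failed:", err)
```

The real lookup happens inside a tensor library rather than on a Python list, which would explain why the error surfaces as a low-level RuntimeError instead of a message about sequence length.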

Atvar commented 5 years ago

I ran into this issue and fixed it on my end by running the BERT tokenizer over my sentences and trimming any sentence that exceeded 512 BERT tokens (really 510, since the [CLS] and [SEP] tokens are added to each sentence downstream). The list of trimmed BERT tokens then has to be converted back into text before passing it to BertEmbeddings.embed. BERT tokenization is not completely reversible, since information such as the text's original casing is lost, but I'm using an uncased BERT model, so I can live with that.
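
A rough sketch of the trim-and-rebuild step described above, not the code actually used in this thread. It assumes the tokens come from a WordPiece tokenizer that marks word continuations with a leading "##"; the helper names trim_wordpieces and wordpieces_to_text are hypothetical. When the cut would land in the middle of a word, the whole word is dropped so that the rebuilt text contains only complete words.

```python
NUM_SPECIAL_TOKENS = 2  # [CLS] and [SEP] are added downstream


def trim_wordpieces(pieces, max_len=512):
    """Keep at most max_len - 2 pieces without cutting a word in half."""
    budget = max_len - NUM_SPECIAL_TOKENS
    if len(pieces) <= budget:
        return list(pieces)
    end = budget
    # pieces[end] is the first dropped piece. If it continues a word,
    # back up until the cut sits on that word's first piece.
    while end > 0 and pieces[end].startswith("##"):
        end -= 1
    return list(pieces[:end])


def wordpieces_to_text(pieces):
    """Glue '##' continuation pieces back onto the preceding word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


pieces = ["bert", "em", "##bed", "##ding", "issue"]
print(wordpieces_to_text(trim_wordpieces(pieces)))
```

The rebuilt string separates every token with a single space, so punctuation that the tokenizer split off ends up surrounded by spaces. That changes the surface text but should leave the tokens unchanged when the text is tokenized again.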

flair.embeddings.py lines 1426-1434 compute the longest sentence length in a batch of BERT-tokenized sentences and seem to use this value for the max sequence length downstream, but there's no guarantee that this value doesn't exceed 512 BERT tokens, which might explain the index errors.
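
One way to make such a case fail early is to validate the longest sequence in the batch before any embedding lookup. The sketch below is only an illustration under assumptions: each sentence is already a list of BERT tokens, [CLS] and [SEP] are still to be added, and the limit is 512. The function names are hypothetical and do not reproduce the flair code referenced above.

```python
MAX_SEQUENCE_LENGTH = 512  # assumed limit of the BERT model in use
NUM_SPECIAL_TOKENS = 2     # [CLS] and [SEP]


def longest_sequence_length(batch):
    """Length of the longest tokenized sentence, special tokens included."""
    longest = max((len(tokens) for tokens in batch), default=0)
    return longest + NUM_SPECIAL_TOKENS


def check_batch(batch, limit=MAX_SEQUENCE_LENGTH):
    """Return the batch's max length, or raise if it exceeds the limit."""
    longest = longest_sequence_length(batch)
    if longest > limit:
        raise ValueError(
            f"Longest sentence has {longest} tokens including [CLS]/[SEP], "
            f"but the model accepts at most {limit}."
        )
    return longest


short_batch = [["hello", "world"], ["a"] * 10]
print(check_batch(short_batch))

try:
    check_batch([["a"] * 511])
except ValueError as err:
    print(err)
```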