Hi and thanks for reporting this bug - I don't think the parser is the cause; the error looks like it is triggered by an incompatibility between your transformers tokenizer version and the version the model was trained with. I assume you're using the pre-trained eng_flair_nner_distilbert.pt in models/_sequence_taggers?
I can confirm that that model works with:
flair 0.6.1
torch 1.6.0+cu101
transformers 3.5.1
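(If you want to double-check what your environment actually has, a quick convenience snippet - nothing xrenner-specific - is:)

```python
# Print the versions of the packages relevant to this issue
import flair
import torch
import transformers

print("flair:", flair.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```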
So transformers itself could be the problem - can you try 3.5.1? You may also want to try out this newer model based on Electra rather than DistilBERT, which is a bit more accurate and trained on the latest GUM7:
https://corpling.uis.georgetown.edu/amir/download/eng_flair_nner_electra_gum7.pt
To use this, you would need to edit the English model's config.ini file (if the model is not yet unzipped, you will need to unzip eng.xrm to do that) and set:
```ini
# Optional path to serialized pre-trained sequence classifier for entity head classification
sequencer=eng_flair_nner_electra_gum7.pt
```
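(If you'd rather script the unpacking, here is a minimal sketch - it assumes eng.xrm is a standard zip archive sitting in xrenner's models/ directory, so adjust the paths to your install:)

```python
# Sketch: unpack the English model so its config.ini can be edited.
# Assumes models/eng.xrm is a plain zip archive; adjust paths as needed.
import zipfile

with zipfile.ZipFile("models/eng.xrm") as model_zip:
    model_zip.extractall("models/eng/")

# Then set in models/eng/config.ini:
#   sequencer=eng_flair_nner_electra_gum7.pt
```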
Finally, as an accurate parser for input to the system, I would recommend a transformer-based parser over Spacy, such as Diaparser:
https://github.com/Unipisa/diaparser
Here is a highly accurate pretrained model for GUM7:
https://corpling.uis.georgetown.edu/amir/download/en_gum7.electra-base.diaparser.pt
Hope that helps!
Thanks a lot for the quick reply and your suggestions, they were very helpful! Yes, exactly, I'm using the pre-trained eng_flair_nner_distilbert.pt.
I upgraded transformers to 3.5.1, so that I have the same setup as you:
flair 0.6.1
torch 1.6.0
transformers 3.5.1
I cannot install torch v1.6.0+cu101 on macOS, as far as I know, hence I'm using torch 1.6.0. Unfortunately, the same error still occurs if I use the pre-trained eng_flair_nner_distilbert.pt. With the Electra model you suggested, however, the code runs fine. I tried both models (DistilBERT and Electra) with (i) a string in CoNLL format, (ii) the Diaparser you kindly suggested (with the pretrained model for GUM7), and (iii) the Spacy parser. While it works with the Spacy output, the Diaparser output does not get annotated at all. I tried this:
```python
import xrenner
from diaparser.parsers import Parser

txt = "Trees play a significant role in reducing erosion and moderating the climate. They remove carbon dioxide from the atmosphere and store large quantities of carbon in their tissues. Trees and forests provide a habitat for many species of animals and plants. Tropical rainforests are among the most biodiverse habitats in the world. Trees provide shade and shelter, timber for construction, fuel for cooking and heating, and fruit for food as well as having many other uses. In parts of the world, forests are shrinking as trees are cleared to increase the amount of land available for agriculture. Because of their longevity and usefulness, trees have always been revered, with sacred groves in various cultures, and they play a role in many of the world's mythologies."

parser = Parser.load('en_gum7.electra-base.diaparser.pt')
data = parser.predict(txt, text='en')
xrenner = xrenner.Xrenner()
result = xrenner.analyze(data, "html")
print(result)
```
Coercing the Diaparser output to a string also didn't change anything. Do you maybe see what I'm doing wrong here?
If the Electra model works, I wouldn't bother with getting DistilBERT to run; the Electra one is about +4 F1 on entity type recognition.
For the parser I should have been clearer: Diaparser is just a parser, not an NLP toolkit like Stanza etc. It only predicts dependency attachments and relation types on preprocessed data (tokenized and sentence-split), so you will also need to get POS tags and lemmas from somewhere else. However, it is substantially more accurate than, say, Stanza (coincidentally also about +4 LAS out of the box). To run it you need to feed it a list of sentences, each a list of tokens (so a list of lists) - see the Diaparser documentation for details and the sketch below. If you can tolerate somewhat lower accuracy, Stanza should work pretty well too, and it predicts everything from plain text. I've also seen Trankit around, which is much like Stanza but transformer-based, so that might be worth a try as well (I think it uses RoBERTa for everything?)
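To make the list-of-lists point concrete, here is a minimal sketch. The tokenization is hand-rolled just for the example, and note that Diaparser leaves the POS and lemma columns as "_", so in practice you would merge in tags and lemmas from a tagger before handing the CoNLL string to xrenner:

```python
# Sketch: parse pre-tokenized sentences with Diaparser and pass the
# CoNLL-formatted result to xrenner as a string.
# Caveat: Diaparser fills only the head/deprel columns; POS and lemma
# columns remain "_" and would normally come from a separate tagger.
import xrenner
from diaparser.parsers import Parser

# A list of sentences, each a list of tokens (list of lists)
sentences = [
    ["Trees", "play", "a", "significant", "role", "in", "reducing",
     "erosion", "and", "moderating", "the", "climate", "."],
    ["They", "remove", "carbon", "dioxide", "from", "the", "atmosphere", "."],
]

parser = Parser.load("en_gum7.electra-base.diaparser.pt")
dataset = parser.predict(sentences)

# Each parsed sentence stringifies to CoNLL-style lines; join the
# sentences with blank lines to get one document-level string.
conll = "\n\n".join(str(sent) for sent in dataset.sentences)

xrnr = xrenner.Xrenner()
print(xrnr.analyze(conll, "html"))
```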
Hi, I'm trying to get xrenner to work, but I run into problems with the tokenizer from the transformers package. Here is the code I'm trying to run:

This prompts the following AttributeError:

I suspect that this has something to do with the format of the data object. In the documentation it is not clear which parser you use to transform/annotate plaintext into the CoNLL format, which is why I'm using an already parsed text string in the right format. I tried the spacy_conllu parser as well as the conllu parser, but neither works for me. Would it be possible for you to provide an example from A to Z, including parsing plaintext to the CoNLL format? I'm using python v3.7.11 with the following package versions:
Thanks a lot in advance!