amir-zeldes / xrenner

eXternally configurable REference and Non Named Entity Recognizer

Example case including parser #108

Closed: lucienbaumgartner closed this issue 7 months ago

lucienbaumgartner commented 3 years ago

Hi, I'm trying to get xrenner to work, but I'm running into problems with the tokenizer from the transformers package. Here is the code I'm trying to run:

import xrenner

data = """
1   The the DT  DT  _   4   det _   _
2   New New NNP NNP _   3   nn  _   _
3   Zealand Zealand NNP NNP _   4   nn  _   _
4   government  government  NN  NN  _   5   nsubj   _   _
5   intends intend  VBZ VBZ _   0   root    _   _
6   to  to  TO  TO  _   7   aux _   _
7   hold    hold    VB  VB  _   5   xcomp   _   _
8   two two CD  CD  _   9   num _   _
9   referendums referendum  NNS NNS _   7   dobj    _   _
10  to  to  TO  TO  _   11  aux _   _
11  reach   reach   VB  VB  _   7   vmod    _   _
12  a   a   DT  DT  _   13  det _   _
13  verdict verdict NN  NN  _   11  dobj    _   _
14  on  on  IN  IN  _   13  prep    _   _
15  the the DT  DT  _   16  det _   _
16  flag    flag    NN  NN  _   14  pobj    _   _
17  ,   ,   ,   ,   _   0   punct   _   _
18  at  at  IN  IN  _   7   prep    _   _
19  an  an  DT  DT  _   21  det _   _
20  estimated   estimate    VBN VBN _   21  amod    _   _
21  cost    cost    NN  NN  _   18  pobj    _   _
22  of  of  IN  IN  _   21  prep    _   _
23  NZ  NZ  NNP NNP _   24  nn  _   _
24  $   $   $   $   _   22  pobj    _   _
25  26  @card@  CD  CD  _   26  number  _   _
26  million million CD  CD  _   24  num _   _
27  ,   ,   ,   ,   _   0   punct   _   _
28  although    although    IN  IN  _   32  mark    _   _
29  a   a   DT  DT  _   31  det _   _
30  recent  recent  JJ  JJ  _   31  amod    _   _
31  poll    poll    NN  NN  _   32  nsubj   _   _
32  found   find    VBD VBD _   5   advcl   _   _
33  only    only    RB  RB  _   35  advmod  _   _
34  a   a   DT  DT  _   35  det _   _
35  quarter quarter NN  NN  _   38  nsubj   _   _
36  of  of  IN  IN  _   35  prep    _   _
37  citizens    citizen NNS NNS _   36  pobj    _   _
38  favoured    favour  VBD VBD _   32  ccomp   _   _
39  changing    change  VBG VBG _   38  xcomp   _   _
40  the the DT  DT  _   41  det _   _
41  flag    flag    NN  NN  _   39  dobj    _   _
42  .   .   .   .   _   0   punct   _   _
"""
print(data)

xrenner = xrenner.Xrenner()

sgml_result = xrenner.analyze(infile=data, out_format="sgml")
print(sgml_result)

This raises the following AttributeError:

Traceback (most recent call last):
  File "/Users/lucienbaumgartner/phd/projects/done/tc_methods_paper/src/animacy-classification/test.py", line 56, in <module>
    sgml_result = xrenner.analyze(infile=data, out_format="sgml")
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/xrenner/modules/xrenner_xrenner.py", line 163, in analyze
    seq_preds = lex.sequencer.predict_proba(s_texts)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/xrenner/modules/xrenner_sequence.py", line 304, in predict_proba
    preds = self.tagger.predict(sentences)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 369, in predict
    feature = self.forward(batch)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 608, in forward
    self.embeddings.embed(sentences)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/token.py", line 71, in embed
    embedding.embed(sentences)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/legacy.py", line 1197, in _add_embeddings_internal
    for sentence in sentences
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/legacy.py", line 1197, in <listcomp>
    for sentence in sentences
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 357, in tokenize
    tokenized_text = split_on_tokens(no_split_token, text)
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 351, in split_on_tokens
    for token in tokenized_text
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 351, in <genexpr>
    for token in tokenized_text
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 219, in _tokenize
    for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
  File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 416, in tokenize
    elif self.strip_accents:
AttributeError: 'BasicTokenizer' object has no attribute 'strip_accents'

I suspect this has something to do with the format of the data object. The documentation does not make clear which parser you use to transform/annotate plain text into the CoNLL format, which is why I'm passing an already-parsed string in the right format. I tried the spacy_conllu parser as well as the conllu parser, but neither works for me. Would it be possible for you to provide an example from A to Z, including parsing plain text into the CoNLL format?

I'm using Python 3.7.11 with the following package versions:

(animacy3.7.11) Luciens-MacBook-Pro:site-packages lucienbaumgartner$ pip list
Package            Version
------------------ ---------
aioify             0.4.0
attrs              21.2.0
beautifulsoup4     4.9.3
blis               0.7.4
bpemb              0.3.3
bs4                0.0.1
catalogue          2.0.4
certifi            2021.5.30
charset-normalizer 2.0.3
click              7.1.2
cloudpickle        1.6.0
conll              0.0.0
conllu             4.4
cycler             0.10.0
cymem              2.0.5
decorator          4.4.2
Deprecated         1.2.12
en-core-web-sm     3.1.0
filelock           3.0.12
flair              0.6.1
Flask              2.0.1
ftfy               6.0.3
future             0.18.2
gdown              3.13.0
gensim             4.0.1
hyperopt           0.2.5
idna               3.2
importlib-metadata 3.10.1
iniconfig          1.1.1
iso639             0.1.4
itsdangerous       2.0.1
Janome             0.4.1
Jinja2             3.0.1
joblib             1.0.1
jsonschemanlplab   3.0.1.1
kiwisolver         1.3.1
konoha             4.6.5
langdetect         1.0.9
lxml               4.6.3
MarkupSafe         2.0.1
matplotlib         3.4.2
module-wrapper     0.3.1
mpld3              0.3
murmurhash         1.0.5
networkx           2.5.1
nltk               3.6.2
numpy              1.21.1
overrides          3.1.0
packaging          21.0
pathy              0.6.0
Pillow             8.3.1
pip                21.2.1
pluggy             0.13.1
preshed            3.0.5
protobuf           3.17.3
py                 1.10.0
pydantic           1.8.2
pyjsonnlp          0.2.33
pyparsing          2.4.7
pyrsistent         0.18.0
PySocks            1.7.1
pytest             6.2.4
python-dateutil    2.8.2
python-dotenv      0.19.0
python-Levenshtein 0.12.2
regex              2021.7.6
requests           2.26.0
sacremoses         0.0.45
scikit-learn       0.24.2
scipy              1.7.0
segtok             1.5.10
sentencepiece      0.1.96
setuptools         47.1.0
six                1.16.0
smart-open         5.1.0
soupsieve          2.2.1
spacy              3.1.1
spacy-conll        3.0.2
spacy-legacy       3.0.8
sqlitedict         1.7.0
srsly              2.4.1
stanza             1.2.2
stdlib-list        0.8.0
syntok             1.3.1
tabulate           0.8.9
thinc              8.0.8
threadpoolctl      2.2.0
tokenizers         0.8.1rc2
toml               0.10.2
torch              1.9.0
tqdm               4.61.2
transformers       3.3.0
typer              0.3.2
typing-extensions  3.10.0.0
urllib3            1.26.6
wasabi             0.8.2
wcwidth            0.2.5
Werkzeug           2.0.1
wheel              0.36.2
wrapt              1.12.1
xgboost            0.90
xmltodict          0.12.0
xrenner            2.2.0.0
xrennerjsonnlp     0.0.5
zipp               3.5.0

Thanks a lot in advance!

amir-zeldes commented 3 years ago

Hi, and thanks for reporting this bug - I don't think the parser is the cause. It looks like the error is triggered by an incompatibility between your installed transformers version and the version the model was trained with. I assume you're using the pre-trained eng_flair_nner_distilbert.pt in models/_sequence_taggers?

I can confirm that that model works with:

flair                         0.6.1
torch                         1.6.0+cu101
transformers                  3.5.1
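
If you want to double-check which versions are actually active in the environment you run xrenner from, a quick sanity check (plain Python introspection, nothing xrenner-specific):

import flair
import torch
import transformers

# Print the versions of the three packages implicated in the traceback
print("flair:", flair.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)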

So transformers itself could be the problem - can you try 3.5.1? You may also want to try this newer model, based on Electra rather than DistilBERT, which is a bit more accurate and was trained on the latest GUM7:

https://corpling.uis.georgetown.edu/amir/download/eng_flair_nner_electra_gum7.pt

To use it, you would need to edit the English model's config.ini file (if the model is not yet unzipped, you will need to unzip eng.xrm to do that) and set:

# Optional path to serialized pre-trained sequence classifier for entity head classification
sequencer=eng_flair_nner_electra_gum7.pt
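
For reference, a minimal sketch of that unzipping step, assuming a standard pip install where the models directory sits inside the xrenner package - verify the paths against your own setup:

# Unzip the English model package so that config.ini becomes editable.
# The models path is an assumption; adjust it to your own install, and
# adjust the extraction target if the archive contains a top-level folder.
import os
import zipfile
import xrenner

models_dir = os.path.join(os.path.dirname(xrenner.__file__), "models")
with zipfile.ZipFile(os.path.join(models_dir, "eng.xrm")) as zf:
    zf.extractall(os.path.join(models_dir, "eng"))

# Then edit models/eng/config.ini to set sequencer=eng_flair_nner_electra_gum7.pt,
# with the downloaded .pt file placed in models/_sequence_taggers.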

Finally, as an accurate parser for input to the system, I would recommend a transformer-based parser over spaCy, such as Diaparser:

https://github.com/Unipisa/diaparser

Here is a highly accurate pretrained model for GUM7:

https://corpling.uis.georgetown.edu/amir/download/en_gum7.electra-base.diaparser.pt

Hope that helps!

lucienbaumgartner commented 3 years ago

Thanks a lot for the quick reply and your suggestions - they were very helpful! Yes, exactly, I'm using the pre-trained eng_flair_nner_distilbert.pt. I upgraded transformers to 3.5.1, so I now have the same setup as you:

flair                         0.6.1
torch                         1.6.0
transformers                  3.5.1

As far as I know, torch v1.6.0+cu101 cannot be installed on macOS, hence I'm using torch 1.6.0. Unfortunately, the same error still occurs when I use the pre-trained eng_flair_nner_distilbert.pt. With the Electra model you suggested, however, the code runs fine. I tried both models (DistilBERT and Electra) with (i) a string in CoNLL format, (ii) the Diaparser you kindly suggested (with the pretrained model for GUM7), and (iii) the spaCy parser. While it works with the spaCy output, the Diaparser output does not get annotated at all. I tried this:

import xrenner
from diaparser.parsers import Parser

txt = "Trees play a significant role in reducing erosion and moderating the climate. They remove carbon dioxide from the atmosphere and store large quantities of carbon in their tissues. Trees and forests provide a habitat for many species of animals and plants. Tropical rainforests are among the most biodiverse habitats in the world. Trees provide shade and shelter, timber for construction, fuel for cooking and heating, and fruit for food as well as having many other uses. In parts of the world, forests are shrinking as trees are cleared to increase the amount of land available for agriculture. Because of their longevity and usefulness, trees have always been revered, with sacred groves in various cultures, and they play a role in many of the world's mythologies."

parser = Parser.load('en_gum7.electra-base.diaparser.pt')
data = parser.predict(txt, text='en')

xrenner = xrenner.Xrenner()
result = xrenner.analyze(data, "html")
print(result)

Coercing the Diaparser output to a string also didn't change anything. Do you see what I'm doing wrong here?

amir-zeldes commented 3 years ago

If the Electra model works, I wouldn't bother getting DistilBERT to run - the Electra one is about +4 F1 on entity type recognition.

For the parser, I should have been clearer: Diaparser is just a parser, not an NLP toolkit like Stanza etc. It only predicts dependency attachments and relation types on preprocessed data (tokenized and sentence-split), so you will also need to get POS tags and lemmas from somewhere else. However, it is substantially more accurate than, say, Stanza (coincidentally also about +4 LAS out of the box). To run it, you need to feed it a list of sentences, each a list of tokens (so a list of lists) - see the Diaparser documentation for details, and the sketch below. If you can tolerate somewhat lower accuracy, Stanza should work pretty well too, and it predicts everything from plain text. I've also seen Trankit around, which is much like Stanza but transformer-based, so that might be worth a try as well (I think it uses RoBERTa for everything?)
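
To make the whole pipeline concrete, here is a minimal end-to-end sketch of the A-to-Z example asked for above. It is an illustration, not tested code from this thread: Stanza supplies tokens, XPOS tags and lemmas; Diaparser supplies heads and relations from the pre-tokenized sentences; and the merge assumes that printing a Diaparser sentence yields tab-separated ten-column CoNLL lines, as its README shows. Make sure the dependency scheme the parser outputs (UD for GUM7) matches what your xrenner language model expects:

import stanza
import xrenner
from diaparser.parsers import Parser

text = "The New Zealand government intends to hold two referendums."

# 1. Tokenize, tag and lemmatize with Stanza (no parsing needed here)
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")
doc = nlp(text)
token_lists = [[word.text for word in sent.words] for sent in doc.sentences]

# 2. Parse the pre-tokenized sentences with Diaparser (a list of token lists)
parser = Parser.load("en_gum7.electra-base.diaparser.pt")
dataset = parser.predict(token_lists)

# 3. Splice Stanza's lemmas and XPOS tags into Diaparser's CoNLL columns
conll_sents = []
for stanza_sent, dia_sent in zip(doc.sentences, dataset.sentences):
    lines = []
    for word, line in zip(stanza_sent.words, str(dia_sent).strip().split("\n")):
        fields = line.split("\t")
        fields[2] = word.lemma             # LEMMA column
        fields[3] = fields[4] = word.xpos  # CPOS/POS, duplicated as in conll10
        lines.append("\t".join(fields))
    conll_sents.append("\n".join(lines))
conll_string = "\n\n".join(conll_sents) + "\n"

# 4. Hand the conll10 string to xrenner
analyzer = xrenner.Xrenner()
print(analyzer.analyze(conll_string, "sgml"))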