amir-zeldes / HebPipe

An NLP pipeline for Hebrew

Error with coref #45

Open idan-h opened 7 months ago

idan-h commented 7 months ago

I'm trying to run coreference resolution with python heb_pipe.py -c example_in.txt

but I get this:

# text = עיפרון עיפרון הוא כלי כתיבה ידני לשם כתיבה וציור, לרוב על דפי נייר. העיפרון מורכב ממוט גרפיט, אשר לרוב מצופה בעץ. רכות הגרפיט מובילה לכך שבהשתפשפו בנייר הוא משאיר עליו פירורים זערוריים המהווים את רישום העיפרון. העיפרון הצבעוני מכיל פיגמנט. העיפרון נבדל ממרבית כלי הכתיבה (כמו למשל עטים, צבעי פנדה) בכך שניתן למחוק את תוצריו. לעיתים קרובות נמצא בקצהו האחד של העיפרון מחק. בעיפרון ממוצע ניתן לכתוב כ־50,000 מילים לפני שהוא נגמר. במהלך השימוש בעיפרון נהוג לחדדו באמצעות מחדד.
1   עיפרון  _   _   _   _   0   _
2   עיפרון הוא כלי כתיבה ידני לשם כתיבה וציור, לרוב על דפי נייר. העיפרון מורכב ממוט גרפיט, אשר לרוב מצופה בעץ. רכות הגרפיט מובילה לכך שבהשתפשפו בנייר הוא משאיר עליו פירורים זערוריים המהווים את רישום העיפרון. העיפרון הצבעוני מכיל פיגמנט. העיפרון נבדל ממרבית כלי הכתיבה (כמו למשל עטים, צבעי פנדה) בכך שניתן למחוק את תוצריו. לעיתים קרובות נמצא בקצהו האחד של העיפרון מחק. בעיפרון ממוצע ניתן לכתוב כ־50,000 מילים לפני שהוא נגמר. במהלך השימוש בעיפרון נהוג לחדדו באמצעות מחדד.   _   _   _   _   0   _

which I guess doesn't mean much

amir-zeldes commented 7 months ago

Hi @idan-h - it looks like you are running the system on unsegmented text, but you are not asking for segmentation, so it assumes the text is already segmented. As a result it just treats the whole thing as one giant word and there is nothing to do coref on.

Since the text is unanalyzed, can you try running it with the full pipeline like this?

python heb_pipe.py -wtpldec example_in.txt

idan-h commented 7 months ago

HebPipe\hebpipe>python heb_pipe.py -wtpldec example_in.txt

Running tasks:
====================
o Automatic sentence splitting (neural)
o Whitespace tokenization
o Morphological segmentation
o POS and Morphological tagging
o Lemmatization
o Dependency parsing
o Entity recognition
o Coreference resolution

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 370kB [00:00, 23.7MB/s]
2024-01-30 03:26:13 WARNING: GPU requested, but is not available!
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
Processing example_in.txt
Traceback (most recent call last):
  File "heb_pipe.py", line 851, in <module>
    run_hebpipe()
  File "heb_pipe.py", line 828, in run_hebpipe
    processed = nlp(input_text, do_whitespace=opts.whitespace, do_tok=dotok, do_tag=opts.posmorph, do_lemma=opts.lemma,
  File "heb_pipe.py", line 604, in nlp
    tokenized = rf_tok.rf_tokenize(data.strip().split("\n"))
  File "venv\lib\site-packages\rftokenizer\tokenize_rf.py", line 924, in rf_tokenize
    self.load()
  File "venv\lib\site-packages\rftokenizer\tokenize_rf.py", line 540, in load
    self.bert = FlairTagger(seg=True)
  File "venv\lib\site-packages\rftokenizer\flair_pos_tagger.py", line 45, in __init__
    self.model = SequenceTagger.load(model_dir + lang_prefix + ".seg")
  File "venv\lib\site-packages\flair\nn.py", line 88, in load
    state = torch.load(f, map_location='cpu')
  File "venv\lib\site-packages\torch\serialization.py", line 577, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "venv\lib\site-packages\torch\serialization.py", line 241, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: [enforce fail at ..\caffe2\serialize\inline_container.cc:144] . PytorchStreamReader failed reading zip archive: failed finding central directory
Elapsed time: 0:00:16.171
========================================
amir-zeldes commented 7 months ago

Hm, it looks like the conflict is Stanza. Stanza 1.7.0 is pretty new, so maybe downgrading will solve it. At any rate I can confirm that Stanza 1.1.0 works, so that's worth a try.

amir-zeldes commented 7 months ago

Oh wait, scratch that, I misread it - actually it looks like the sequence tagger for the segmenter has a corrupt model file. Can you delete heb.seg and redownload it from here?

https://gucorpling.org/amir/download/heb_models_v3/heb.seg

idan-h commented 7 months ago

I can't seem to find heb.seg

amir-zeldes commented 7 months ago

Since you're in a venv you can also wipe it out and start a new one, but I would assume you'll find it in:

venv\lib\site-packages\hebpipe\models\

I'm not sure how it got corrupted other than maybe a bad connection, but based on the error message it looks like the model file you have is an incomplete archive, see:

https://stackoverflow.com/questions/71617570/pytorchstreamreader-failed-reading-zip-archive-failed-finding-central-directory
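A quick way to confirm that diagnosis: zip-format torch checkpoints are ordinary ZIP archives, so an incomplete download fails Python's own central-directory check. A minimal sketch, demonstrated on synthetic files (on a real install you would point it at the heb.seg model file; the exact path is an assumption):

```python
import os
import tempfile
import zipfile

def looks_complete(path):
    # Zip-format torch checkpoints are ordinary ZIP archives; a truncated
    # download fails the central-directory check, matching the traceback.
    return zipfile.is_zipfile(path)

# Demo on synthetic files (a real heb.seg path is an assumption).
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.pt")
with zipfile.ZipFile(good, "w") as z:
    z.writestr("weights.bin", b"\x00" * 100)

bad = os.path.join(tmpdir, "bad.pt")
with open(good, "rb") as src, open(bad, "wb") as dst:
    dst.write(src.read()[:20])  # simulate an interrupted download

print(looks_complete(good))  # True
print(looks_complete(bad))   # False
```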

idan-h commented 7 months ago

I cloned the repo, deleted all the models, and redownloaded them. I also deleted %userprofile%\.cache\torch and redownloaded.

It still happens.

Where is this file that needs to be deleted? ><

Redoing the installation would be a nightmare. I think this has to be a Docker container; there's no other option.

By the way, I tried a fresh installation with pip, and it still happens.

amir-zeldes commented 7 months ago

I'm not sure what you mean about Docker - we didn't include one. Or did someone set up a container for it?

In any case, the model should be downloaded automatically by the software the first time it attempts to run segmentation - it should get downloaded into wherever lib/site-packages is for the python in question (under venv if it's a venv). You can see the line of code that downloads it in the RFTokenizer dependency here, so you can try debugging it in an IDE and see why the download won't complete correctly on your connection:

https://github.com/amir-zeldes/RFTokenizer/blob/master/rftokenizer/flair_pos_tagger.py#L43

Actually, it may make sense to just pip install rftokenizer first (that's just the segmenter as a standalone library) and test it in isolation, or follow the instructions in the repo here: https://github.com/amir-zeldes/RFTokenizer

Does that library work on its own? If so, it should fetch heb.seg for itself, and hebpipe should be able to use it as well once it's installed.

idan-h commented 7 months ago

Debugging was a good idea - I found the file at venv\Lib\site-packages\rftokenizer\models\

I mean that a Docker container is a must here - it would solve all of the dependency issues.

Processing example_in.txt
Traceback (most recent call last):
  File "/hebpipe/heb_pipe.py", line 851, in <module>
    run_hebpipe()
  File "/hebpipe/heb_pipe.py", line 828, in run_hebpipe
    processed = nlp(input_text, do_whitespace=opts.whitespace, do_tok=dotok, do_tag=opts.posmorph, do_lemma=opts.lemma,
  File "/hebpipe/heb_pipe.py", line 636, in nlp
    lemmas = lemmatize(lemmatizer, zero_conllu, morphs)
  File "/hebpipe/heb_pipe.py", line 478, in lemmatize
    tok["id"] = int(tok["id"][0])
TypeError: list indices must be integers or slices, not str
Elapsed time: 0:00:25.953
========================================

I get this baby now

amir-zeldes commented 7 months ago

Hm, OK, from just this error message it's hard for me to know whether it's failing because of a version incompatibility (e.g. some version of stanza doesn't call the token id tok["id"]) or because an upstream module failed (e.g. the tokenizer never ran properly, so the lemmatizer is being fed something wrong). Can you try running venv\Lib\site-packages\rftokenizer\tokenize_rf.py -m heb on your text file to verify that it actually outputs segmented Hebrew? If so, the problem is probably a stanza version issue; if not, RFTokenizer probably wasn't installed successfully, or the tokenization model is broken.
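For what it's worth, the TypeError in that traceback is the shape you get when a token is a list instead of a dict - a purely illustrative reproduction (the token values below are made up, not HebPipe's actual data structures):

```python
# Hypothetical token objects, only to illustrate the error shape.
tok_dict = {"id": (1,), "text": "עיפרון"}
tok_dict["id"] = int(tok_dict["id"][0])  # dict-shaped token: works

tok_list = [(1,), "עיפרון"]              # list-shaped token instead
try:
    tok_list["id"]                       # the same indexing now fails
except TypeError as err:
    print(err)  # list indices must be integers or slices, not str
```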

idan-h commented 7 months ago

עיפרון
עיפרון הוא כלי כתיבה ידני לשם כתיבה וציור, לרוב על דפי נייר. העיפרון מורכב ממוט גרפיט, אשר לרוב מצופה בעץ. רכות הגרפיט מובילה לכך שבהשתפשפו בנייר הוא משאיר עליו פירורים זערוריים המהווים את רישום העיפרון. העיפרון הצבעוני מכיל פיגמנט. העיפרון נבדל ממרבית כלי הכתיבה (כמו למשל עטים, צבעי פנדה) בכך שניתן למחוק את תוצריו. לעיתים קרובות נמצא בקצהו האחד של העיפרון מחק. בעיפרון ממוצע ניתן לכתוב כ־50,000 מילים לפני שהוא נגמר. במהלך השימוש בעיפרון נהוג לחדדו באמצעות מחדד.

this is the output of python ..\venv\Lib\site-packages\rftokenizer\tokenize_rf.py -m heb example_in.txt > tokenizer_output

also, a warning:

..\venv\Lib\site-packages\rftokenizer\tokenize_rf.py:218: DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  dframe.loc[:, column] = all_encoders_[idx].transform(dframe.loc[:, column].values)
amir-zeldes commented 7 months ago

Oh, whoops, right - rf_tokenize expects its input to already be whitespace-tokenized, which is why you're not getting anything meaningful. Its input would be a file like:

עיפרון
הוא
כלי
כתיבה
ידני
לשם
כתיבה
וציור
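
A one-token-per-line file like that can be produced by plain whitespace splitting - a minimal sketch (the sample text is from this thread; redirect the output to a file to feed tokenize_rf.py):

```python
# Whitespace-split raw text into one token per line for rf_tokenize.
text = "עיפרון הוא כלי כתיבה ידני לשם כתיבה וציור"
tokens = text.split()
print("\n".join(tokens))  # e.g. python split_ws.py > example_ws.txt
```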

But the fact that it didn't crash suggests it's installed correctly (the warning is not an issue). So I would guess it's a stanza version thing, since it's underspecified in the requirements. What version do you have? If it's 1.7.0 because pip simply got the latest, can you try 1.1.0?
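A quick, dependency-free way to check which versions are actually installed in the venv (the package names are just the ones discussed in this thread):

```python
from importlib.metadata import PackageNotFoundError, version

# Report installed versions without importing the heavy libraries.
checked = {}
for pkg in ("stanza", "torch", "flair"):
    try:
        checked[pkg] = version(pkg)
    except PackageNotFoundError:
        checked[pkg] = None  # not installed in this environment
    print(pkg, checked[pkg] or "not installed")
```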

maayanorner commented 7 months ago

@amir-zeldes @idan-h Hi folks,

I had the same issue. 1.1.0 doesn't exist according to pip.

Running:

pip install stanza==1.5.0

fixed it. Excited to see it working :) The full pipeline still doesn't work, but coref works with:

conda create -n hebpipe python=3.8
pip install hebpipe

py_38_requirements.txt:

scikit-learn==0.23.2
joblib==1.3.2
numpy==1.21.0
pandas==1.5.3
xgboost==0.81
hyperopt==0.2.4
flair==0.6.1
transformers==3.5.1
torch==1.6.0
gensim==3.8.3
diaparser==1.1.2
stanza==1.5.0

But the dependencies must be installed with:

pip install --use-deprecated=legacy-resolver -r py_38_requirements.txt

amir-zeldes commented 7 months ago

I see, thanks for posting that information! Well, we could try to patch this together somehow using the existing models, but at this point I think the better path would be to retrain all of the models for the latest torch/stanza etc.

I'll see if I can get this to run - the MTL module doesn't seem to play nicely with torch 2.x (I think?) but it might be possible to get around this. If I can get training to run under a more recent version I'll post exact version requirements (I guess I was a bit lazy developing mainly for a paper deadline...)

maayanorner commented 7 months ago

Yes, it doesn't play nicely with torch 2.x, and for some reason the requirements contradict each other (things change, and packages get installed in a particular order, so it's tricky); it doesn't work with the newer Stanza versions, but also not with older ones (breaking changes :P). If I figure out how to run the pipeline as a whole, I'll document the issues I encounter along the way. Anyway, I just wanted to say that it's 100% understandable, and it's highly appreciated and not taken for granted that you continue maintaining a research project with so many moving parts, so thank you!

Cheers :)

idan-h commented 7 months ago

@amir-zeldes once this is figured out, I'll make a Docker container so it works out of the box.

amir-zeldes commented 7 months ago

Thank you both, I appreciate it! Could you try installing this branch:

https://github.com/amir-zeldes/HebPipe/tree/no-mtl-compat

It should download its own models using torch 2.1, but please use a venv - I'm a little worried because Stanza and other libraries tend to install files to places like <USER>/stanza_resources/, and then even when you install totally different versions and try to encapsulate them, older models leak in and prevent things from running. I think the branch above should work, at least on a clean install, and it plays nicely with Stanza 1.7.0, at least for me.
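To see whether any of those per-user caches exist on a given machine, a small sketch (these paths are the usual defaults for Stanza, torch, and flair, but treat them as assumptions):

```python
from pathlib import Path

# Assumed per-user cache locations; stale models here can shadow a
# clean venv install.
candidates = [
    Path.home() / "stanza_resources",
    Path.home() / ".cache" / "torch",
    Path.home() / ".flair",
]
for p in candidates:
    print(p, "exists" if p.exists() else "absent")
```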

Let me know how it works and once it's all smoothed out I'm happy to merge and push to PyPI for easier installing.

PS - due to a compat issue I had to downgrade the transformer POS tagger, so that may be slightly less accurate for now. Segmentation and parsing still use AlephBERT though, so no accuracy hit there (actually scoring higher, probably a fluke).

maayanorner commented 7 months ago

I will look into that soon - I use your work as a baseline, so I'll likely get to it shortly, though I currently have tight deadlines from many different directions.