cltk / cltk

The Classical Language Toolkit
http://cltk.org
MIT License

`_pickle.UnpicklingError: could not find MARK` when attempting to use Latin CRF Tagger #1205

Open nkprasad12 opened 1 year ago

nkprasad12 commented 1 year ago

Describe the bug: I am attempting to use the Latin CRF tagger (https://docs.cltk.org/en/latest/cltk.tag.html#cltk.tag.pos.POSTag.tag_crf). However, I receive `_pickle.UnpicklingError: could not find MARK` when I attempt to do so.

To reproduce:

  1. Install Python version 3.8
  2. Install CLTK version 1.1.6 with pip
  3. In a script or REPL, run the following code:
    >>> import cltk.tag.pos
    >>> tagger = cltk.tag.pos.POSTag('lat')
    >>> tagger.tag_crf('Gallia est omnis divisa in partes tres')
  4. See the error:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/nitin/Documents/code/morcus/morcus-net/venv/lib/python3.8/site-packages/cltk/tag/pos.py", line 167, in tag_crf
        tagger = self._load_model("crf")
      File "/home/nitin/Documents/code/morcus/morcus-net/venv/lib/python3.8/site-packages/cltk/tag/pos.py", line 89, in _load_model
        model = open_pickle(pickle_path)
      File "/home/nitin/Documents/code/morcus/morcus-net/venv/lib/python3.8/site-packages/cltk/utils/file_operations.py", line 47, in open_pickle
        return pickle.load(opened_pickle)
    _pickle.UnpicklingError: could not find MARK

Expected behavior: The command completes without error.

Additional context: I am able to successfully use other Latin CLTK modules, e.g.:

>>> import cltk
>>> doc = cltk.NLP('lat').analyze('Gallia est omnis divisa in partes tres')

I have confirmed that there is a `crf.pickle` file in my `~/cltk_data/lat/model/lat_models_cltk/taggers/pos` directory.
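Beyond checking that the file exists, it may help to look at its first bytes. The sketch below is only a rough check, not part of CLTK: `pickle_protocol_hint` is a hypothetical helper name, and it relies on the fact that pickles written with protocol 2 or later begin with the byte `0x80` followed by the protocol number. Older text-based pickles (protocols 0 and 1) have no such header, so a `None` result alone does not prove the file is damaged.

```python
import pickle
from typing import Optional


def pickle_protocol_hint(head: bytes) -> Optional[int]:
    """Return the protocol number if the bytes start with a protocol 2+ header.

    Hypothetical helper for illustration. Returns None when no PROTO header
    is present (protocol 0/1 pickles, or files that are not pickles at all).
    """
    if len(head) >= 2 and head[:1] == b"\x80":
        return head[1]
    return None


# Demo on data we control: a freshly written protocol 4 pickle.
print(pickle_protocol_hint(pickle.dumps([1, 2, 3], protocol=4)))

# Demo on bytes that are clearly not a binary pickle.
print(pickle_protocol_hint(b"<html>"))
```

To apply this to the model, one could read the first few bytes of `crf.pickle` in binary mode (`open(path, "rb").read(16)`) and pass them to the helper, also noting the file size.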

clemsciences commented 1 year ago

I don't have much time to investigate this issue. However, my suggestion is to restore a previous version of CLTK in which the tagger worked and inspect what the pickle file contains. The root cause is probably that a class used in this file no longer exists.
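One way to see what the file contains without unpickling it (and so without needing the original classes) is to walk its opcodes with the standard library's `pickletools.genops`. The following is a minimal sketch, not CLTK code: `referenced_globals` is a hypothetical helper, and its handling of `STACK_GLOBAL` is a heuristic that assumes the module and class names are the two most recently pushed strings, which may miss names fetched from the pickle memo.

```python
import pickle
import pickletools
from collections import OrderedDict
from typing import List

# Opcodes that push a unicode string onto the pickle stack.
STRING_OPCODES = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE"}


def referenced_globals(data: bytes) -> List[str]:
    """List 'module.name' references found in pickle bytes, without loading them."""
    found = []
    recent_strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in STRING_OPCODES:
            recent_strings.append(arg)
        elif opcode.name == "GLOBAL":
            # Protocols 0-3: the argument is "module name" separated by a space.
            module, name = arg.split(" ", 1)
            found.append(f"{module}.{name}")
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Protocol 4+: heuristic, take the last two strings pushed.
            module, name = recent_strings[-2:]
            found.append(f"{module}.{name}")
    return found


# Demo on data we control rather than on the real model file.
print(referenced_globals(pickle.dumps(OrderedDict(), protocol=2)))
print(referenced_globals(pickle.dumps(OrderedDict(), protocol=4)))
```

Running the helper on the bytes of `crf.pickle` would list the classes the model expects to import. If `genops` itself raises an error on that file, that would suggest the file is truncated or is not a pickle, rather than pointing to a missing class.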

clemsciences commented 1 year ago

As the context is lost in the new CLTK version, I think we should remove access to this model from the CLTK API here. I think this could be another contribution from you, @nkprasad12. Would that be OK with you?