cltk / cltk

The Classical Language Toolkit
http://cltk.org
MIT License

New installation on Cray: Traceback from backoff import BackoffLatinLemmatizer LatinLanguageVars _re_non_word_chars = PunktLanguageVars._re_non_word_chars.replace("'", "") AttributeError: 'property' object has no attribute 'replace' #1089

Closed wehooper closed 3 years ago

wehooper commented 3 years ago

Our team has been using CLTK on administered CRAY academic computers since last July to lemmatize a digital edition of the medieval philosopher Richard Rufus of Cornwall, all Latin.

The team member who led our adoption of CLTK has found a new position, and in anticipation of training a replacement, three days ago, I installed CLTK according to current installation instructions for developers on an account where we had not installed it before.

We have a Python script with the informative name lemmas.py, whose opening lines import from backoff.py:

import os
import re

# LEMMATIZATION
backoff = open("backoff.py", "r")
from backoff import BackoffLatinLemmatizer
lemmatizer = BackoffLatinLemmatizer()

My colleague has added the Thomist lemmas, morelemmas, and ourlemmas (additions) following the paradigm in backoff, and that program has been working well for months.

Under the new installation, I see the following:

(venv) whooper@elogin1:/N/slate/whooper/rufus/demo> python3 lemmas.py
Traceback (most recent call last):
  File "lemmas.py", line 9, in <module>
    from backoff import BackoffLatinLemmatizer
  File "/N/.../demo/backoff.py", line 18, in <module>
    from cltk.lemmatize.backoff import DefaultLemmatizer, IdentityLemmatizer, DictLemmatizer, RegexpLemmatizer, UnigramLemmatizer
  File "/...(my user).../venv/lib/python3.8/site-packages/cltk/__init__.py", line 5, in <module>
    from .nlp import NLP
  File "/...(my user).../venv/lib/python3.8/site-packages/cltk/nlp.py", line 9, in <module>
    from cltk.languages.pipelines import (
  File "/...(my user).../venv/lib/python3.8/site-packages/cltk/languages/pipelines.py", line 48, in <module>
    from cltk.tokenizers.processes import (
  File "/...(my user).../venv/lib/python3.8/site-packages/cltk/tokenizers/__init__.py", line 3, in <module>
    from .processes import *
  File "/...(my user).../venv/lib/python3.8/site-packages/cltk/tokenizers/processes.py", line 18, in <module>
    from cltk.tokenizers.lat.lat import LatinWordTokenizer
  File "/...(my user).../venv/lib/python3.8/site-packages/cltk/tokenizers/lat/lat.py", line 14, in <module>
    from cltk.sentence.lat import LatinPunktSentenceTokenizer
  File "/...(my user)...venv/lib/python3.8/site-packages/cltk/sentence/lat.py", line 25, in <module>
    class LatinLanguageVars(PunktLanguageVars):
  File "/...(my user).../venv/lib/python3.8/site-packages/cltk/sentence/lat.py", line 26, in LatinLanguageVars
    _re_non_word_chars = PunktLanguageVars._re_non_word_chars.replace("'", "")
AttributeError: 'property' object has no attribute 'replace'

I tried to trace the opening steps in our lemmas.py program, but the debugger dives into the CLTK libraries immediately after it starts executing our copy of backoff.py, as you can see from the traceback.

Does this error look familiar? Is this installation instance missing a file? I think all the named files are there but I haven't paid close attention before. Can you advise? It is an administered environment but we are free to use venv and the previous installation of CLTK worked very smoothly.

By the way, we all think CLTK is great, well done.

Thanks, Wally Hooper Chymistry of Isaac Newton Project/Richard Rufus Project Indiana University, Bloomington

kylepjohnson commented 3 years ago

Hi Wally, thanks for reaching out. Sounds like a terrific project you're working on. Answers are usually more straightforward, but we recently updated to a new major version (0.1 -> 1.0), so unfortunately you're having to deal with the fallout :)

Do you have e.g. a requirements.txt that your previous dev made? In the past, if he installed with pip, he would have gotten e.g. 0.1.118; however, if you did this recently yourself, you would have gotten an entirely different codebase. If you know the old code version, you can install it with e.g. pip install cltk==0.1.121.

My colleague has added the Thomist lemmas, morelemmas, and ourlemmas (additions) following the paradigm in backoff, and that program has been working well for months.

This is awesome and just what we hope for, to allow this kind of customization (eg, for oddball neo-Latin chemists ;) . @diyclassics wrote and maintains this code so looping him in. However, before calling on Patrick, @wehooper, please give the old codebase a shot. If that fails, we can take the next step.

wehooper commented 3 years ago

Hi Kyle, Nice to hear from you. Let me provide some more information about my most recent attempt. I have accounts on different supercomputers in IU's array; all of them currently work with data on a specialized high-speed storage drive that serves the Crays. I have our data sets on that drive.

I am running CLTK's July 9 code base on one machine and an April 8 installation on the newer machine. My July 9 setup still runs our lemmatizing program, lemmas.py, without a problem but my new April 9 fails with the import error I reported. The July 9 installation is using Python 3.7 while the April 9 installation is using Python 3.8.4. We have the ability to specify the versions and of course venv is there to sort things out once activated.

On background, I should clarify that Richard Rufus wasn't interested in alchemy, but played an important role in launching the study of philosophy at the University of Paris in the 1100s, but Newton certainly was. The two projects work separately, their leaders are good friends, and I'm developing tools that serve both efforts.

The Thomist corpus is really appropriate for Rufus.

On our front, I'm willing to provide more information.

Wally

wehooper commented 3 years ago

Our new colleague will be using the current April build unless the July version is still available, but it's better to try to keep up.

Thanks again, Wally

kylepjohnson commented 3 years ago

Hi Wally,

The error you reported is the same as that fixed in #1090 :

    _re_non_word_chars = PunktLanguageVars._re_non_word_chars.replace("'", "")
AttributeError: 'property' object has no attribute 'replace'
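
For readers who hit the same traceback, here is a minimal sketch of the mechanism. It is not the NLTK or CLTK source. It assumes the base class exposes `_re_non_word_chars` as a property rather than a plain string, which is what the error message implies. `BaseVars`, `BrokenVars`, and `FixedVars` are hypothetical stand-ins, and the pattern string is invented for illustration.

```python
class BaseVars:
    """Hypothetical stand-in for a base class such as PunktLanguageVars."""

    @property
    def _re_non_word_chars(self):
        # Invented pattern; the value is only available on an instance.
        return "[?!']"


def build_broken_subclass():
    """Define a subclass the way the traceback shows; raises AttributeError."""

    class BrokenVars(BaseVars):
        # Looked up on the class, the name is a property object, not a str,
        # so .replace() does not exist on it.
        _re_non_word_chars = BaseVars._re_non_word_chars.replace("'", "")

    return BrokenVars


class FixedVars(BaseVars):
    """One possible workaround: override the property in the subclass."""

    @property
    def _re_non_word_chars(self):
        # Ask the parent property for its string value, then edit the string.
        return super()._re_non_word_chars.replace("'", "")


try:
    build_broken_subclass()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print(FixedVars()._re_non_word_chars)
```

The override in `FixedVars` is only one way to get the edited string; whether #1090 takes this approach is not confirmed here.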

Concerning the following remark, I don't have enough information about which CLTK versions your builds were:

Our new colleague will be using the current April build unless the July version is still available, but it's better to try to keep up.

Once your developer gets acquainted with the code base, if you still have the problem, have him share here the output of pip list | grep cltk.
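
If grep is awkward on a given machine, the same information can be read from inside the activated venv. This is a rough sketch, assuming Python 3.8 or newer (where `importlib.metadata` is in the standard library); the helper names `installed_version` and `report_versions` are made up for this example, and checking nltk alongside cltk is just a suggestion, since CLTK depends on it.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def report_versions(package_names):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in package_names}


# Print one line per package for the currently active environment.
for name, version in report_versions(["cltk", "nltk"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in both the working and the failing environment and comparing the two outputs should show whether the installations differ in version.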

Richard Rufus wasn't interested in alchemy, but played an important role in launching the study of philosophy at the University of Paris in the 1100s, but Newton certainly was

Fascinating. Please stay in touch about your project.

alexeyev commented 2 years ago

Dear colleagues, thank you for your amazing work.

I am writing to share that I have exactly the same problem with WordTokenizer for the combination: Win10/WSL1 + Python3.9 + CLTK 0.1.121 (118 as well)

  File "/home/<...>/anaconda3/envs/default/lib/python3.9/site-packages/cltk/tokenize/latin/params.py", line 156, in LatinLanguageVars
    _re_non_word_chars = PunktLanguageVars._re_non_word_chars.replace("'",'')
AttributeError: 'property' object has no attribute 'replace'

With 1.1.6, I get

    from cltk.tokenize.word import WordTokenizer
ModuleNotFoundError: No module named 'cltk.tokenize'

AFAIU, WSL1 is not a supported platform for CLTK, but -- sharing the symptoms just in case.

Best regards, Anton.

alexeyev commented 2 years ago

Oops, the problem remains on WSL2 (Ubuntu 22.04) as well.

Rolling back to nltk==3.5 doesn't help.

alexeyev commented 2 years ago

I have tried the combination of cltk==0.1.121 and nltk==3.5 and this

from cltk.corpus.utils.importer import CorpusImporter
my_latin_downloader = CorpusImporter('latin')
my_latin_downloader.import_corpus('latin_models_cltk')

as suggested in https://github.com/cltk/cltk/issues/1096.

Aaand everything seems to work now, thanks!

What do I do to be able to use the latest CLTK version?

Best regards, Anton.

kylepjohnson commented 2 years ago

@alexeyev We do not support the 0.x versions anymore, but we're glad to hear they still work!

To upgrade to the latest 1.x, you would do pip install -U cltk but I have to warn you that almost everything in it is different. You can read more here: https://docs.cltk.org/en/latest/quickstart.html

alexeyev commented 2 years ago

Ah, so it probably means that I consulted the older docs/examples when designing the token normalization pipeline. Thanks again.

kylepjohnson commented 2 years ago

Yes, sounds like it. Old docs here: https://legacy.cltk.org/en/latest/ and the more recent at the link above.

If you are just getting started with our tools, I strongly recommend using the latest version as described in the Quickstart url, above. If necessary, there are other Latin tokenizers in the project.

alexeyev commented 2 years ago

I strongly recommend using the latest version

Yes, I think I'm going to rewrite everything using the latest CLTK stable version API to be able to support our own codebase later. Thank you!