cltk / cltk

The Classical Language Toolkit
http://cltk.org
MIT License
826 stars 326 forks source link

add Spacy process for Grc #1243

Closed pharos-alexandria closed 1 month ago

pharos-alexandria commented 6 months ago

I've added a Spacy process for Ancient Greek using odyCy (small pipeline) which performs better than the Stanza process for grc.

Caveat: If installing the model, it throws an error at the moment, as odyCy asks for "spacy>=3.5.0, <3.6.0". I do not know how to handle that.

kylepjohnson commented 6 months ago

@pharos-alexandria I am impressed as ever by your contributions. @clemsciences and I have been debating what to do for odycy, since it relies on an older version of spacy — and this version is incompatible with the version used by the Latin model of @diyclassics .

We have two options, please tell me your opinion: (1) we write to the authors of Odycy and ask them to retrain on a recent spacy; (2) we try some kind of on-the-fly patching of spacy depending on which pipeline is chosen.

(2) would solve the problem short-term, but it is a very poor engineering solution for a Python library 😇

pharos-alexandria commented 6 months ago

@kylepjohnson, I would also go for option 1. Shoud I contact them or do you like to do that?

(Another topic would be to allow users to choose which model to use in the SpaCy process.)

Happy new year to you and @clemsciences!

kylepjohnson commented 6 months ago

@pharos-alexandria Καλή Χρονιά!

Shoud I contact them or do you like to do that?

If you kindly would, please do so and cc myself and Clément, so we can help answer any cltk questions.

pharos-alexandria commented 6 months ago

...it will take some days; I'm off to a conference until the weekend.

clemsciences commented 6 months ago

Hello @pharos-alexandria, first of all happy new year!

...it will take some days; I'm off to a conference until the weekend.

I'll write the email tomorrow and I'll cc you and Kyle.


(Another topic would be to allow users to choose which model to use in the SpaCy process.) This is a very important topic. Before loading a process, the coder can load a spaCy model independently from CLTK and set it to the SpacyWrapper:


SpacyWrapper("lat", spacy_model)

This code associates the spaCy model with the Latin language, because when a SpacyProcess is called, the used algorithm is loaded from the SpacyWrapper. One of the remaining problem is that it downloads the default model for Latin even if it's not used in this case.

After having thought, it surely does not work. An else clause is missing here and the code should become:

if not self.nlp:
    self.nlp = self._load_model()
else:
    self.nlps[language] = self.nlp
clemsciences commented 5 months ago

I'll write the email tomorrow and I'll cc you and Kyle.

No news from them. Shoud I try again or maybe someone else can send the email? Maybe it will have more weight.

x-tabdeveloping commented 2 months ago

For the record, we have updated OdyCy to work with the latest version of SpaCy a couple of weeks ago, it should just work fine :D

kylepjohnson commented 2 months ago

Thank you @x-tabdeveloping . @clemsciences and have this at the top of our list.

x-tabdeveloping commented 2 months ago

Wonderful! Let us know if you need anything or experience any issues.

clemsciences commented 1 month ago

I think this is ready for tests!

kylepjohnson commented 1 month ago

The code looks fine @clemsciences . A few requests, before we push to PyPI:

  1. Would you please do one more sanity check: "Run all" (locally) the notebook https://github.com/cltk/cltk/blob/master/notebooks/Demo%20of%20Pipeline%20for%20all%20languages.ipynb -- if it completes without error or many warnings, then I am mostly confident. You can let me know the results here.

  2. In a new notebook, which you do not need to save to the repo, run a large portion of this Greek text: https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.01.0071:speech=18

Watch for any error or warning messages that pop up, then report them here. I ask this because in the previous Spacy model we took on (for Latin), there were about ~6 nonsensical inferences made by the model, which raises a warning in our morphosyntax system. (For example, it might say a adverb is plural.)

clemsciences commented 1 month ago

The code looks fine @clemsciences . A few requests, before we push to PyPI:

1. Would you please do one more sanity check: "Run all" (locally) the notebook https://github.com/cltk/cltk/blob/master/notebooks/Demo%20of%20Pipeline%20for%20all%20languages.ipynb -- if it completes without error or many warnings, then I am mostly confident. You can let me know the results here.

2. In a new notebook, which you do not need to save to the repo, run a large portion of this Greek text: https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.01.0071:speech=18

Watch for any error or warning messages that pop up, then report them here. I ask this because in the previous Spacy model we took on (for Latin), there were about ~6 nonsensical inferences made by the model, which raises a warning in our morphosyntax system. (For example, it might say a adverb is plural.)

I see what you mean, I'm going to work on it later.

clemsciences commented 1 month ago

And another thing: I'll wait for @pharos-alexandria to test it, since she is the author of this pull request.

kylepjohnson commented 1 month ago

@clemsciences I ran the Demonstration notebook and all the important stuff in the code works. Here are the following changes that need to be made, though:

  1. Update spacy in pyproject.toml to 3.7.4
  2. Then run make freeze dependencies (note: I am on Python v. 3.11.8)
  3. make installDev
  4. make notebook and run all cells of https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb (Before updating to 3.7.4, this gave me a warning about possible incompatibility of the Greek model.)
  5. Make this a "minor" update, which could break some people's workflow; so 1.3.0
  6. If you want to push to PyPI yourself @clemsciences , run make publish after the above.
pharos-alexandria commented 1 month ago

I'm off for a week, but I'll test asap. The larger model will surely be better, but of course also needs more resources...

By the way: GreCy also made an update to their models, and the non-trf-models (i.e. lg) now also have NER which is working quite good. You could do a quick test at our new "Classical Language Dictionary": https://cld.bbaw.de/analyzer/text#grc_perseus_lg[ent_type_].

kylepjohnson commented 1 month ago

@pharos-alexandria We value your expert opinion greatly, so please do share when you are ready. Meanwhile, we will promote Odycy to master, since the results look promising.

@clemsciences The changes I asked for, here they are in this branch. You may merge this one into the original PR, or do the steps yourself. https://github.com/cltk/cltk/tree/pharos-alexandria-grcspacy-plus-kj

clemsciences commented 1 month ago

I think we can continue with this first version. Then we'll have to think of how to handle the model variants for a given language.

kylepjohnson commented 1 month ago

 how to handle the model variants for a given language.

I agree that the API needs to provide an easier way to call, and then chain, Processes.