Additional language models needed

adobe / NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing

http://opensource.adobe.com/NLP-Cube/index.html

Apache License 2.0


Additional language models needed #132

Closed Spaskich closed 2 years ago

Spaskich commented 2 years ago

Is your feature request related to a problem? Please describe.
I've been using the previous version of NLP-Cube for a wide array of languages, most of which are not present in 3.0.

Describe the solution you'd like
Updated language models for Czech, Finnish, Greek, Hindi, Indonesian, Portuguese, Russian, Slovak, Slovenian, Swedish, and Turkish.

Describe alternatives you've considered
Can the NLP-Cube 3.0 version work with the older v1.1 models? If so, do you plan on dropping support for them in the future?

tiberiu44 commented 2 years ago

Hi @Spaskich

Thank you for letting us know about this issue. We are going to start training 3.0 models for these languages. Greek, Slovak and Russian should already be supported in 3.0.

The missing languages are:

Czech
Finnish
Hindi
Indonesian
Portuguese
Slovenian
Swedish
Turkish

They will be trained one by one, and I will update this issue as soon as we publish each model.

Regarding compatibility between 1.1 and 3.0: version 3.0 of NLPCube is incompatible with the 1.1 models, mainly because we changed the underlying ML framework for lack of support.

tiberiu44 commented 2 years ago

A quick update: Czech and Finnish should be uploaded and working.

tiberiu44 commented 2 years ago

Hindi is also finished and I just pushed it to the model repository

tiberiu44 commented 2 years ago

Indonesian is finished and uploaded.

tiberiu44 commented 2 years ago

Portuguese is pushed.

tiberiu44 commented 2 years ago

Slovenian is pushed.

tiberiu44 commented 2 years ago

Swedish is pushed.

tiberiu44 commented 2 years ago

@Spaskich - I've just pushed the final language (Turkish). Let me know if you have any issues with the 3.0 models. At first glance, there should be a huge boost in accuracy for the newly added models.

If everything is ok, give me the green light to close the issue.
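For anyone who wants to smoke-test the newly pushed models in one pass, the following is a minimal sketch rather than anything shipped with NLP-Cube. The `check_models` helper and the `LANGUAGE_CODES` table are hypothetical names introduced here. Only `cs` is confirmed as a model code in this thread; the other entries are the standard ISO 639-1 codes and are assumed to match the model names.

```python
from typing import Callable, Dict, Iterable, Optional

# ISO 639-1 codes for the languages listed in this issue.
# Assumption: the 3.0 models are keyed by these codes. Only "cs" is
# confirmed in the thread.
LANGUAGE_CODES = {
    "Czech": "cs",
    "Finnish": "fi",
    "Hindi": "hi",
    "Indonesian": "id",
    "Portuguese": "pt",
    "Slovenian": "sl",
    "Swedish": "sv",
    "Turkish": "tr",
}


def check_models(
    load_fn: Callable[[str], object],
    codes: Iterable[str],
) -> Dict[str, Optional[str]]:
    """Try to load every code; map each code to an error message, or None if it loaded."""
    results: Dict[str, Optional[str]] = {}
    for code in codes:
        try:
            load_fn(code)
            results[code] = None
        except Exception as exc:  # keep going so one bad model does not hide the rest
            results[code] = f"{type(exc).__name__}: {exc}"
    return results


def format_report(results: Dict[str, Optional[str]]) -> str:
    """Render one line per language code: 'cs: OK' or 'cs: FAILED (reason)'."""
    lines = []
    for code, error in results.items():
        status = "OK" if error is None else f"FAILED ({error})"
        lines.append(f"{code}: {status}")
    return "\n".join(lines)
```

With NLP-Cube installed, `load_fn` would be whatever callable loads a model by language code, for example the `load` method of a `Cube` object imported from `cube.api`. Passing the loader in as an argument keeps the helper independent of the installed package version.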

Spaskich commented 2 years ago

I tested all the languages. Everything works well except for Czech: it can't find the model. Is the language code still the same (cs)?

On an unrelated note, is the pip repository updated with the newest version? When I try to run the example code from the readme, I get the following error:

Traceback (most recent call last):
  File ".../main.py", line 1, in <module>
    from cube.api import Cube       # import the Cube object
  File "...\Python\Python37-32\lib\site-packages\cube\__init__.py", line 1, in <module>
    from api import *
ModuleNotFoundError: No module named 'api'

I've tried replacing "from api" with "from cube.api" in the installed library, but then I get errors for multiple missing packages: requests, urllib2, and StringIO, to name a few.

tiberiu44 commented 2 years ago

Yes, it should have the same name (cs). Maybe you need to clear the cache: rm -rf ~/.nlpcube/3.0/cs*

Yes, the pip package has the latest version. How did you test the other languages if you are getting that error? Was it a local installation?
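The cache-clearing command above assumes a Unix shell, while the traceback earlier in the thread shows a Windows path. A portable equivalent can be sketched in Python as follows. The `clear_cached_models` function is a hypothetical helper, not part of NLP-Cube, and the default cache location is taken directly from the `rm` command (`~/.nlpcube/3.0`). Whether the cache lives at the same relative path on Windows is an assumption.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def clear_cached_models(lang_code: str, cache_dir: Optional[Path] = None) -> List[Path]:
    """Delete every cache entry whose name starts with lang_code.

    Mirrors `rm -rf <cache_dir>/<lang_code>*` and returns the removed paths.
    """
    if cache_dir is None:
        # Default taken from the command in this thread; assumed, not verified,
        # to be the same on every platform.
        cache_dir = Path.home() / ".nlpcube" / "3.0"

    removed: List[Path] = []
    if not cache_dir.is_dir():
        return removed  # nothing cached yet

    # Materialise the matches first so deletion does not disturb iteration.
    for entry in sorted(cache_dir.glob(f"{lang_code}*")):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry)
    return removed


if __name__ == "__main__":
    for path in clear_cached_models("cs"):
        print("removed", path)
```

Like the shell glob, the prefix match removes anything whose name begins with `cs`, so it would also catch other entries that happen to share that prefix.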

Spaskich commented 2 years ago

I tested them by modifying and running the whole project.

tiberiu44 commented 2 years ago

Czech had a packaging issue. It is now fixed and pushed, but you will have to clear the cache: rm -rf ~/.nlpcube/3.0/cs*

Regarding the other issue, I don't know what is happening. Maybe there is a package confusion in your local environment. I just tried running NLPCube from scratch in Google Colaboratory, and it worked without issues. This is the link: https://colab.research.google.com/drive/16774lm4UcW_30REm0_60CXFshn8BH4L3?usp=sharing
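One way to investigate a suspected package confusion is to ask Python which file the name `cube` resolves to and which version of the distribution is installed, without actually importing the module. The sketch below uses only the standard library; `module_origin` and `installed_version` are hypothetical helper names. The distribution name `nlpcube` is assumed to be the pip package referred to in this thread. The modules reported missing earlier (urllib2, StringIO) exist only in Python 2, which may point to an older or unrelated `cube` package being picked up, though the thread does not confirm this.

```python
import importlib.util
from typing import Optional

try:
    from importlib import metadata  # in the standard library from Python 3.8
except ImportError:  # older interpreters, e.g. Python 3.7
    metadata = None


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, without importing it."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if it cannot be determined."""
    if metadata is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # If the first path does not sit inside the expected site-packages
    # directory, another package or a local folder is shadowing the module.
    print("cube resolves to:", module_origin("cube"))
    print("nlpcube version:", installed_version("nlpcube"))
```

If `cube resolves to` prints a path outside the environment where the package was installed, or the version is reported as `None`, reinstalling into a clean virtual environment is a reasonable next step.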

Spaskich commented 2 years ago

Okay, it's probably something on my end then, I'll look into it. Thanks for the help and the quick response with the models!

tiberiu44 commented 2 years ago

No problem. Glad to help.

tiberiu44 commented 2 years ago

I just noticed that CS has some issues with compound words. I will have to retrain the tokenizer. Sorry for this.

tiberiu44 commented 2 years ago

Done. Model is pushed. There is also a package update.

Spaskich commented 2 years ago

Great! Thanks a lot for the help!