explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Installation issue on old macOSes for new Korean tokenizer in v4.0 alpha #12416

Open BLKSerene opened 1 year ago

BLKSerene commented 1 year ago

Hi, I noticed from #12328 that spaCy has switched to pymecab-ko for the Korean tokenizer in the upcoming spaCy 4.0, but there seem to be installation/import issues with this package on older versions of macOS (cf. pymecab-ko/#5).

I've confirmed on OS X 10.11 that python-mecab-ko, the alternative mentioned in #12328, can be compiled, installed, and imported successfully. Would it be possible to add it as another alternative for the Korean tokenizer in spaCy 4.0?


adrianeboyd commented 1 year ago

Thanks for the note! It does look like the package python-mecab-ko has had a better set of published wheels since their updates in December. We will evaluate it and consider switching to python-mecab-ko for spacy v4 or at least adding it as an alternative.

You can always write a short custom tokenizer if you need one; the code would look similar to this:

https://github.com/explosion/spaCy/blob/520279ff7c9af199928e2a727999162cb79c38a3/spacy/lang/ko/__init__.py#L25-L75
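For readers who do not want to follow the link, here is a minimal sketch of what such a custom tokenizer could look like. It is not the code from the linked file: the names `words_and_spaces` and `MorphemeTokenizer` are hypothetical, and the morphological analyzer is passed in as a plain callable so that any MeCab binding can be plugged in. It also assumes that every morpheme the analyzer returns appears verbatim in the input text.

```python
from typing import Callable, List, Sequence, Tuple


def words_and_spaces(text: str, surfaces: Sequence[str]) -> Tuple[List[str], List[bool]]:
    """Align morpheme surface forms with the text they came from.

    Returns the words plus, for each word, whether it is directly
    followed by a whitespace character in the original text.
    """
    words: List[str] = []
    spaces: List[bool] = []
    pos = 0
    for surface in surfaces:
        start = text.find(surface, pos)
        if start < 0:
            # Assumption of this sketch: surfaces are substrings of the text.
            raise ValueError(f"surface {surface!r} not found after offset {pos}")
        end = start + len(surface)
        words.append(surface)
        spaces.append(end < len(text) and text[end].isspace())
        pos = end
    return words, spaces


class MorphemeTokenizer:
    """Hypothetical tokenizer: any callable taking a str and returning a Doc
    can be assigned to nlp.tokenizer."""

    def __init__(self, vocab, morphs: Callable[[str], Sequence[str]]):
        self.vocab = vocab
        # morphs maps a string to its list of morpheme surface forms.
        self.morphs = morphs

    def __call__(self, text: str):
        # Imported here so the alignment helper above runs without spaCy installed.
        from spacy.tokens import Doc

        words, spaces = words_and_spaces(text, self.morphs(text))
        return Doc(self.vocab, words=words, spaces=spaces)
```

To try it, something like `nlp = spacy.blank("xx")` followed by `nlp.tokenizer = MorphemeTokenizer(nlp.vocab, MeCab().morphs)` should work, assuming python-mecab-ko is installed and its `MeCab` class (imported with `from mecab import MeCab`) exposes a `morphs` method. Unlike the built-in Korean tokenizer, this sketch keeps only the surface forms and sets no tags or lemmas.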