Closed alexcombessie closed 3 years ago
Hi!
I occasionally hear about problems with the current dictionary linking mechanism, which uses symlinks, but the Sudachi team hasn't settled on an alternative yet.
There is a fairly old pull request (#108) to use a config file at $XDG_CONFIG_PATH/sudachipy/config.json. I have also heard a suggestion (on our Slack channel) to use an environment variable, e.g., SUDACHIDICT_PATH.
I am an outside contributor (I recently moved on from the company behind Sudachi), so I am not in a position to decide the direction. Maybe the main contributors @kazuma-t, @chikurin66, and others have better ideas.
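For illustration only, the two suggestions above (an environment variable, then a config file) could be combined into a lookup like the sketch below. This is not SudachiPy code: the function name resolve_dict_path and the "dict_path" key are hypothetical, and the variable names SUDACHIDICT_PATH and XDG_CONFIG_PATH are simply the ones mentioned in this thread.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_dict_path(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a dictionary path from the environment or a config file, else None.

    Hypothetical helper: neither the function nor the "dict_path" key
    exists in SudachiPy.
    """
    # 1. The environment variable suggested on Slack takes precedence.
    explicit = env.get("SUDACHIDICT_PATH")
    if explicit:
        return explicit

    # 2. Fall back to the config file location proposed in #108
    #    (variable name used exactly as written in this thread).
    config_root = env.get("XDG_CONFIG_PATH")
    if config_root:
        config_file = Path(config_root) / "sudachipy" / "config.json"
        if config_file.is_file():
            with config_file.open(encoding="utf-8") as f:
                # "dict_path" is an assumed key name, not taken from #108.
                return json.load(f).get("dict_path")

    # 3. Nothing configured: the caller decides what the default is.
    return None
```

Both options still require the user to control the environment or the config directory, which may not be possible in every deployment.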
Thanks for the quick reply. Unfortunately, in my case I wouldn't be able to take advantage of SUDACHIDICT_PATH or XDG_CONFIG_PATH, since I cannot control these in my secure environment. I need correct behavior out of the box right after pip install sudachipy sudachidict_core, without any symlink or variable-setting operations.
In my scenario, since I only need the core dictionary, what I would need is for sudachidict_core to be called sudachidict, with symlinks created only if a different dictionary is used.
Alternatively, I would suggest packaging with setup.cfg so that pip install sudachipy[<dic_type>] does everything without needing a symlink. The advantage is that it would work for all dictionaries and would not introduce any breaking changes.
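To make the square-bracket idea concrete, here is a minimal sketch of how the extras could be declared. It is written as a Python helper for a setup.py rather than as setup.cfg, and the mapping and function name are my assumptions, not the project's actual packaging.

```python
from typing import Dict, List

# Assumed mapping from the <dic_type> extra to the dictionary package on PyPI.
DICT_PACKAGES = {
    "core": "sudachidict_core",
    "small": "sudachidict_small",
    "full": "sudachidict_full",
}


def build_extras() -> Dict[str, List[str]]:
    """Build an extras_require mapping with one extra per dictionary type."""
    return {dic_type: [package] for dic_type, package in DICT_PACKAGES.items()}


# In a setup.py this would be passed along the lines of:
#     setup(..., extras_require=build_extras())
# so that `pip install sudachipy[core]` also installs sudachidict_core.
```

Note that extras only control what gets installed. The library would still need to locate the installed dictionary package at runtime without a symlink.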
I see.
The pip square bracket notation sounds like a reasonable option to consider.
Hi!
You can also specify the dictionary path in sudachi.json.
https://github.com/WorksApplications/SudachiPy#dictionary-in-the-setting-file
Would you try this?
# ex.
svn export https://github.com/WorksApplications/SudachiPy/trunk/sudachipy/resources
Edit sudachi.json and specify systemDict:
{
"systemDict" : "path/to/system.dic",
"characterDefinitionFile" : ...
}
$ echo "カンヌ国際映画祭" | sudachipy -r /path/to/resources/sudachi.json -m C
カンヌ 名詞,固有名詞,地名,一般,*,* カンヌ
国際 名詞,普通名詞,一般,*,*,* 国際
映画祭 名詞,普通名詞,一般,*,*,* 映画祭
EOS
>>> from sudachipy import tokenizer
>>> from sudachipy import dictionary
>>> tokenizer_obj = dictionary.Dictionary(config_path='/path/to/resources/sudachi.json', resource_dir='/path/to/resources').create()
>>> mode = tokenizer.Tokenizer.SplitMode.C
>>> [m.surface() for m in tokenizer_obj.tokenize("カンヌ国際映画祭", mode)]
['カンヌ', '国際', '映画祭']
Hi,
Embedding the dict may be an option, but it's far from ideal as it would introduce a manual dependency, increase the weight of my package and make upgrades a complex process.
I would rather have a pure pip option that does not require symlinks.
Cheers,
Alex
Hi @sorami @t-yamamura,
Happy new year!
I am writing to ask whether you have any update on this topic. Specifically, I am referring to:
pip install sudachipy[<dic_type>] (...) without needing to symlink
Cheers,
Alex
Hi @alexcombessie,
Happy new year!
I would like to replace the current symlink-based dictionary linking mechanism with another approach.
Currently, I am investigating the best way to link sudachidict, such as a pip option.
Thanks!
Hi,
Any update on the subject?
Thanks,
Alex
Hi, I am also experiencing this issue and would be interested in an update.
Thank you!
Hi,
Sorry for following up on the topic. Is there any chance this may be addressed this year?
As I mentioned, this issue is blocking any integration with sudachipy in my Python application, so no Japanese support 😞
I appreciate again the work you are doing here, and wish you well.
Alex
Hi,
I am planning to change the current dictionary linking mechanism, because it can often cause a permission error.
I think required features for connecting SudachiPy with SudachiDict are as follows:
Therefore, I'm going to use sudachi.json instead of a symlink. sudachi.json has a dictionary path option, systemDict, so SudachiPy can select the system dictionary by overwriting systemDict. I think this change will avoid the permission errors.
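As a rough sketch of what overwriting systemDict could look like (my assumption of one possible approach, not the final implementation), the helper below copies an existing settings file, replaces systemDict, and writes the result to a temporary location. The function name is hypothetical.

```python
import json
import os
import tempfile


def write_settings_with_dict(base_settings_path: str, system_dict_path: str) -> str:
    """Copy a sudachi.json, override systemDict, and return the new file's path.

    Hypothetical helper, not part of SudachiPy. The copy is written to the
    system temp directory so nothing under site-packages is modified.
    """
    with open(base_settings_path, encoding="utf-8") as f:
        settings = json.load(f)

    # Point the settings at the chosen dictionary instead of a symlink target.
    settings["systemDict"] = str(system_dict_path)

    fd, out_path = tempfile.mkstemp(prefix="sudachi_", suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        json.dump(settings, out, ensure_ascii=False, indent=2)
    return out_path
```

The returned path could then be passed as config_path to dictionary.Dictionary, as in the example earlier in this thread. All other settings in the file are carried over unchanged.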
I guess I can take care of this issue from next week.
If you have any ideas or suggestions please comment.
Hi,
Thanks for the interesting work!
I need to package sudachipy on secure Linux servers where code-envs are isolated from the runtime. As a result, someone running the code cannot perform the symlink operation required to point to the dictionary.
To be precise, we get the error:
[Errno 13] Permission denied: '/data/dss-home/dss_design_8/code-envs/python/plugin_nlp-preparation_managed/lib/python3.6/site-packages/sudachidict_core' -> '/data/dss-home/dss_design_8/code-envs/python/plugin_nlp-preparation_managed/lib/python3.6/site-packages/sudachidict'
I found a similar issue which also relates to permissions: https://github.com/WorksApplications/SudachiPy/issues/107
Unfortunately, I won't be able to use SudachiPy for my application until the dictionary linking mechanism changes. Ideally, if both sudachipy and sudachidict_core are installed, there shouldn't be a need to create an additional symlink at runtime.
Cheers,
Alex
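To illustrate what I mean by not needing a symlink, here is a minimal sketch of locating the dictionary directly from the installed package. The function name is hypothetical, and the resources/system.dic layout inside sudachidict_core is an assumption on my part.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_installed_dict(package: str = "sudachidict_core") -> Optional[Path]:
    """Locate system.dic inside an installed dictionary package, read-only.

    Hypothetical helper: assumes the package ships its dictionary at
    <package>/resources/system.dic. Returns None if the package or the
    file is missing.
    """
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None

    # spec.origin points at the package's __init__.py.
    candidate = Path(spec.origin).parent / "resources" / "system.dic"
    return candidate if candidate.is_file() else None
```

Because this only reads from site-packages, it would not hit the permission error above.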