WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

SudachiPy doesn't work in secure environments where users cannot create symlinks on code-envs #148

Closed alexcombessie closed 3 years ago

alexcombessie commented 3 years ago

Hi,

Thanks for the interesting work!

I need to package sudachipy on secure Linux servers where code-envs are isolated from the runtime. As a result, the user running the code cannot perform the symlink operation required to point to the dictionary.

To be precise, we get the error: [Errno 13] Permission denied: '/data/dss-home/dss_design_8/code-envs/python/plugin_nlp-preparation_managed/lib/python3.6/site-packages/sudachidict_core' -> '/data/dss-home/dss_design_8/code-envs/python/plugin_nlp-preparation_managed/lib/python3.6/site-packages/sudachidict'
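For anyone who wants to check whether their environment is affected, here is a minimal sketch that probes whether the current user may create a symlink in a given directory (for example, the site-packages directory from the error above). The helper name `can_create_symlink` is made up for illustration and is not part of SudachiPy.

```python
import os
import uuid


def can_create_symlink(directory: str) -> bool:
    """Return True if the current user can create a symlink inside `directory`."""
    # The target does not need to exist: a dangling symlink is enough for the probe.
    target = os.path.join(directory, "probe_target")
    link = os.path.join(directory, f"probe_link_{uuid.uuid4().hex}")
    try:
        os.symlink(target, link)
    except OSError:
        # PermissionError ([Errno 13]) and FileNotFoundError are both OSError subclasses.
        return False
    # Clean up the probe so the directory is left unchanged.
    os.remove(link)
    return True
```

In a locked-down code-env this would be expected to return False for the site-packages directory, which matches the failure shown in the error message.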

I found a similar issue which also relates to permissions: https://github.com/WorksApplications/SudachiPy/issues/107

Unfortunately, I won't be able to use SudachiPy for my application until the dictionary linking mechanism changes. Ideally, if both sudachipy and sudachidict_core are installed, there shouldn't be any need to create an additional symlink at runtime.

Cheers,

Alex

sorami commented 3 years ago

Hi!

I occasionally hear about problems with the current symlink-based dictionary linking mechanism. However, the Sudachi team hasn't settled on an alternative yet.

There is a fairly old pull request to use a config file, $XDG_CONFIG_PATH/sudachipy/config.json (#108). I also heard a suggestion (on our Slack channel) to use an environment variable, e.g., SUDACHIDICT_PATH.
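To make the two ideas concrete, here is a rough sketch of what such a lookup could look like. It is not taken from #108 or from SudachiPy itself: the function name `resolve_dict_path`, the JSON key `systemDict` inside config.json, and the precedence (environment variable first, then config file) are all assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Mapping, Optional


def resolve_dict_path(env: Mapping[str, str],
                      config_file: Optional[Path] = None) -> Optional[str]:
    """Resolve the dictionary path without touching any symlink."""
    # 1. An explicit environment variable wins (assumed precedence).
    path = env.get("SUDACHIDICT_PATH")
    if path:
        return path
    # 2. Otherwise fall back to a user-level config file, if one exists.
    if config_file is not None and config_file.is_file():
        with config_file.open(encoding="utf-8") as f:
            config = json.load(f)
        # "systemDict" is a hypothetical key name for this sketch.
        return config.get("systemDict")
    # 3. Nothing configured: the caller has to decide on a default.
    return None
```

A caller would pass `os.environ` as `env` and the config.json location as `config_file`.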

I am an outside contributor (I recently left the company behind Sudachi), so I am not in a position to decide the direction. Maybe the main contributors @kazuma-t, @chikurin66, and others have better ideas.

alexcombessie commented 3 years ago

Thanks for the quick reply. Unfortunately, in my case I wouldn't be able to take advantage of SUDACHIDICT_PATH or XDG_CONFIG_PATH, since I cannot control these in my secure environment. I need it to work out of the box right after pip install sudachipy sudachidict_core, without any symlink creation or variable setting.

In my scenario, since I only need the core dictionary, what I would need is for sudachidict_core to be called sudachidict and to create symlinks only if the dictionary is different.

Alternatively, I may suggest a packaging with setup.cfg where pip install sudachipy[<dic_type>] does everything without needing to symlink. The advantage of this is that it would work for all dictionaries and not introduce any breaking changes.
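As a sketch of what I mean, here is the setup.py equivalent of the setup.cfg declaration (written in Python only to keep the example self-contained). The extra names `small`, `core`, and `full` are my own suggestion; SudachiDict-small, SudachiDict-core, and SudachiDict-full are the dictionary distributions on PyPI.

```python
# Hypothetical fragment for sudachipy's setup.py: one extra per dictionary type.
EXTRAS_REQUIRE = {
    "small": ["SudachiDict-small"],
    "core": ["SudachiDict-core"],
    "full": ["SudachiDict-full"],
}

# In setup.py this mapping would be passed to setuptools as
#   setuptools.setup(name="SudachiPy", extras_require=EXTRAS_REQUIRE)
# so that `pip install sudachipy[core]` also installs SudachiDict-core.
```

Note that extras only control which distributions pip installs. SudachiPy would still need to locate the installed dictionary package at runtime without a symlink.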

sorami commented 3 years ago

I see.

The pip square bracket notation sounds like a reasonable option to consider.

t-yamamura commented 3 years ago

Hi!

You can also specify the dictionary path in sudachi.json. https://github.com/WorksApplications/SudachiPy#dictionary-in-the-setting-file

Could you try this?


  1. Download the following directory: https://github.com/WorksApplications/SudachiPy/tree/develop/sudachipy/resources
# e.g.
svn export https://github.com/WorksApplications/SudachiPy/trunk/sudachipy/resources
  2. Open sudachi.json and specify systemDict:
{
    "systemDict" : "path/to/system.dic",
    "characterDefinitionFile" : ...
}
  3. Run sudachipy, either from the command line or from Python:
$ echo "カンヌ国際映画祭" | sudachipy -r /path/to/resources/sudachi.json -m C
カンヌ 名詞,固有名詞,地名,一般,*,*   カンヌ
国際  名詞,普通名詞,一般,*,*,*    国際
映画祭 名詞,普通名詞,一般,*,*,*    映画祭
EOS
>>> from sudachipy import tokenizer
>>> from sudachipy import dictionary
>>> tokenizer_obj = dictionary.Dictionary(config_path='/path/to/resources/sudachi.json', resource_dir='/path/to/resources').create()
>>> mode = tokenizer.Tokenizer.SplitMode.C
>>> [m.surface() for m in tokenizer_obj.tokenize("カンヌ国際映画祭", mode)]
['カンヌ', '国際', '映画祭']
alexcombessie commented 3 years ago

Hi,

Embedding the dict may be an option, but it's far from ideal as it would introduce a manual dependency, increase the weight of my package and make upgrades a complex process.

I would rather have a pure pip option which does not require symlinks.

Cheers,

Alex

alexcombessie commented 3 years ago

Hi @sorami @t-yamamura,

Happy new year!

I am writing to know if you have any update on the topic. Specifically, I am referring to

pip install sudachipy[<dic_type>] (...) without needing to symlink

Cheers,

Alex

t-yamamura commented 3 years ago

Hi @alexcombessie,

Happy new year!

I would like to replace the current symlink-based dictionary linking mechanism with something else. I am currently investigating the best way to link sudachidict, such as a pip option.

alexcombessie commented 3 years ago

Thanks!

alexcombessie commented 3 years ago

Hi,

Any update on the subject?

Thanks,

Alex

curt-mitch commented 3 years ago

Hi, I am also experiencing this issue and would be interested in an update.

Thank you!

alexcombessie commented 3 years ago

Hi,

Sorry for following up on the topic. Is there any chance this may be addressed this year?

As I mentioned, this issue is blocking any integration with sudachipy in my Python application, so no Japanese support 😞

I appreciate again the work you are doing here, and wish you well.

Alex

t-yamamura commented 3 years ago

Hi,

I am planning to change the current dictionary linking mechanism, because it often causes permission errors.

I think required features for connecting SudachiPy with SudachiDict are as follows:

Therefore, I'm going to use sudachi.json instead of a symlink. sudachi.json has a dictionary path option, systemDict, so SudachiPy can select the system dictionary by overwriting systemDict. I think this change will avoid the permission errors.
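Roughly, the idea could look like the sketch below. This is only an illustration, not the final implementation: the helper names `find_system_dic` and `with_system_dict` are hypothetical, and the sketch assumes the dictionary package ships its dictionary at `<package>/resources/system.dic`.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_system_dic(package: str = "sudachidict_core") -> Optional[Path]:
    """Locate <package>/resources/system.dic through the import system (no symlink)."""
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        # The package is not installed, or the name is not a package.
        return None
    for location in spec.submodule_search_locations:
        # Assumed layout: the dictionary lives under the package's resources directory.
        candidate = Path(location) / "resources" / "system.dic"
        if candidate.is_file():
            return candidate
    return None


def with_system_dict(settings: dict, dic_path: Path) -> dict:
    """Return a copy of the sudachi.json settings with systemDict overwritten."""
    updated = dict(settings)
    updated["systemDict"] = str(dic_path)
    return updated
```

Because the settings are overwritten in memory, nothing has to be written into site-packages, so it should also work when the code-env is read-only.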

I expect to be able to start working on this issue next week.

If you have any ideas or suggestions please comment.