cordercorder / nmt-multi

Codebase for multilingual neural machine translation
MIT License

how to run spm_train in multilingual_preprocess.sh #3

Open altctrl00 opened 1 year ago

altctrl00 commented 1 year ago

In scripts/ted/data_process/multilingual_preprocess.sh you use spm_train to train the SentencePiece model. Is that from fairseq? How can I run it?

cordercorder commented 1 year ago

spm_train is a command-line tool from the sentencepiece toolkit. Please install sentencepiece first; the shell script you mentioned will then run correctly.
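For reference, a minimal invocation might look like the following; the corpus and model names are placeholders, and the flags shown are standard sentencepiece options rather than the exact settings used in this repository:

```shell
# Train a SentencePiece model on a (hypothetical) concatenated
# multilingual corpus. All flags are standard sentencepiece options.
spm_train \
  --input=train.all.txt \
  --model_prefix=spm_multilingual \
  --vocab_size=32000 \
  --model_type=unigram \
  --character_coverage=1.0
```

This writes spm_multilingual.model and spm_multilingual.vocab to the current directory.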

altctrl00 commented 1 year ago

Thanks for responding so quickly. I found I can't install sentencepiece as a command-line tool because I am not root. Sorry, I am a rookie; is it possible to install it as a non-root user?

cordercorder commented 1 year ago

Yes. You can run pip install sentencepiece or conda install sentencepiece to install it. After that, the command-line tools provided by sentencepiece can be used directly. There is no need to build and install sentencepiece from source, which may require root privileges to install the build tools.
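A quick way to confirm that such a user-level install worked is to check that the spm_train binary landed on the environment's PATH; no root access is involved:

```shell
# Install into the active conda or virtualenv environment; no root needed.
pip install sentencepiece

# The CLI tools are placed on the environment's PATH alongside python:
command -v spm_train   # prints the tool's path if the install succeeded
```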

altctrl00 commented 1 year ago

Thanks a lot. conda install sentencepiece will be my solution, since pip install may not be compatible with conda.

cordercorder commented 1 year ago

There may be some discrepancy between the sentencepiece packages from pip and conda. Since sentencepiece in my Python environment was installed through conda install sentencepiece and the command-line tools work well, I assumed pip install sentencepiece would also work :disappointed_relieved:.

altctrl00 commented 1 year ago

When I run bash scripts/ted/data_process/multilingual_preprocess.sh from the nmt-multi directory, Python can't find the nmt module in python -u ${project_dir}/nmt/data_handling/corpus_manager.py. My project directory is /home/.../nmt-multi. I was curious whether adding __init__.py would help, but it did not. I edited corpus_manager.py, changing nmt.data_handling to data_utils, and then it works.

altctrl00 commented 1 year ago

In data_handling/data_utils there is an import, from nmt.tools import Converter, but I couldn't find nmt.tools.

cordercorder commented 1 year ago

Thanks for reporting these issues.

You can insert the path of the nmt-multi directory into the PYTHONPATH environment variable to make the Python interpreter aware of the nmt package. python -u ${project_dir}/nmt/data_handling/corpus_manager.py will work afterward. Below is an example:

export PYTHONPATH=/path/to/nmt-multi:${PYTHONPATH}
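To illustrate the mechanism with a self-contained sketch, the scratch directory and dummy nmt package below stand in for the real /path/to/nmt-multi checkout:

```shell
# Create a scratch "project" containing an importable dummy nmt package.
project_dir=$(mktemp -d)
mkdir -p "${project_dir}/nmt"
touch "${project_dir}/nmt/__init__.py"

# Prepending the project root to PYTHONPATH makes `import nmt` resolve,
# just as exporting the real nmt-multi path does for this repository.
export PYTHONPATH=${project_dir}:${PYTHONPATH}
python3 -c "import nmt; print('import succeeded')"
```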

When I run bash scripts/ted/data_process/multilingual_preprocess.sh from the nmt-multi directory, Python can't find the nmt module in python -u ${project_dir}/nmt/data_handling/corpus_manager.py. My project directory is /home/.../nmt-multi. I was curious whether adding __init__.py would help, but it did not. I edited corpus_manager.py, changing nmt.data_handling to data_utils, and then it works.

Sorry, this was a mistake made while cleaning up the source code. Please delete that import line.

In data_handling/data_utils there is an import, from nmt.tools import Converter, but I couldn't find nmt.tools.

cordercorder commented 1 year ago

Hi, I pushed a new commit to this repository; the changed files can be found here. Does the script run well now?

altctrl00 commented 1 year ago

Thanks, it runs well now.