Description
I tried to run the code block that formats the downloaded Wikipedia text files (link: https://github.com/dmlc/gluon-nlp/blob/master/scripts/datasets/pretrain_corpus/README.md#wikipedia), but got an ImportError, even though the WikiExtractor.py file is present in the mentioned directory.
Error Message
ImportError: Cannot import WikiExtractor! You can download the "WikiExtractor.py" in https://github.com/attardi/wikiextractor to /home/ec2-user/gluon-nlp/gluon-nlp/scripts/datasets/pretrain_corpus/gluon-nlp/scripts/datasets/pretrain_corpus
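As a sanity check, this sketch mirrors what the failing import presumably does: prepend the directory named in the error message to sys.path and attempt the import. The helper name `can_import` is hypothetical, not part of gluon-nlp.

```python
import importlib
import sys

def can_import(module_name: str, extra_dir: str) -> bool:
    """Return True when module_name becomes importable after
    prepending extra_dir to sys.path; sys.path is restored afterwards."""
    sys.path.insert(0, extra_dir)
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False
    finally:
        sys.path.pop(0)

# The directory below is the one named in the error message.
print(can_import(
    "WikiExtractor",
    "/home/ec2-user/gluon-nlp/gluon-nlp/scripts/datasets/pretrain_corpus",
))
```

On my instance this prints False, matching the ImportError above, even though WikiExtractor.py appears to be in that directory.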
To Reproduce
Steps to reproduce
Create an MXNet 2.0 Python 3.6 environment
conda create -n mxnet2_p36 python=3.6
source activate mxnet2_p36
Check the CUDA version
nvcc --version
Install mxnet-cu100 2.0
python3 -m pip install -U --pre "mxnet-cu100>=2.0.0b20200926" -f https://dist.mxnet.io/python
Git clone from gluon nlp
git clone -b master https://github.com/dmlc/gluon-nlp.git
cd to gluon-nlp
cd gluon-nlp/
Install gluon-nlp
python3 -m pip install -U -e ."[extras]"
Check nlp_data
nlp_data help
Check nlp_process
nlp_process help
Download Hindi Wikipedia corpus
python3 prepare_wikipedia.py --mode download --lang hi --date latest -o ./
Properly format the text files
python3 prepare_wikipedia.py --mode format -i [path-to-wiki.xml.bz2] -o ./
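Following the error message's own suggestion, a possible workaround is to download WikiExtractor.py into the directory the script searches. This is only a sketch: the raw-file URL assumes WikiExtractor.py sits at the root of the attardi/wikiextractor repository, which may not hold for newer revisions.

```python
import urllib.request

def fetch(url: str, dest: str) -> str:
    """Download url to dest and return the destination path."""
    path, _headers = urllib.request.urlretrieve(url, dest)
    return path

# Example (the URL layout is an assumption about attardi/wikiextractor):
# fetch("https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py",
#       "/home/ec2-user/gluon-nlp/gluon-nlp/scripts/datasets/pretrain_corpus/WikiExtractor.py")
```

I have not confirmed whether this resolves the import, since the file already appears to be present in that directory on my instance.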
I am trying this on a SageMaker ml.t2.medium instance.