dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Issue in formatting the text wikipedia files #1416

Closed: preeyank5 closed this issue 3 years ago

preeyank5 commented 3 years ago

Description

I tried to run the command that formats the downloaded Wikipedia text files (see https://github.com/dmlc/gluon-nlp/blob/master/scripts/datasets/pretrain_corpus/README.md#wikipedia), but it failed with an ImportError, even though the WikiExtractor.py file is present in the mentioned directory.

Error Message

ImportError: Cannot import WikiExtractor! You can download the "WikiExtractor.py" in https://github.com/attardi/wikiextractor to /home/ec2-user/gluon-nlp/gluon-nlp/scripts/datasets/pretrain_corpus/gluon-nlp/scripts/datasets/pretrain_corpus
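The directory printed in the message repeats the gluon-nlp/scripts/datasets/pretrain_corpus segment, so it may not be the directory where WikiExtractor.py was actually placed. A minimal sketch for checking this is below. It assumes the module is looked up by putting a directory on sys.path and importing it by name, which is a guess and not something confirmed from prepare_wikipedia.py; the helper name can_import_from is hypothetical.

```python
import importlib
import os
import sys


def can_import_from(directory, module_name="WikiExtractor"):
    """Return (ok, detail) for importing `module_name` with `directory` on sys.path.

    Hypothetical diagnostic helper; it is not part of gluon-nlp.
    """
    directory = os.path.abspath(directory)
    module_file = os.path.join(directory, module_name + ".py")

    # First check that the file really is where the error message expects it.
    if not os.path.isfile(module_file):
        return False, "no file at " + module_file

    # Temporarily put the directory first on the module search path.
    sys.path.insert(0, directory)
    try:
        importlib.invalidate_caches()
        importlib.import_module(module_name)
        return True, "imported from " + module_file
    except ImportError as exc:
        # The file exists, but it (or one of its own imports) failed to load.
        return False, "file exists but import failed: {}".format(exc)
    finally:
        sys.path.remove(directory)


# Example usage: pass the directory copied from the error message.
# ok, detail = can_import_from("/path/printed/in/the/error/message")
# print(ok, detail)
```

If the first check fails, the file is simply not in the directory the script is looking in. If the file exists but the import still fails, the detail string shows the underlying ImportError, which can point to a missing dependency of WikiExtractor.py itself.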

To Reproduce


Steps to reproduce

Create an MXNet 2.0, Python 3.6 environment

conda create -n mxnet2_p36 python=3.6
source activate mxnet2_p36

Check the CUDA version

nvcc --version

Install mxnet-cu100 2.0

python3 -m pip install -U --pre "mxnet-cu100>=2.0.0b20200926" -f https://dist.mxnet.io/python

Clone gluon-nlp from GitHub

git clone -b master https://github.com/dmlc/gluon-nlp.git

cd to gluon-nlp

cd gluon-nlp/

Install gluon-nlp

python3 -m pip install -U -e ."[extras]"

Check nlp_data

nlp_data help

Check nlp_process

nlp_process help

Download Hindi Wikipedia corpus

python3 prepare_wikipedia.py --mode download --lang hi --date latest -o ./

Properly format the text files

python3 prepare_wikipedia.py --mode format -i [path-to-wiki.xml.bz2] -o ./

I am running this on a SageMaker ml.t2.medium instance.

sxjscience commented 3 years ago

@ZheyuYe Would you have time to take a look?

sxjscience commented 3 years ago

As confirmed by @preeyank5 , this has been fixed by https://github.com/dmlc/gluon-nlp/pull/1417 so I've closed the issue.