facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Missing WikiExtractor.py file when running get-data-wiki.sh #335

Closed wayi1 closed 2 years ago

wayi1 commented 3 years ago

It seems like the file WikiExtractor.py is in the wrong path.

Running the command below, taken from the instructions, fails with an error about a missing WikiExtractor.py file.

Correcting the path in the get-data-wiki.sh script (line 45) just leads to an import error:

./get-data-wiki.sh en
. . .
Downloaded enwiki-latest-pages-articles.xml.bz2 in /home/ec2-user/power-sgd-home/XLM/data/wiki/bz2/enwiki-latest-pages-articles.xml.bz2

*** Cleaning and tokenizing en Wikipedia dump ... ***

Traceback (most recent call last):
  File "/home/ec2-user/power-sgd-home/XLM/tools/wikiextractor/wikiextractor/WikiExtractor.py", line 66, in <module>
    from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces
ImportError: attempted relative import with no known parent package

mcriggs commented 2 years ago

You might try installing WikiExtractor (https://github.com/attardi/wikiextractor) as a dependency.
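
For context, the ImportError in the traceback is what Python raises when a file that uses relative imports is executed directly as a script instead of as part of its package. Below is a minimal sketch of that behaviour, assuming a throwaway package named demo_pkg (hypothetical, created in a temporary directory as a stand-in for the wikiextractor package layout); it is not taken from the XLM or WikiExtractor code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Run a module that uses a relative import as a script and via -m."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for the wikiextractor package layout.
        pkg = Path(tmp) / "demo_pkg"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "extract.py").write_text("NAME = 'extract'\n")
        (pkg / "main.py").write_text("from .extract import NAME\nprint(NAME)\n")

        # Direct execution: main.py has no parent package,
        # so the relative import fails.
        as_script = subprocess.run(
            [sys.executable, str(pkg / "main.py")],
            capture_output=True, text=True,
        )
        # Module execution from the parent directory:
        # demo_pkg is known as the parent package, so the import resolves.
        as_module = subprocess.run(
            [sys.executable, "-m", "demo_pkg.main"],
            capture_output=True, text=True, cwd=tmp,
        )
        return as_script, as_module


if __name__ == "__main__":
    as_script, as_module = run_both_ways()
    print("script exit code:", as_script.returncode)
    print("script error:", as_script.stderr.strip().splitlines()[-1])
    print("module output:", as_module.stdout.strip())
```

With WikiExtractor installed as a package (for example with pip install wikiextractor), the analogous invocation would be python -m wikiextractor.WikiExtractor followed by the dump file. Whether get-data-wiki.sh then runs without further changes is not confirmed in this thread.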

wayi1 commented 2 years ago

@mcriggs Thanks for the suggestion!

This question was asked almost a year ago, and I don't expect to use this repository again any time soon. I will try this suggestion if I run into the same problem again.