facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Get data generates empty files #319

Open srcarroll opened 3 years ago

srcarroll commented 3 years ago

I'm trying to go through the steps in your readme and am getting stuck on the fastBPE part of the "Preparing the data" section. When I try to run

$FASTBPE learnbpe 30000 data/wiki/txt/en.train > $OUTPATH/codes

I get the following output

Loading vocabulary from data/wiki/txt/en.train ... Read 0 words (0 unique) from text file. Segmentation fault (core dumped)

Checking en.train shows that it's empty. Any idea what's causing this?
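
For reference, fastBPE segfaults when its input file is empty, so a quick sanity check before learning the codes makes the real failure obvious. A minimal sketch, assuming the paths and variables from the readme commands above (adjust $FASTBPE and $OUTPATH to your setup):

```bash
# Sanity-check the training file before learning BPE codes.
TRAIN=data/wiki/txt/en.train

if [ ! -s "$TRAIN" ]; then
    echo "Error: $TRAIN is missing or empty -- re-run get-data-wiki.sh and check its output" >&2
    exit 1
fi

wc -l "$TRAIN"   # should report a non-zero line count

$FASTBPE learnbpe 30000 "$TRAIN" > "$OUTPATH/codes"
```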

srcarroll commented 3 years ago

You misunderstand my question. I'm asking why the script generates an empty file. It turns out that the get-data-wiki.sh script is riddled with problems and doesn't work as intended. I've already moved on.

On Mon, Oct 26, 2020 at 6:25 AM arjunkoneru notifications@github.com wrote:

en.train is your training data for English. How can you learn the codes if it is empty?


saikoneru commented 3 years ago

Maybe you can use the Flores monolingual data and remove the part where it downloads Nepali and Sinhala.

hxzd5568 commented 3 years ago

I met the same problem. It is probably caused by the wikiextractor project: when the script reaches the clean-and-tokenize step, it fails with an error about a missing definition. I finally solved it by following https://github.com/attardi/wikiextractor/issues/222, and don't forget to modify the corresponding command in get-data-wiki.sh.
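
For anyone hitting the same error: newer wikiextractor releases moved to a Python package layout, so the old flat WikiExtractor.py invocation can fail with missing-definition errors. A hedged sketch of the two usual workarounds; the checkout path, version tag, dump filename, and exact flags in get-data-wiki.sh are assumptions, so check the script and the wikiextractor repo before editing anything:

```bash
# Option 1: pin wikiextractor to an older release that still ships a
# top-level WikiExtractor.py (tag name is hypothetical -- pick one that
# predates the package restructuring discussed in wikiextractor issue #222).
cd tools/wikiextractor            # path assumed; use wherever XLM cloned it
git checkout v2.75                # hypothetical tag, verify it exists

# Option 2: keep the current wikiextractor and invoke it as a module instead,
# then update the corresponding line in get-data-wiki.sh to match.
pip install wikiextractor
python -m wikiextractor.WikiExtractor \
    --processes 8 -q -o extracted \
    enwiki-latest-pages-articles.xml.bz2   # dump filename assumed
```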