RuntimeError: you must first build vocabulary before training the model

averyhiebert commented 3 years ago

When I try to train the model on the NTCIR-12 data, following the instructions in the readme, the call to gensim's FastText (line 30 in tangent_cft_model.py) throws "RuntimeError: you must first build vocabulary before training the model".

I'm using gensim 3.4.0 as instructed in requirements.txt. Is the version information incorrect, maybe? Or is there an undocumented step that I'm supposed to perform first? Thanks in advance for any help.

norbertstrzelecki commented 3 years ago

Hi @averyhiebert, when you use the dataset mentioned in the readme, try unpacking the wpmath00000xx.tar.bz2 files into a separate folder, then create another folder called Articles inside that folder and put the unpacked files there.

Then change "/Articles" to "Articles" on line 30 of DataReader/wiki_data_reader.py:

            temp_address = temp_address + "Articles"

Should work as a quick fix :)
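
The thread does not show how temp_address is built, so the following is only a sketch of why the leading slash can matter and of a more forgiving alternative. The directory names and the articles_dir helper are hypothetical and are not part of TangentCFT.

```python
import os


def articles_dir(root):
    """Return the Articles sub-folder of `root` (hypothetical helper).

    os.path.join inserts a separator only when one is missing, so the result
    is the same whether or not `root` ends with a slash.
    """
    return os.path.join(root, "Articles")


# Hypothetical dataset folders, with and without a trailing slash.
with_slash = "data/wpmath0000001/"
without_slash = "data/wpmath0000001"

# Plain string concatenation depends on how the prefix ends:
print(with_slash + "/Articles")     # data/wpmath0000001//Articles
print(with_slash + "Articles")      # data/wpmath0000001/Articles
print(without_slash + "Articles")   # data/wpmath0000001Articles

# os.path.join gives an equivalent path in both cases:
print(articles_dir(with_slash))
print(articles_dir(without_slash))
```

If the reader still finds no files after the change, printing the final path and checking it with os.path.isdir is a cheap way to see where it is actually looking.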

MingchangLi commented 2 years ago

An important step before running the code is to unzip the NTCIR-12 dataset and extract all the tarballs in its sub-directories by running this in each of them:

            for x in *.tar.bz2; do echo "$x"; tar xjf "$x"; done
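
For anyone without a POSIX shell, the same extraction can be sketched in Python with the standard tarfile module. This is a rough equivalent of the loop above, not something shipped with the repository; the extract_tarballs name is made up for illustration.

```python
import glob
import os
import tarfile


def extract_tarballs(directory):
    """Extract every *.tar.bz2 archive found directly in `directory`, in place.

    Returns the list of archives that were processed.
    """
    processed = []
    for archive in sorted(glob.glob(os.path.join(directory, "*.tar.bz2"))):
        print(archive)  # same role as `echo "$x"` in the shell loop
        with tarfile.open(archive, "r:bz2") as tar:
            # extractall trusts the member paths, so only use it on
            # archives from a source you trust.
            tar.extractall(path=directory)
        processed.append(archive)
    return processed
```

Calling extract_tarballs(".") from inside a sub-directory that holds the wpmath*.tar.bz2 files mirrors the shell loop; an empty return value means no archives were found there.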

SmallBall8 commented 2 years ago

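Hi @averyhiebert, when you use the dataset mentioned in the readme, try unpacking the wpmath00000xx.tar.bz2 files into a separate folder, then create another folder called Articles inside that folder and put the unpacked files there.

Then change "/Articles" to "Articles" on line 30 of DataReader/wiki_data_reader.py:

            temp_address = temp_address + "Articles"

Should work as a quick fix :)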

How should the decompressed files be stored? Sorry, I didn't understand what you meant.

SmallBall8 commented 2 years ago

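When I try to train the model on the NTCIR-12 data, following the instructions in the readme, the call to gensim's FastText (line 30 in tangent_cft_model.py) throws "RuntimeError: you must first build vocabulary before training the model".

I'm using gensim 3.4.0 as instructed in requirements.txt. Is the version information incorrect, maybe? Or is there an undocumented step that I'm supposed to perform first? Thanks in advance for any help.

Hello, may I ask how you solved the problem in the end? I ran into the same problem, but the author's solution didn't fix it for me. I wonder where I went wrong.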

aiainui commented 2 years ago

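When I try to train the model on the NTCIR-12 data, following the instructions in the readme, the call to gensim's FastText (line 30 in tangent_cft_model.py) throws "RuntimeError: you must first build vocabulary before training the model". I'm using gensim 3.4.0 as instructed in requirements.txt. Is the version information incorrect, maybe? Or is there an undocumented step that I'm supposed to perform first? Thanks in advance for any help.

Hello, may I ask how you solved the problem in the end? I ran into the same problem, but the author's solution didn't fix it for me. I wonder where I went wrong.

The command works if your directory looks like this: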

NTCIR-12_MathIR_Wikipedia_Corpus/
└── MathTagArticles
    └── wpmath0000001
        ├── 00_ERRORS_MATH
        ├── 00_Log
        ├── 00stats.csv
        ├── 13(number).html
        ├── 1995.html
        ├── 1_(number).html
        ├── 253_Mathilde.html
        ├── 2D_computergraphics.html
        ├── 2(number).html
        ├── 3-sphere.html
        ├── 61_Cygni.html
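
One plausible cause of the error, consistent with the replies in this thread, is that the reader finds no formula files, so gensim is handed an empty corpus and has no vocabulary to build. A quick sanity check is to count the .html files under each extracted folder before training. This is only a sketch that assumes the layout shown above; count_html_files is a hypothetical helper, not part of TangentCFT, and a non-zero count does not guarantee that training will succeed.

```python
import os


def count_html_files(corpus_root):
    """Return {sub-folder name: number of .html files} under MathTagArticles.

    An empty dict means MathTagArticles is missing or has no sub-folders.
    """
    articles_root = os.path.join(corpus_root, "MathTagArticles")
    counts = {}
    if not os.path.isdir(articles_root):
        return counts
    for name in sorted(os.listdir(articles_root)):
        sub = os.path.join(articles_root, name)
        if os.path.isdir(sub):
            counts[name] = sum(
                1 for entry in os.listdir(sub) if entry.endswith(".html")
            )
    return counts


if __name__ == "__main__":
    # Path taken from the directory listing above; adjust to your setup.
    result = count_html_files("NTCIR-12_MathIR_Wikipedia_Corpus")
    if not result:
        print("No extracted folders found under MathTagArticles")
    for folder, n in result.items():
        print(folder, n)
```

Folders that report zero files most likely still hold only the unextracted tarballs.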

MeganJS commented 4 months ago

@aiainui May I ask how you made this work? My file structure looks exactly like this, but I'm still running into the issue. Thanks!

BehroozMansouri / TangentCFT, issue #5