Open averyhiebert opened 3 years ago
Hi @averyhiebert, when you use the dataset mentioned in the readme, try unpacking the wpmath00000xx.tar.bz2 files into a separate folder (or folders), then create another folder called Articles inside that folder and put the unpacked files there.
Then change /Articles to Articles in DataReader/wiki_data_reader.py, line 30:
temp_address = temp_address + "Articles"
That should work as a quick fix :)
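To illustrate why the trailing separator matters for that line, here is a minimal sketch. The function names and the "corpus" value are made up for the example, and I am assuming nothing about what temp_address actually contains in the repository beyond the quoted line. Plain string concatenation only yields a valid path when temp_address already ends with a separator, whereas os.path.join adds one when needed.

```python
import os


def concat_articles(temp_address):
    # Mirrors the quoted line: plain string concatenation, no separator added.
    return temp_address + "Articles"


def join_articles(temp_address):
    # Hypothetical alternative: os.path.join inserts a separator only if needed.
    return os.path.join(temp_address, "Articles")


print(concat_articles("corpus/"))  # corpus/Articles
print(concat_articles("corpus"))   # corpusArticles (no separator, likely not what you want)
print(join_articles("corpus"))     # corpus/Articles on POSIX systems
```

So if the quick fix does not work for you, it may be worth checking whether the dataset path you pass in ends with a slash.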
An important step before running the code is to unzip the NTCIR-12 dataset and extract all tarballs in the sub-directories with:
for x in *.tar.bz2; do echo $x; tar xjf $x; done
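If you would rather do the extraction from Python, something like the following should be roughly equivalent to the shell loop. It is only a sketch: extract_archives is a hypothetical helper, not part of the repository, and the target folder is whatever layout you settle on (for example an Articles folder as suggested above).

```python
import tarfile
from pathlib import Path


def extract_archives(archive_dir, target_dir):
    """Extract every *.tar.bz2 archive in archive_dir into target_dir.

    Returns the names of the extracted archives in sorted order.
    Only use this on archives you trust, since extractall writes
    whatever paths the archive contains.
    """
    archive_dir = Path(archive_dir)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    extracted = []
    for archive in sorted(archive_dir.glob("*.tar.bz2")):
        # "r:bz2" opens a bzip2-compressed tarball, like `tar xjf`.
        with tarfile.open(archive, "r:bz2") as tar:
            tar.extractall(path=target_dir)
        extracted.append(archive.name)
    return extracted
```

Calling extract_archives(some_folder, some_folder + "/Articles") would then unpack each wpmath archive under the Articles folder; both paths here are placeholders for your own.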
How should I store the decompressed files? I'm sorry, I didn't understand what you meant.
Hello, may I ask how you solved the problem in the end? I ran into the same problem, but I still couldn't solve it using the author's solution. I wonder where I made a mistake.
The command works if your directory looks like this:
NTCIR-12_MathIR_Wikipedia_Corpus/
└── MathTagArticles
    └── wpmath0000001
        ├── 00_ERRORS_MATH
        ├── 00_Log
        ├── 00stats.csv
        ├── 13(number).html
        ├── 1995.html
        ├── 1_(number).html
        ├── 253_Mathilde.html
        ├── 2D_computergraphics.html
        ├── 2(number).html
        ├── 3-sphere.html
        ├── 61_Cygni.html
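A quick way to check whether your layout matches is to count the .html files in each wpmath folder. This is only a diagnostic sketch: count_html_files is a hypothetical helper, and it assumes the MathTagArticles/wpmath* layout described in this thread.

```python
from pathlib import Path


def count_html_files(corpus_root):
    """Map each wpmath* folder under corpus_root/MathTagArticles to its .html count."""
    articles_dir = Path(corpus_root) / "MathTagArticles"
    if not articles_dir.is_dir():
        # Nothing to count if the expected folder is missing.
        return {}

    counts = {}
    for sub in sorted(articles_dir.glob("wpmath*")):
        # Skip the .tar.bz2 archives themselves; only extracted folders count.
        if sub.is_dir():
            counts[sub.name] = sum(1 for _ in sub.glob("*.html"))
    return counts
```

An empty result, or folders with a count of zero, would suggest the archives were not extracted where the reader expects them.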
@aiainui May I ask how you made this work? My file structure looks exactly like this, but I'm still running into the issue. Thanks!
When I try to train the model on the NTCIR-12 data, following the instructions in the readme, the call to gensim's FastText (line 30 in tangent_cft_model.py) throws "RuntimeError: you must first build vocabulary before training the model". I'm using gensim 3.4.0 as instructed in requirements.txt. Is the version information incorrect, maybe? Or is there an undocumented step that I'm supposed to perform first? Thanks in advance for any help.
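As far as I can tell, gensim raises this error when training starts with an empty vocabulary, which would be consistent with the data reader finding no files to encode. A small guard like the one below, run on the training sequences before they reach FastText, would surface that case with a clearer message. It is a sketch only: require_training_data is a hypothetical helper and not part of the repository or of gensim.

```python
def require_training_data(sequences):
    """Return the non-empty sequences as lists, or raise if there are none."""
    non_empty = [list(seq) for seq in sequences if len(seq) > 0]
    if not non_empty:
        raise ValueError(
            "No training sequences were found; check that the dataset "
            "was extracted and that the dataset path points at it."
        )
    return non_empty
```

If this raises, the problem is likely in the dataset path or folder layout rather than in the gensim version.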