facebookresearch / MLDoc

A Corpus for Multilingual Document Classification in Eight Languages.

run generate_documents.py #4

Open loretoparisi opened 5 years ago

loretoparisi commented 5 years ago

How do I generate the documents? I'm running `generate_documents.py` like this:

python generate_documents.py --indices-file mldoc-indices/english.dev --output-filename out --rcv-dir /root/MultiLingualReutersCollection/EN/Index_EN-EN 

but it cannot find the referenced XML files, for example /root/MultiLingualReutersCollection/EN/Index_EN-EN/19970308/428849newsML.xml

anassalamah commented 5 years ago

I was able to get it to run with the following command:

python generate_documents.py --indices-file mldoc-indices/french.train.1000 --output-filename rcv2_out/french.train.1000.out --rcv-dir ../../../data/reuters/rcv2/RCV2_Multilingual_Corpus/french/
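If it is unclear which directory to pass as --rcv-dir, a quick check is to search the local corpus copy for one of the files the script reports as missing. The following is a minimal sketch, assuming the script resolves paths such as 19970308/428849newsML.xml relative to --rcv-dir (this is suggested by the error message earlier in the thread, not confirmed here). `find_candidate_roots` is a hypothetical helper and not part of MLDoc.

```python
from pathlib import Path


def find_candidate_roots(search_root, relative_path):
    """Return every directory D under search_root where D / relative_path is a file.

    Hypothetical helper: it only inspects the local file system and makes no
    assumption about the format of the MLDoc index files.
    """
    rel = Path(relative_path)
    roots = []
    # Look for files with the same name anywhere below search_root.
    for hit in Path(search_root).rglob(rel.name):
        # Keep only hits whose trailing path components match relative_path.
        if hit.is_file() and hit.parts[-len(rel.parts):] == rel.parts:
            # Strip the matched components to recover the base directory.
            roots.append(hit.parents[len(rel.parts) - 1])
    return sorted(roots)
```

Calling `find_candidate_roots("/root/MultiLingualReutersCollection", "19970308/428849newsML.xml")` would list the directories that could serve as --rcv-dir for that file. An empty list suggests that the local copy does not contain the document at all, or that it uses a different layout or file naming scheme.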

simonefrancia commented 5 years ago

@anassalamah sorry, where did you find an open copy of the RCV2 dataset? Thank you.

simonefrancia commented 5 years ago

Found it here: https://gitlab.mi.hdm-stuttgart.de/griesshaber/nlp-corpora/commit/c38531e03c1de9f871f097fc21223748982dcb18