Denescor opened this issue 2 years ago
Hello
I'm trying to preprocess a Wikipedia dump for a custom mGENRE training run, but I don't have access to the {}wiki-redirects.txt file (with {} being the language of the dump).
This file is read by preprocess_wikidata with the step option set to "redirects" to generate a pkl dictionary, which is then used in process_anchor. It is looked for in the wikipedia_redirect/target folder.
I couldn't find any script to generate this redirect file from a Wikipedia dump, nor any explanation of its format, so I couldn't recreate it myself.
Similarly, I haven't found a tutorial explaining how to chain the different mGENRE preprocessing scripts in order to build the datasets and start training. I think I understand the role of each script and the order in which to run them, but I wouldn't mind a walkthrough from the start.
Thank you for your answers.
Sorry for this very late reply; I don't know if this is still useful, but I noticed that the path mentioned here (scripts_mgenre/preprocess_wikidata.py):
with open("wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)) as f:
is quite similar to
Extracted 537441 redirects.
Saved output: /home/hideki/edu.cmu.lti.wikipedia_redirect/target/jawiki-redirect.txt
Done in 49 sec.

Make sure the extracted redirects are stored in a tab-separated .txt file.

$ ls target -lh
-rw-r--r-- 1 hideki users 250M 2013-01-25 16:48 enwiki-redirect.txt
-rw-r--r-- 1 hideki users  25M 2013-01-25 16:25 jawiki-redirect.txt
That tool can be found here: https://code.google.com/archive/p/wikipedia-redirect/
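Going by that, each line presumably holds a redirect source and its target separated by a tab. A purely hypothetical excerpt of what enwiki-redirects.txt might look like (the exact encoding, spaces vs. underscores and URL-escaping, is my guess; the mGENRE reader below unquotes and replaces underscores, so either form should pass through):

UK	United_Kingdom
Obama	Barack_Obama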
The mGENRE code then opens and treats it as a tab-separated file as well, so I suspect this might be the tool used to produce it:
import csv
from urllib.parse import unquote
from tqdm import tqdm

with open(
    "wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)
) as f:
    for row in tqdm(csv.reader(f, delimiter="\t"), desc=lang):
        # Column 1 is the redirect target: strip any #section anchor, undo URL-encoding, turn underscores into spaces.
        title = unquote(row[1]).split("#")[0].replace("_", " ")
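Reading that loop, column 0 is presumably the redirect source and column 1 the (possibly URL-encoded) target, so the pkl dictionary that the "redirects" step builds is presumably just a source-to-target mapping. A minimal sketch under that assumption; the output filename and dictionary layout here are my guesses, not the repo's actual code:

import csv
import pickle
from urllib.parse import unquote

lang = "en"  # hypothetical example language

# Assumption: map each redirect source title to its resolved target title.
redirects = {}
with open("wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)) as f:
    for row in csv.reader(f, delimiter="\t"):
        source = unquote(row[0]).replace("_", " ")
        target = unquote(row[1]).split("#")[0].replace("_", " ")
        redirects[source] = target

# Hypothetical output name; the real step's filename may differ.
with open("{}wiki-redirects.pkl".format(lang), "wb") as f:
    pickle.dump(redirects, f)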
This is just an assumption anyway; I'm not 100% sure about it.
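If that Java tool is hard to run, one way to regenerate a compatible file directly from a Wikipedia XML dump: MediaWiki exports mark redirect pages with a <redirect title="..."/> element inside each <page>. A rough sketch along those lines, assuming the tab-separated layout above (paths are placeholders, and the output encoding is my assumption):

import bz2
import xml.etree.ElementTree as ET

# Hypothetical paths; adjust to your dump and language.
dump_path = "enwiki-latest-pages-articles.xml.bz2"
out_path = "wikipedia_redirect/target/enwiki-redirects.txt"

with bz2.open(dump_path, "rb") as dump, \
        open(out_path, "w", encoding="utf-8") as out:
    title = None
    for event, elem in ET.iterparse(dump, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace
        if tag == "title":
            title = elem.text
        elif tag == "redirect":
            # Only redirect pages carry a <redirect title="..."/> element.
            out.write("{}\t{}\n".format(title, elem.get("title")))
        elif tag == "page":
            elem.clear()  # keep memory bounded on large dumps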