facebookresearch / GENRE

Autoregressive Entity Retrieval

wiki-redirects.txt file and tutorial for preprocessing mGENRE data #70

Open Denescor opened 2 years ago

Denescor commented 2 years ago

Hello

I'm trying to preprocess a wikidump for a custom mGENRE training run, but I don't have access to the {}wiki-redirects.txt file (where {} is the language of the wikidump).

This file is read by preprocess_wikidata.py with the step option set to "redirects" to generate a pickled dictionary, which is then used by process_anchor. The file is looked for in the wikipedia_redirect/target folder.
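(Judging from the snippet of scripts_mgenre/preprocess_wikidata.py quoted further down in this thread, the "redirects" step seems to read tab-separated source/target title pairs and pickle them into a dictionary. A minimal sketch of that logic; the language code, the output file name, and the exact dictionary layout here are assumptions, not the repository's confirmed behavior:)

    import csv
    import pickle
    from urllib.parse import unquote

    lang = "en"  # hypothetical language code

    # Assumed line format: "source_title<TAB>target_title", with
    # URL-encoded titles and underscores instead of spaces.
    redirects = {}
    with open("wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)) as f:
        for row in csv.reader(f, delimiter="\t"):
            source = unquote(row[0]).split("#")[0].replace("_", " ")
            target = unquote(row[1]).split("#")[0].replace("_", " ")
            redirects[source] = target

    # Hypothetical output name; the real step writes whatever
    # pickle process_anchor later loads.
    with open("{}wiki-redirects.pkl".format(lang), "wb") as f:
        pickle.dump(redirects, f)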

I couldn't find any script to generate this redirect file from a wikipedia dump, nor any explanation of the format of the file so I couldn't recreate the script.

Similarly, I haven't found a tutorial explaining how to chain the different mGENRE preprocessing scripts together in order to create the datasets and start training. I think I have understood the role of each script and the order in which to run them, but I wouldn't mind an explanation from the start.

Thank you for your answers.

TommasoPetrolito commented 1 year ago


Sorry for this very late reply; I don't know if this can still be useful, but I noticed that the path opened in scripts_mgenre/preprocess_wikidata.py, namely "wikipedia_redirect/target/{}wiki-redirects.txt".format(lang), is quite similar to the output of this tool:

    Extracted 537441 redirects.
    Saved output: /home/hideki/edu.cmu.lti.wikipedia_redirect/target/jawiki-redirect.txt
    Done in 49 sec.

Make sure the extracted redirects are stored in a tab-separated .txt file.

    $ ls target -lh
    -rw-r--r-- 1 hideki users 250M 2013-01-25 16:48 enwiki-redirect.txt
    -rw-r--r-- 1 hideki users  25M 2013-01-25 16:25 jawiki-redirect.txt

That tool can be found here: https://code.google.com/archive/p/wikipedia-redirect/
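If that is the right tool, each line of those files should presumably be a tab-separated, URL-encoded source/target title pair, something like (made-up example rows):

    Obama	Barack_Obama
    USA	United_States
    Foo_(disambiguation)	Foo

(The mGENRE snippet below only reads the second column, row[1].)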

The mGENRE code then opens and treats it as a tab-separated file as well, so I suspect this might be the tool that was used for it:

    # Excerpt from scripts_mgenre/preprocess_wikidata.py
    # (csv, tqdm and urllib.parse.unquote come from that file's imports).
    with open(
        "wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)
    ) as f:
        for row in tqdm(csv.reader(f, delimiter="\t"), desc=lang):
            # row[1] is the redirect target: URL-decode it, drop any
            # "#section" anchor, and turn underscores back into spaces.
            title = unquote(row[1]).split("#")[0].replace("_", " ")

This is just an assumption anyway; I'm not 100% sure about it.
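One caveat if this is the right tool: its output is named {}wiki-redirect.txt (singular), while the mGENRE snippet above opens {}wiki-redirects.txt (plural), so the files would presumably have to be renamed. A small sanity check along those lines, with hypothetical file names:

    import csv
    import shutil
    from urllib.parse import unquote

    lang = "ja"  # hypothetical language code

    # The tool writes {}wiki-redirect.txt; mGENRE expects {}wiki-redirects.txt.
    shutil.copy(
        "wikipedia_redirect/target/{}wiki-redirect.txt".format(lang),
        "wikipedia_redirect/target/{}wiki-redirects.txt".format(lang),
    )

    # Parse the first few rows exactly as the mGENRE excerpt does,
    # to check that the columns line up.
    with open("wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)) as f:
        for i, row in enumerate(csv.reader(f, delimiter="\t")):
            title = unquote(row[1]).split("#")[0].replace("_", " ")
            print(row[0], "->", title)
            if i >= 4:
                break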