facebookresearch / contriever

Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning

Details for Wikipedia data formatting #12

Open ToddMorrill opened 1 year ago

ToddMorrill commented 1 year ago

I'm looking at tokenization_script.sh and I see that you're loading en_XX.txt, which presumably contains all of Wikipedia's text in a single file. My question is: what text does this include? I'd imagine it includes paragraphs from articles, but do you also include section headers? Do you include Wikipedia edit discussion pages or just content pages? I can certainly prepare a similar file for the December 20th, 2018 Wikipedia dump (see my code here), but I want to follow your data preparation as closely as possible. Can you share

  1. the data itself
  2. the code you used to generate the single Wikipedia text file
  3. OR some additional details about how you're generating the single Wikipedia text file?

Please let me know if you have any questions for me. Thank you for sharing this great repo. I think this project holds a ton of promise.

heyLinsir commented 1 year ago

I have similar questions.

As discussed in #6, I tried to prepare en_wiki.txt with the following steps:

  1. wikiextractor enwiki-latest-pages-articles.xml.bz2 --json --output processed (example of output file: wiki_sample.txt)
  2. Extract the value of the `text` field from each JSON item and save it to en_wiki.txt (example of output file: en_wiki.txt); a rough sketch of this step is shown below.
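
For reference, here is roughly what I did for step 2. This is only a minimal sketch: it assumes wikiextractor's `--json` output (one JSON object per line under `processed/`, each with a `text` field) and simply concatenates the article texts, so it may not match the authors' actual preprocessing.

```python
import glob
import json

# Collect all files produced by wikiextractor, e.g. processed/AA/wiki_00, processed/AB/wiki_01, ...
input_files = sorted(glob.glob("processed/**/wiki_*", recursive=True))

with open("en_wiki.txt", "w", encoding="utf-8") as out:
    for path in input_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Each line is one JSON object with fields like "id", "title", "text".
                article = json.loads(line)
                text = article.get("text", "").strip()
                if text:
                    out.write(text + "\n")
```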

But there might be some steps I missed. Could you please provide detailed instructions on how to generate the single Wikipedia text file?

hieudx149 commented 1 year ago

Hi @ToddMorrill, I see that in preprocess.py the code tokenizes the input line by line, but I don't know what each line contains. Does each line correspond to one passage from a Wikipedia page? If you know what it contains, please let me know.