Open ToddMorrill opened 1 year ago
I have similar questions.
As discussed in #6, I tried to prepare en_wiki.txt with the following steps:
1. Run wikiextractor enwiki-latest-pages-articles.xml.bz2 --json --output processed (example of output file: wiki_sample.txt)
2. Extract the "text" field from each JSON item and save it into en_wiki.txt (example of output file: en_wiki.txt)
But there might be some steps I missed. Could you please provide detailed instructions on how to generate the single Wikipedia text file?
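For what it's worth, here is a minimal sketch of how step 2 might look, assuming wikiextractor's --json mode, where each line of each output file is a JSON object with a "text" field. The paths and the helper name are illustrative, not from this repo:

```python
import glob
import json

def collect_wiki_text(input_glob: str, output_path: str) -> None:
    """Concatenate the "text" field of every JSON line into one file."""
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(input_glob)):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    article = json.loads(line)
                    text = article.get("text", "").strip()
                    if text:  # skip empty articles (e.g. redirects)
                        out.write(text + "\n")

# Hypothetical layout; wikiextractor writes shards like processed/AA/wiki_00
collect_wiki_text("processed/*/wiki_*", "en_wiki.txt")
```

Whether this matches the authors' en_XX.txt still depends on the open questions below (section headers, which namespaces, one article vs. one passage per line), so treat it only as a starting point.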
Hi @ToddMorrill, I see that in the file preprocess.py the code tokenizes the input line by line, but I don't know what each line contains. Does each line hold one passage from a Wikipedia page? If you know what it contains, please let me know.
I'm looking at tokenization_script.sh and I see that you're loading en_XX.txt, which presumably contains all of Wikipedia's text in a single file. My question is: what text does this include? I'd imagine it includes paragraphs from pages, but do you include section headers? Do you include Wikipedia edit discussion pages, or just content pages? I can certainly prepare a similar file for the December 20th, 2018 Wikipedia dump (see my code here), but I want to follow your data preparation as closely as possible. Can you share the steps you used?
Please let me know if you have any questions for me. Thank you for sharing this great repo. I think this project holds a ton of promise.