liuh236 closed this issue 1 year ago
Hi! Thank you for your message. We've added the Wikipedia article titles and image URLs to our train and test sets in the data directory. We will upload the preprocessing scripts soon.
thanks a lot!
btw, which folder will the preprocessing scripts be placed in? ^^
The utils folder
Hi, we will upload it tomorrow.
Hi,
I get a "mismatched tag" error when using Python to extract titles from the aligned multilingual Wikipedia articles (https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/). For example:
tree = ET.parse("wikicomp-2014_deen.xml")
xml.parsers.expat.ExpatError: mismatched tag: line 58, column 31
How do you fix it?
To align titles for multilingual articles, I extract the titles from the cross-language links in the monolingual corpus, e.g. dewiki-20180920-corpus.xml. The monolingual corpora can be found at https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/
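For anyone following along, here is a rough sketch of how the approach described above might look. It is not the script from this repository. The element and attribute names (`article`, `crosslanguage_link`, `name`, `language`) are assumptions about the corpus layout, so check them against the actual file and adjust the parameters. The function name `extract_aligned_titles` and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def extract_aligned_titles(source, target_lang="en",
                           article_tag="article",
                           link_tag="crosslanguage_link"):
    """Yield (title, target_title) pairs for articles linked to target_lang.

    The tag and attribute names are assumptions about the corpus schema.
    """
    # iterparse streams the file, so a multi-gigabyte corpus is not
    # loaded into memory all at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != article_tag:
            continue
        title = elem.get("name")
        for link in elem.iter(link_tag):
            if link.get("language") == target_lang:
                yield title, link.get("name")
                break
        # Drop the article's children once it has been processed.
        elem.clear()


# Tiny made-up sample standing in for a file such as
# dewiki-20180920-corpus.xml.
SAMPLE = """<wikipedia>
<article name="Berlin">
<crosslanguage_link language="en" name="Berlin"/>
<crosslanguage_link language="fr" name="Berlin"/>
</article>
<article name="München">
<crosslanguage_link language="en" name="Munich"/>
</article>
<article name="Nur Deutsch">
<crosslanguage_link language="fr" name="Seulement allemand"/>
</article>
</wikipedia>""".encode("utf-8")

pairs = list(extract_aligned_titles(io.BytesIO(SAMPLE)))
for de_title, en_title in pairs:
    print(de_title, "->", en_title)
```

For a real corpus, pass the file path instead of the `io.BytesIO` object. Articles without a link to the target language are skipped.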
Thanks for your response. I think something is wrong with the (outdated) 2014 version of the corpus, while the 2018 version is OK. I will give it a try.
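In case it helps others check whether a given corpus file is well-formed before running the extraction, here is a small sketch that streams through a file and reports where the first XML error occurs. The function name `find_parse_error` and the broken sample are made up for illustration. In recent Python versions, ElementTree reports these failures as `xml.etree.ElementTree.ParseError`, which carries a `position` attribute.

```python
import io
import xml.etree.ElementTree as ET


def find_parse_error(source):
    """Return (line, column, message) for the first XML error, or None."""
    try:
        # Stream through the document, discarding elements as we go.
        for _event, elem in ET.iterparse(source, events=("end",)):
            elem.clear()
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Made-up broken document: <b> is closed after its parent <a>.
BROKEN = b"<root>\n<a><b></a></b>\n</root>"
WELL_FORMED = b"<root>\n<a><b></b></a>\n</root>"

broken_result = find_parse_error(io.BytesIO(BROKEN))
ok_result = find_parse_error(io.BytesIO(WELL_FORMED))
print(broken_result)
print(ok_result)
```

This only locates the problem. It does not repair the file, so switching to a well-formed corpus version remains the simpler route.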
Hello! Thanks for the great work!
I see that the data folder is missing. If possible, could you release the dataset or the preprocessing script?
Thanks all.