ezosa / M3L-topic-model

Multimodal and multilingual topic model with pretrained embeddings

MIT License

Missing data #2

Closed liuh236 closed 1 year ago

liuh236 commented 1 year ago

Hello! Thanks for your great work!

I found that the data folder is missing. If possible, could you release the dataset or the preprocessing script?

Thanks all.

ezosa commented 1 year ago

Hi! Thank you for your message. We've added the Wikipedia article titles and image URLs for our train and test sets to the data directory. We will upload the preprocessing scripts soon.

liuh236 commented 1 year ago

thanks a lot!

liuh236 commented 1 year ago

btw, which folder will the preprocessing scripts be placed in? ^^

ezosa commented 1 year ago

The utils folder

liuh236 commented 1 year ago

Hi! I am curious about how to extract the target text from the original Wikipedia Comparable Corpora. When will the preprocessing script be released? thanks again!

ezosa commented 1 year ago

Hi, we will upload it tomorrow.

liuh236 commented 1 year ago

Hi,

I get a "mismatched tag" error when using Python to extract titles for aligned multilingual Wikipedia articles (https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/). For example:

tree = ET.parse("wikicomp-2014_deen.xml")
xml.parsers.expat.ExpatError: mismatched tag: line 58, column 31

How do you fix it?

ezosa commented 1 year ago

To align titles for multilingual articles, I extract the titles from the cross-language links in the monolingual corpus, e.g. dewiki-20180920-corpus.xml. The monolingual corpora can be found at https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/
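
For anyone following along, here is a minimal sketch of that extraction step. It is not the repository's preprocessing script. It assumes (this thread does not confirm the schema) that each article is an `<article name="...">` element and that cross-language links are `<crosslanguage_link language="xx">` children carrying the target title either in a `name` attribute or as element text. The function name `aligned_titles` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def aligned_titles(source, target_lang="en"):
    """Yield (title, target_title) pairs from a monolingual corpus.

    ASSUMED schema (not confirmed in this thread): <article name="...">
    elements with <crosslanguage_link language="xx" .../> children.
    """
    # iterparse streams the file, so large dumps are not loaded at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        title = elem.get("name")
        for link in elem.iter("crosslanguage_link"):
            if link.get("language") != target_lang:
                continue
            # Accept the target title as an attribute or as element text.
            target = link.get("name") or (link.text or "").strip()
            if title and target:
                yield title, target
            break
        elem.clear()  # drop the processed article's children to save memory


# Tiny made-up sample standing in for a file such as dewiki-20180920-corpus.xml.
sample = b"""<wikipedia>
<article name="Hund">
<content>...</content>
<crosslanguage_link language="en" name="Dog"/>
<crosslanguage_link language="fr" name="Chien"/>
</article>
<article name="Ohne Link"><content>...</content></article>
</wikipedia>"""

pairs = list(aligned_titles(io.BytesIO(sample), target_lang="en"))
print(pairs)
```

Articles without a link in the target language are skipped, so only aligned pairs come out. If the real element or attribute names differ, only the two tag names and the `get` calls need to change.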

liuh236 commented 1 year ago

Thanks for your response. I think something is wrong with the outdated 2014 version of the corpus, while the 2018 version of the monolingual corpus is OK. I will give it a try.
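
A rough way to check whether a given dump is well-formed before building on it is to stream through it and report the first parse error. The helper below is a sketch, and `find_xml_error` is a hypothetical name. It relies only on the standard library: `xml.etree.ElementTree` raises `ParseError` on malformed input, and that exception carries a `position` tuple of (line, column).

```python
import io
import xml.etree.ElementTree as ET


def find_xml_error(source):
    """Return None if `source` parses cleanly, else (line, column, message).

    `source` may be a file path or a binary file object.
    """
    try:
        # Stream through the document, discarding content as we go.
        for _event, elem in ET.iterparse(source, events=("end",)):
            elem.clear()
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Made-up inputs: one well-formed, one with an unclosed <b> element.
print(find_xml_error(io.BytesIO(b"<a><b/></a>")))
print(find_xml_error(io.BytesIO(b"<a><b></a>")))
```

Calling it with a path such as "wikicomp-2014_deen.xml" would point at the first offending line, which can then be inspected by hand. It only locates the problem and does not repair the file.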