afshinrahimi / mmner

Massively Multilingual Transfer for NER
Apache License 2.0

De-tokenized text #6

Open thangld201 opened 3 weeks ago

thangld201 commented 3 weeks ago

Hi @afshinrahimi, @yuan-li, do you still keep the raw data (the un-tokenized version)? Also, which tokenizer did you use for this dataset?

I need to work with the raw form of the text, and de-tokenizing is non-trivial for many languages (e.g., Korean, Japanese).
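For context, naively re-joining tokens only works for space-delimited languages; something like the sketch below (using sacremoses purely as an illustration, not anything tied to this dataset) is about as far as that approach goes:

```python
# Sketch only: naive detokenization that is fine for space-delimited languages
# but breaks for scripts written without spaces (e.g. Japanese, Korean).
from sacremoses import MosesDetokenizer  # pip install sacremoses

def naive_detokenize(tokens, lang="en"):
    """Re-join whitespace-split tokens, fixing spacing around punctuation."""
    return MosesDetokenizer(lang=lang).detokenize(tokens)

print(naive_detokenize(["Hello", ",", "world", "!"]))  # -> "Hello, world!"
# For ja/ko this still leaves spurious spaces between tokens, so the original
# surface text cannot be recovered reliably this way.
```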

afshinrahimi commented 2 weeks ago

Hi Thang, have you looked at the data? It was a long time ago, but what I remember is that we did not use any special tokenizer. For that reason, we didn't include JA, KO, or other languages for which a simple whitespace tokenizer wouldn't work. The list of languages we used appears as a footnote in the paper:

af, ar, bg, bn, bs, ca, cs, da, de, el, en, es, et, fa, fi, fr, he, hi, hr, hu, id, it, lt, lv, mk, ms, nl, no, pl, pt, ro, ru, sk, sl, sq, sv, ta, tl, tr, uk and vi.

This is the original paper that introduced the WikiAnn dataset: https://aclanthology.org/P17-1178.pdf. The paper links to the original dataset; that link no longer works, but you might be able to find an archived copy via archive.org's Wayback Machine (https://web.archive.org/).
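If it helps, the Wayback Machine has a simple availability API you could query; a rough sketch (the URL below is just a placeholder for whatever link the paper gives):

```python
# Sketch: look up the closest archived snapshot of a dead link via the
# Wayback Machine availability API. The dataset URL is a placeholder --
# substitute the actual link from the paper.
import json
import urllib.parse
import urllib.request

def find_snapshot(url):
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(find_snapshot("http://example.com/wikiann"))  # placeholder URL
```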

Also have a look at the data (not the Hugging Face-hosted version) here: https://www.dropbox.com/s/12h3qqog6q4bjve/panx_dataset.tar. If there is a ja dataset in there, I believe we did not tokenize it.
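In case it's useful, a rough reader for that dump might look like the sketch below. I'm assuming the usual CoNLL-style layout (one token and tag per line, blank lines between sentences) and that tokens may carry a "lang:" prefix; I haven't re-checked the exact format, so treat the details as guesses:

```python
# Sketch: read one WikiAnn/panx-style split, assuming "token<whitespace>tag"
# per line and blank lines between sentences. The "<lang>:" token prefix and
# the example path are assumptions, not a checked spec.

def read_panx(path, lang=None):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line ends the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.rsplit(maxsplit=1)
            if lang and token.startswith(lang + ":"):
                token = token[len(lang) + 1:]  # drop the "<lang>:" prefix
            tokens.append(token)
            tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences

# e.g. sentences = read_panx("panx_dataset/ja/train", lang="ja")  # hypothetical path
```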

Apologies, I can't help more than this.