HSLCY / VCWE

VCWE: Visual Character-Enhanced Word Embeddings (NAACL 2019)
https://www.aclweb.org/anthology/N19-1277
MIT License

Can you provide the whole wiki corpus? #2

Open onion1003 opened 4 years ago

onion1003 commented 4 years ago

The link https://dumps.wikimedia.org/zhwiki/20180520/ given in the paper is no longer valid. Could you provide the whole wiki corpus, or alternatively the script that converts characters to images? Thank you!

sunny678 commented 4 years ago

Maybe you could use the latest wiki corpus: https://dumps.wikimedia.org/zhwiki/20190601/. Images can be obtained from a Chinese character image generation website, e.g. http://www.diyiziti.com. The processed Chinese character image data is at https://github.com/HSLCY/VCWE/tree/master/data/char_img_sub_mean.npy and the char2ix file is at https://github.com/HSLCY/VCWE/tree/master/data/char2ix.npz
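For anyone picking up these two files, here is a minimal sketch for inspecting them with NumPy after downloading them locally (for example by cloning the repository). The helper names `describe_npy` and `describe_npz` are hypothetical, and the internal layout of both files (array shapes, key names, whether `char2ix.npz` holds a pickled Python dict) is not stated in this thread, so the code only reports what it finds rather than assuming a structure.

```python
import numpy as np


def describe_npy(path):
    """Return (shape, dtype) of the single array stored in a .npy file."""
    arr = np.load(path)
    return arr.shape, arr.dtype


def describe_npz(path):
    """Return {key: shape} for every array stored in a .npz archive.

    allow_pickle=True is an assumption: it is only needed if the archive
    stores Python objects (e.g. a dict), and should only be enabled for
    files you trust.
    """
    with np.load(path, allow_pickle=True) as archive:
        return {key: archive[key].shape for key in archive.files}
```

Calling `describe_npy("data/char_img_sub_mean.npy")` and `describe_npz("data/char2ix.npz")` on a local checkout (paths assumed from the URLs above) should show how many character images there are and which keys the index file exposes.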

onion1003 commented 4 years ago

Yeah, I can use the latest wiki corpus, but there are some pre-processing tricks in the image generation. Only words with a frequency greater than 100 were kept in vocabulary.txt, so if I use the latest corpus, I may need to generate some new character images using the same pre-processing, or I can't reproduce the results in the paper. Thanks anyway. 😁
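To make the concern concrete, here is a rough sketch of the frequency cut-off and of finding which characters would need new images when the corpus changes. It is not the authors' script: the function names, the pre-tokenized input format, and the strict `> 100` boundary are assumptions based only on the description above.

```python
from collections import Counter


def build_vocab(tokenized_lines, min_count=100):
    """Keep words whose corpus frequency is strictly greater than min_count.

    tokenized_lines is an iterable of word lists (segmentation is assumed
    to have been done already).
    """
    counts = Counter(tok for line in tokenized_lines for tok in line)
    return {word: c for word, c in counts.items() if c > min_count}


def chars_in_vocab(vocab):
    """Return the sorted set of characters used by the vocabulary words."""
    return sorted({ch for word in vocab for ch in word})


def missing_chars(vocab, known_chars):
    """Return characters in the vocabulary with no existing image.

    known_chars stands in for the characters already covered by char2ix
    (hypothetical: any container supporting `in`).
    """
    return [ch for ch in chars_in_vocab(vocab) if ch not in known_chars]
```

If `missing_chars` comes back empty for the newer dump, the existing image data would already cover the new vocabulary; otherwise those characters are the ones that would need images generated with the original pre-processing.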