Status: Closed (closed by iiekpig 8 years ago)
It's a good question. It looks like it stopped around 3.1M docs, which isn't enough for English Wikipedia. You'll likely get more than 15 or 20M docs with all the content. Did you download the archive locally, or are you streaming it?
I downloaded the archive locally. It is a Chinese Wikipedia archive (https://dumps.wikimedia.org/zhwiki/20160407/zhwiki-20160407-pages-articles.xml.bz2) of about 1.2G. After I uploaded it to ES with stream2es, the index was about 3.39G, but the decompressed file is about 5.54G on my local computer. The log shows that the program just stops at a line like this:
2016-04-13T10:19:06.541+0800 DEBUG 03:09.432 579.5d/s 2417.3K/s 109767 1192 3164995 0
(This is just an example, not the actual log. The actual log looks similar, but it takes a very long time to produce.)
After that the program makes no further progress, but it doesn't exit either, so I can't tell whether all the data was uploaded successfully.
My command is: ./stream2es wiki --source /home/zhwiki-20160407-pages-articles.xml.bz2 --target http://192.168.120.90:9200/encyclopedia_wikizh --log debug
Is there anything in the log that tells me whether all the data was uploaded successfully?
Hm, I'm not sure. I've only tested on English, unfortunately. It's very possible that there's a bug that makes it hang on a particular page. You could also try writing a simple parser to count how many XML docs are in the archive. It would be interesting to know.
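A minimal sketch of such a counter, assuming each document in the dump is a <page> element and that the file is bz2-compressed as in the link above. The function name count_pages is made up for illustration, and whether stream2es indexes every <page> (for example redirects) is not something this thread confirms, so treat the result as a rough reference number rather than an exact target.

```python
import bz2
import xml.etree.ElementTree as ET


def count_pages(dump_path):
    """Count <page> elements in a bz2-compressed MediaWiki XML dump."""
    pages = 0
    with bz2.open(dump_path, "rb") as stream:
        events = ET.iterparse(stream, events=("start", "end"))
        _, root = next(events)  # the first event is the start of the root element
        for event, elem in events:
            # Tags are namespace-qualified ("{namespace}page"),
            # so compare only the local name.
            if event == "end" and elem.tag.rsplit("}", 1)[-1] == "page":
                pages += 1
                root.clear()  # drop finished pages so memory use stays flat
    return pages


# Example (path taken from the command earlier in the thread):
# print(count_pages("/home/zhwiki-20160407-pages-articles.xml.bz2"))
```

The parse is streaming, so the 5.54G of XML never has to fit in memory, but a full pass over the archive will still take a while.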
I will give that a try. Thank you for your help.
You're welcome! Going to close for now but feel free to open it back up if needed.
How can I know whether all the data from the Wikipedia dump was stored in Elasticsearch, please? When I run the import command, it stops at:
2016-04-13T10:19:02.842+0800 DEBUG 03:05.733 572.7d/s 2415.3K/s 106363 1013 3159442 0
2016-04-13T10:19:04.208+0800 DEBUG 03:07.094 574.3d/s 2414.5K/s 107454 1091 3213698 0
2016-04-13T10:19:05.427+0800 DEBUG 03:08.317 576.6d/s 2415.2K/s 108575 1121 3162484 0
2016-04-13T10:19:06.541+0800 DEBUG 03:09.432 579.5d/s 2417.3K/s 109767 1192 3164995 0
without exiting, and I don't know whether it finished successfully.
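One way to check from the Elasticsearch side, independent of the stream2es log, is to ask the index how many documents it holds and compare that with the number of pages in the archive. The sketch below uses the Elasticsearch _count endpoint. The helper names (count_url, parse_count, indexed_docs) are made up for illustration, and the URL in the usage comment is the --target from the command in this thread.

```python
import json
import urllib.request


def count_url(target):
    """Build the _count URL for an index URL such as the --target value."""
    return target.rstrip("/") + "/_count"


def parse_count(body):
    """Extract the document count from a _count response body."""
    return json.loads(body)["count"]


def indexed_docs(target, timeout=10):
    """Ask Elasticsearch how many documents the index currently holds."""
    with urllib.request.urlopen(count_url(target), timeout=timeout) as resp:
        return parse_count(resp.read().decode("utf-8"))


# Example (needs a reachable Elasticsearch node):
# print(indexed_docs("http://192.168.120.90:9200/encyclopedia_wikizh"))
```

If the count keeps growing between two calls, the import is still making progress. If it stays flat and is well below the number of pages in the dump, the import has probably stalled.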