Closed: ryanrozich closed this issue 9 years ago
I assume that if this completes successfully, I should expect to have over 4.8MM documents in my index:
http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
Has anyone gotten this to run successfully recently?
All told you'll end up with around 15M docs because of metadata.
I can't dig into it at the moment, but I've reproduced it with stream2es 201502064c3af88
at home on an Ubuntu 15.04 Vivid Haswell workstation running Java 1.7u79. I think this is a bug in the Java lib stream2es is using (the WikiXMLSAXParser
class), maybe due to the dump format changing.
... same java.io.IOException: unexpected end of stream ...
2015-05-25T15:36:43.046-0500 INFO 52:37.500 171.7d/s 987.6K/s (3045.4mb) indexed 542027 streamed 542916 errors 0
Thanks for the reply, @drewr! Do you know of a particular previous wiki dump that stream2es should work on? I could import an old dump just to get started.
FYI - I opened an issue on the wikipedia river project that references this in case someone has a chance to look at the file format and that parser class.
https://github.com/elastic/elasticsearch-river-wikipedia/issues/49
Would love to find a way to get wikipedia imported into an ES index.
As a quick update, I tried downloading the dump locally and running stream2es on that, and it did not fail at ~500k documents. It is still running and chugging along at about 13.5MM documents so far, so I assume it will finish.
So, it would appear that streaming the dump file from the server causes the issue, since both @drewr and I could reproduce it that way, and when I downloaded the file locally it appears to be working.
Ah, makes sense. Maybe Wikipedia shuts it down after a certain amount of data is transferred. I usually index from a local dump, which must be why I've not noticed.
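For anyone who lands here later: a connection that drops mid-transfer leaves the parser reading a truncated archive, which is exactly the failure mode "unexpected end of stream" describes. A small sketch of the idea using the JDK's gzip support (the dump is bzip2, which the JDK can't read natively, so gzip stands in as an analogy here; requires Java 11+ for `transferTo`/`nullOutputStream`):

```java
import java.io.*;
import java.util.Arrays;
import java.util.zip.*;

public class TruncatedStreamDemo {
    public static void main(String[] args) throws IOException {
        // Compress some data in memory...
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(new byte[100_000]);
        }
        byte[] full = buf.toByteArray();

        // ...then cut the archive off partway through, as a dropped
        // HTTP connection would.
        byte[] truncated = Arrays.copyOf(full, full.length / 2);

        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(truncated))) {
            in.transferTo(OutputStream.nullOutputStream());
        } catch (EOFException e) {
            // The decompressor hits end-of-input before the archive's
            // end marker, same shape of failure as the wiki stream.
            System.out.println("decompressor failed: " + e.getMessage());
        }
    }
}
```

Downloading the dump first (as above) sidesteps this entirely, since a local file can't be cut short mid-read.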
Thanks for tracking it down @ryanrozich!
@ryanrozich I would appreciate it if you could shed some light on how you were able to load locally.
I am getting a strange silly error.
I am trying to load a local dump using the command specified in the repo, ./stream2es wiki --max-docs 5 --source absolutepath/enwiki-20150205-pages-articles.xml.bz2, but I get:
java.net.MalformedURLException: no protocol:
at java.net.URL.&lt;init&gt;
If I provide the URL as file:///absolutepathtofile or file:absolutepathtofile, it still throws an error:
java.net.UnknownHostException: file
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
	at java.net.Socket.connect(Socket.java:579)
	at java.net.Socket.connect(Socket.java:528)
	at sun.net.ftp.impl.FtpClient.doConnect(FtpClient.java:958)
	at sun.net.ftp.impl.FtpClient.tryConnect(FtpClient.java:918)
	at sun.net.ftp.impl.FtpClient.connect(FtpClient.java:1013)
	at sun.net.ftp.impl.FtpClient.connect(FtpClient.java:999)
	at sun.net.www.protocol.ftp.FtpURLConnection.connect(FtpURLConnection.java:294)
	at sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:393)
	at java.net.URL.openStream(URL.java:1037)
	at org.elasticsearch.river.wikipedia.support.WikiXMLParser.getInputSource(WikiXMLParser.java:77)
	at org.elasticsearch.river.wikipedia.support.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
	at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
	at stream2es.stream.wiki$fn6609$fn6610.invoke(wiki.clj:45)
	at stream2es.main$streamBANG.invoke(main.clj:241)
	at stream2es.main$main.invoke(main.clj:329)
	at stream2es.main$_main.doInvoke(main.clj:335)
	at clojure.lang.RestFn.applyTo(RestFn.java:137)
	at stream2es.main.main(Unknown Source)
I tried loading the "file URL" and opening it from another Java file, and it opens fine. This seems like a silly issue, but I am hoping you can shed some light on how you got it loading. Thanks!
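For what it's worth, the first error at least is expected behavior from java.net.URL: a bare filesystem path has no scheme, so the constructor rejects it outright. A quick illustration (the path is made up, not from this thread):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlSchemeDemo {
    public static void main(String[] args) throws MalformedURLException {
        try {
            // No scheme: java.net.URL refuses a bare path.
            new URL("/data/enwiki-20150205-pages-articles.xml.bz2");
        } catch (MalformedURLException e) {
            System.out.println(e.getMessage()); // "no protocol: /data/..."
        }

        // With a file: scheme the URL parses fine; whether stream2es
        // then opens it correctly is the separate bug discussed here.
        URL u = new URL("file:///data/enwiki-20150205-pages-articles.xml.bz2");
        System.out.println(u.getProtocol()); // "file"
    }
}
```

So the second stack trace (which ends up in the FTP handler despite a file: URL) is the more suspicious one, and presumably the new bug @drewr mentions below.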
@Smrithi16 This is a new bug. Working on it now!
Thanks @drewr! I am looking forward to the resolution!
@Smrithi16 One other note: if you're in a pinch, you can use an older version:
cd /tmp
curl https://download.elastic.co/stream2es/stream2es-2014122282ace27 >stream2es; chmod +x stream2es
./stream2es wiki --source /path/to/enwiki-20150205-pages-articles.xml.bz2
Hi @drewr Thanks! I am going to try the older version now
@drewr Indexing with the older version now. Thanks a lot! Hopefully it finishes.
I'm trying to import Wikipedia into an Elasticsearch index using stream2es and am getting the following error after indexing about 540K docs. I've re-run this multiple times on both the current wiki dump and older dumps; every run ends with the same error:
I first tried this on the default wikipedia dump: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Just in case it was a problem with the latest dump file, I also tried running with the previous month's dump file as the --source parameter: http://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-articles.xml.bz2
And got the exact same error at about the exact same doc count (~540k docs).
Here is the command I use to run stream2es:
(ES cluster URL redacted above)
I am running on an Ubuntu AWS instance; here is my version of Java:
I am running stream2es on a separate AWS instance from Elasticsearch (our Elasticsearch is hosted by Qbox on AWS, so we don't have shell access).
Any thoughts on how to get this to complete?