elastic / stream2es

Stream data into ES (Wikipedia, Twitter, stdin, or other ESes)

stream2es indexing of local wikipedia dump fails #50

Open stucker0530 opened 9 years ago

stucker0530 commented 9 years ago

I am getting the following error when attempting to ingest a local dump of the latest wikipedia. I am running ES 1.7.1 and stream2es 20150720170522978252e

[stream2es]$ ./stream2es wiki --max-docs 5 --source ./enwiki-latest-pages-articles1.xml.bz2
java.io.IOException: unexpected end of stream
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.bsGetBit(CBZip2InputStream.java:371)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:476)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:550)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:287)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.init(CBZip2InputStream.java:246)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.<init>(CBZip2InputStream.java:148)
    at org.elasticsearch.river.wikipedia.support.WikiXMLParser.getInputSource(WikiXMLParser.java:80)
    at org.elasticsearch.river.wikipedia.support.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
    at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
    at stream2es.stream.wiki$fn6612$fn6613.invoke(wiki.clj:45)
    at stream2es.main$streamBANG.invoke(main.clj:241)
    at stream2es.main$main.invoke(main.clj:329)
    at stream2es.main$_main.doInvoke(main.clj:335)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at stream2es.main.main(Unknown Source)
2015-09-11T11:13:32.937-0600 ERROR unexpected exception: java.io.IOException: unexpected end of stream
2015-09-11T11:13:33.056-0600 INFO 00:00.208 0.0d/s 0.0K/s (0.0mb) indexed 0 streamed 0 errors 0
[stream2es]$

drewr commented 9 years ago

Which dump did you download? You'd want this one:

Jbrunn commented 9 years ago

I'm having the same issue (without the max-docs option). I've tried both of the dumps that you suggested. I'm on OS X, if that makes any difference. I have also turned sleep off to eliminate that as a possible cause. The bz2 dump you suggested gave me the highest number of successfully processed documents so far: 534,792. Any guidance would be appreciated.

funnydevnull commented 8 years ago

I'm using the dump enwiki-20140707-pages-articles.xml.bz2 and so far it's working (but only 62k articles in so far).
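
Since the failure reported above is "unexpected end of stream" inside the bzip2 reader, one thing worth ruling out is a truncated or corrupt download. The following is a minimal sketch in Python, not part of stream2es; the function name check_bz2 and its return shape are made up for illustration. It decompresses the whole file with the standard bz2 module and reports how far it got. Note that it only tests the file itself: a file that passes here can still fail in stream2es if the problem lies in the bundled decompressor.

```python
import bz2


def check_bz2(path, chunk_size=1 << 20):
    """Decompress `path` end to end without keeping the data.

    Hypothetical helper for illustration only.
    Returns (ok, bytes_decompressed, error_message).
    """
    total = 0
    try:
        # bz2.open reads the file as a stream, so memory use stays bounded
        # by chunk_size even for multi-gigabyte dumps.
        with bz2.open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (OSError, EOFError, ValueError) as exc:
        # EOFError: the compressed data ended before the end-of-stream marker.
        # OSError: the data is not a valid bzip2 stream.
        return False, total, str(exc)
    return True, total, ""
```

For example, `print(check_bz2("./enwiki-latest-pages-articles1.xml.bz2"))` would show whether the dump from the original report decompresses cleanly and how many bytes were read before any error.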