elastic / stream2es

Stream data into ES (Wikipedia, Twitter, stdin, or other ESes)

stream2es indexing of wikipedia fails at about 540K docs with IOException: unexpected end of stream #47

Closed ryanrozich closed 9 years ago

ryanrozich commented 9 years ago

I'm trying to import Wikipedia into an Elasticsearch index using stream2es, and I get the following error after indexing about 540K docs. I've re-run this multiple times on both the current wiki dump and older dumps, and every run fails with the same error:

2015-05-24T15:45:29.300+0000 DEBUG 79:19.539 112.8d/s 648.6K/s 536993 682 3156312 0 881514
2015-05-24T15:45:33.597+0000 DEBUG 79:23.836 113.0d/s 648.6K/s 538112 1119 3165607 0 882684
java.io.IOException: unexpected end of stream
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:624)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:287)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.setupNoRandPartA(CBZip2InputStream.java:844)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.setupNoRandPartB(CBZip2InputStream.java:893)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.read0(CBZip2InputStream.java:210)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:178)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.read1(BufferedReader.java:210)
    at java.io.BufferedReader.read(BufferedReader.java:286)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1736)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1408)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2823)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at org.elasticsearch.river.wikipedia.support.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
    at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:298)
    at stream2es.stream.wiki$fn__3723$fn__3724.invoke(wiki.clj:44)
    at stream2es.main$stream_BANG_.invoke(main.clj:241)
    at stream2es.main$main.invoke(main.clj:330)
    at stream2es.main$_main.doInvoke(main.clj:336)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at stream2es.main.main(Unknown Source)
2015-05-24T15:45:37.653+0000 ERROR unexpected exception: java.io.IOException: unexpected end of stream
2015-05-24T15:45:37.860+0000 INFO  79:28.099 112.9d/s 648.1K/s (3017.6mb) indexed 538112 streamed 539294 errors 0

I first tried this on the default wikipedia dump: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

In case the problem was with the latest dump file, I also tried the previous month's dump file as the --source parameter: http://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-articles.xml.bz2

I got the same error at about the same doc count (~540K docs).

Here is the command I use to run stream2es:

./stream2es wiki --target http://<ES cluster>/wikipedia --log debug --source http://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-articles.xml.bz2

(ES cluster url redacted above)

I am running on an Ubuntu AWS instance. Here is my version of Java:

$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

I am running stream2es on a separate AWS instance from Elasticsearch (our Elasticsearch is hosted by qbox on AWS, so we don't have shell access).

Any thoughts on how to get this to complete?

ryanrozich commented 9 years ago

I assume that if this completes successfully, I should expect to have over 4.8MM documents in my index:

http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

Has anyone gotten this to run successfully recently?

drewr commented 9 years ago

All told you'll end up with around 15M docs because of metadata.

I can't dig into it at the moment, but I've reproduced it with stream2es 201502064c3af88 at home on an Ubuntu 15.04 Vivid Haswell workstation running Java 1.7u79. I think this is a bug in the Java lib stream2es is using (the WikiXMLSAXParser class), maybe due to the dump format changing.

... same java.io.IOException: unexpected end of stream ...
2015-05-25T15:36:43.046-0500 INFO  52:37.500 171.7d/s 987.6K/s (3045.4mb) indexed 542027 streamed 542916 errors 0

ryanrozich commented 9 years ago

Thanks for the reply @drewr ! Do you know of a particular previous wiki dump that stream2es should work on? I could import an old dump just to get started.

ryanrozich commented 9 years ago

FYI - I opened an issue on the wikipedia river project that references this in case someone has a chance to look at the file format and that parser class.

https://github.com/elastic/elasticsearch-river-wikipedia/issues/49

Would love to find a way to get wikipedia imported into an ES index.

ryanrozich commented 9 years ago

As a quick update, I downloaded the dump locally and ran stream2es on that, and it did not fail at ~500K documents. It is still running, at about 13.5MM documents so far, so I assume it will finish.

So it appears that streaming the dump file from the server causes the issue: both @drewr and I could reproduce the failure that way, and indexing from a locally downloaded file works.
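For anyone who wants to script the download step, here is a minimal Python sketch that saves a dump to disk and compares the byte count with the size the server announced. The function name `download_dump` is hypothetical and not part of stream2es, and the sketch assumes the server sends a `Content-Length` header; if it does not, the size check is skipped.

```python
import os
import shutil
import urllib.request


def download_dump(url, dest_path, chunk_size=1 << 20):
    """Save `url` to `dest_path` and return the number of bytes written.

    Raises IOError if the server announced a Content-Length that does not
    match the size of the file on disk (for example, a transfer cut short).
    """
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Header lookup is case-insensitive; None means no size was announced.
        expected = response.headers.get("Content-Length")
        # Copy in fixed-size chunks so a multi-gigabyte dump never sits in memory.
        shutil.copyfileobj(response, out, chunk_size)
    written = os.path.getsize(dest_path)
    if expected is not None and int(expected) != written:
        raise IOError(f"short download: got {written} of {expected} bytes")
    return written
```

The saved file can then be passed to `--source` as a local path instead of the dump URL.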

drewr commented 9 years ago

Ah, makes sense. Maybe Wikipedia shuts it down after a certain amount of data is transferred. I usually index from a local dump, which must be why I've not noticed.

Thanks for tracking it down @ryanrozich!
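If the remote transfer is being cut short, a truncated local copy would fail in the same way, so it can be worth checking a downloaded dump before indexing it. The following is a minimal Python sketch using the standard `bz2` module; the function name `bz2_is_complete` is hypothetical, and the check decompresses the whole file, so it takes a while on a full dump.

```python
import bz2


def bz2_is_complete(path, chunk_size=1 << 20):
    """Return True if the bzip2 file at `path` decompresses through to its end marker."""
    try:
        with bz2.open(path, "rb") as stream:
            # Read and discard the output in chunks so memory use stays flat.
            while stream.read(chunk_size):
                pass
    except EOFError:
        # The file ended before the end-of-stream marker: it is truncated.
        return False
    except OSError:
        # The data is not a valid bzip2 stream.
        return False
    return True
```

A `False` result suggests the file should be downloaded again before running stream2es on it.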

kanchana-padmanabhan commented 9 years ago

@ryanrozich I would appreciate it if you could shed some light on how you were able to load the dump locally.

I am getting a strange silly error.

I am trying to load from a local dump using the command specified in the repo:

./stream2es wiki --max-docs 5 --source absolutepath/enwiki-20150205-pages-articles.xml.bz2

but I get:

java.net.MalformedURLException: no protocol:
    at java.net.URL.<init>(URL.java:585)
    at java.net.URL.<init>(URL.java:482)
    at java.net.URL.<init>(URL.java:431)
    at stream2es.http$components.invoke(http.clj:34)
    at stream2es.http$make_target.invoke(http.clj:74)
    at stream2es.stream.wiki$fn6613.invoke(wiki.clj:42)
    at stream2es.stream$fn5797$G57905804.invoke(stream.clj:22)
    at stream2es.bootstrap$boot.invoke(bootstrap.clj:70)
    at stream2es.main$_main.doInvoke(main.clj:335)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at stream2es.main.main(Unknown Source)
2015-06-19T10:16:37.993-0400 ERROR unexpected exception: java.net.MalformedURLException: no protocol:

If I provide the URL as file:///absolutepathtofile or file:absolutepathtofile, it still throws an error:

java.net.UnknownHostException: file
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
    at java.net.Socket.connect(Socket.java:579)
    at java.net.Socket.connect(Socket.java:528)
    at sun.net.ftp.impl.FtpClient.doConnect(FtpClient.java:958)
    at sun.net.ftp.impl.FtpClient.tryConnect(FtpClient.java:918)
    at sun.net.ftp.impl.FtpClient.connect(FtpClient.java:1013)
    at sun.net.ftp.impl.FtpClient.connect(FtpClient.java:999)
    at sun.net.www.protocol.ftp.FtpURLConnection.connect(FtpURLConnection.java:294)
    at sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:393)
    at java.net.URL.openStream(URL.java:1037)
    at org.elasticsearch.river.wikipedia.support.WikiXMLParser.getInputSource(WikiXMLParser.java:77)
    at org.elasticsearch.river.wikipedia.support.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
    at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
    at stream2es.stream.wiki$fn6609$fn6610.invoke(wiki.clj:45)
    at stream2es.main$streamBANG.invoke(main.clj:241)
    at stream2es.main$main.invoke(main.clj:329)
    at stream2es.main$_main.doInvoke(main.clj:335)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at stream2es.main.main(Unknown Source)

I tried loading the "file URL" and opening it from another java file and it seems to open. This seems like a silly issue but I am hoping you can shed some light on how you got it loading. Thanks

drewr commented 9 years ago

@Smrithi16 This is a new bug. Working on it now!

kanchana-padmanabhan commented 9 years ago

Thanks @drewr! I am looking forward to the resolution!

drewr commented 9 years ago

@Smrithi16 One other note: if you're in a pinch, you can use an older version:

cd /tmp
curl https://download.elastic.co/stream2es/stream2es-2014122282ace27 >stream2es; chmod +x stream2es
./stream2es wiki --source /path/to/enwiki-20150205-pages-articles.xml.bz2

kanchana-padmanabhan commented 9 years ago

Hi @drewr Thanks! I am going to try the older version now

kanchana-padmanabhan commented 9 years ago

@drewr Indexing with the older version now. Thanks a lot! Hopefully it finishes.