elastic / stream2es

Stream data into ES (Wikipedia, Twitter, stdin, or other ESes)

Update the protocol and host of the Wikipedia tool #51

Closed damienalexandre closed 9 years ago

damienalexandre commented 9 years ago

Hi there! Thanks for this tool. It's very nice to be able to populate an index quickly from different sources. Love it!

The error

This PR fixes the Empty InputStream error raised when using stream2es wiki, as shown below:

$ ./stream2es wiki --target http://localhost:9200/tmp --log debug          

2015-09-11T21:54:41.887+0200 DEBUG stream wiki from http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 to http://localhost:9200/tmp
java.io.IOException: Empty InputStream
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.init(CBZip2InputStream.java:229)
    at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.<init>(CBZip2InputStream.java:148)
    at org.elasticsearch.river.wikipedia.support.WikiXMLParser.getInputSource(WikiXMLParser.java:80)
    at org.elasticsearch.river.wikipedia.support.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
    at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
    at stream2es.stream.wiki$fn__6612$fn__6613.invoke(wiki.clj:45)
    at stream2es.main$stream_BANG_.invoke(main.clj:241)
    at stream2es.main$main.invoke(main.clj:329)
    at stream2es.main$_main.doInvoke(main.clj:335)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at stream2es.main.main(Unknown Source)
2015-09-11T21:54:42.161+0200 ERROR unexpected exception: java.io.IOException: Empty InputStream
2015-09-11T21:54:42.326+0200 INFO  00:00.676 0.0d/s 0.0K/s (0.0mb) indexed 0 streamed 0 errors 0

The issue

It looks like stream2es gets an empty response when fetching http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2, which is expected because that URL now redirects, first to HTTPS and then to a different host:

wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
HTTP request sent, awaiting response... 301 TLS Redirect
Location: https://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 [following]

HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 [following]

HTTP request sent, awaiting response... 200 OK
Length: 12258835493 (11G) [application/octet-stream]
Saving to: ‘enwiki-latest-pages-articles.xml.bz2’
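
A possible explanation, not verified against the stream2es or wiki river source: Java's HttpURLConnection follows redirects by default but does not follow one that switches protocol (http to https), so a client built on it would stop at the first 301 and read an empty body. The sketch below simulates the three hops from the wget output in Python without touching the network. HOPS, fake_fetch, and resolve are hypothetical names used only for illustration.

```python
from urllib.parse import urljoin, urlsplit

PATH = "/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
START = "http://download.wikimedia.org" + PATH
FINAL = "https://dumps.wikimedia.org" + PATH

# Redirect hops copied from the wget output above: url -> (status, Location).
HOPS = {
    START: (301, "https://download.wikimedia.org" + PATH),
    "https://download.wikimedia.org" + PATH: (301, FINAL),
    FINAL: (200, None),
}


def fake_fetch(url):
    """Stand-in for an HTTP request; returns (status, Location header)."""
    return HOPS[url]


def resolve(url, fetch, cross_protocol=True, max_hops=5):
    """Follow redirects and return (last_url, last_status).

    With cross_protocol=False the function stops at any redirect that
    changes the URL scheme, which is the assumed behaviour of the client.
    """
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in (301, 302, 303, 307, 308) or location is None:
            return url, status
        target = urljoin(url, location)
        if not cross_protocol and urlsplit(target).scheme != urlsplit(url).scheme:
            return url, status  # refuse the http -> https hop
        url = target
    raise RuntimeError("too many redirects")


if __name__ == "__main__":
    print(resolve(START, fake_fetch))                        # wget-like client
    print(resolve(START, fake_fetch, cross_protocol=False))  # stops at first 301
```

Under that assumption, pointing the tool directly at the final https://dumps.wikimedia.org URL avoids the redirect chain entirely.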

The fix

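There are two solutions:

Merge this PR, which updates the protocol and host of the default Wikipedia dump URL.

If this PR doesn't get merged, you can still use the wiki import tool by passing the final URL with the --source option: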

./stream2es wiki --target http://localhost:9200/tmp --log debug --source https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

drewr commented 9 years ago

Thanks for reporting that @damienalexandre!

damienalexandre commented 9 years ago

Awesome, thanks for the merge!

Do you know when http://download.elasticsearch.org/stream2es/stream2es is going to be updated? Cheers.

drewr commented 9 years ago

Just now! :grin: