elastic / stream2es

Stream data into ES (Wikipedia, Twitter, stdin, or other ESes)

XML document structures must start and end within the same entity #55

Closed by ghost 8 years ago

ghost commented 8 years ago

I ran stream2es on the local Wikipedia dump enwiki-20151102-pages-articles-multistream.xml.bz2:

./stream2es wiki --source [dir]/enwiki-20151102-pages-articles-multistream.xml.bz2

but got the error message:

[Fatal Error] :46:1: XML document structures must start and end within the same entity.
org.xml.sax.SAXParseException; lineNumber: 46; columnNumber: 1; XML document structures must start and end within the same entity.
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
at org.elasticsearch.river.wikipedia.support.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
at stream2es.stream.wiki$fn__6625$fn__6626.invoke(wiki.clj:45)
at stream2es.main$stream_BANG_.invoke(main.clj:245)
at stream2es.main$main.invoke(main.clj:333)
at stream2es.main$_main.doInvoke(main.clj:339)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at stream2es.main.main(Unknown Source)

I've also tried other dumps, simplewiki-20151102-pages-articles-multistream.xml.bz2 and simplewiki-20150901-pages-articles-multistream.xml.bz2.

The same error occurs. I'm not sure how to fix it.

drewr commented 8 years ago

Sorry for not responding! Hopefully you found that you needed the enwiki-20151201-pages-articles-xml.bz2 file instead.

ghost commented 8 years ago

I found that the problem lies in the bz2 decompression of the XML file (I didn't check the details). It is not a problem in the stream2es code: feeding the uncompressed XML file instead of the .bz2 gets the job done, so I closed the issue.
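For anyone hitting the same error, here is a minimal sketch of that workaround in Python. It rests on an assumption the thread does not confirm: that the "multistream" dumps are several bz2 streams concatenated, and that a reader which stops after the first stream sees only the opening part of the XML. The names `decompress_bz2` and `first_stream_only` are hypothetical helpers for illustration, not part of stream2es.

```python
import bz2
import shutil


def decompress_bz2(src_path: str, dst_path: str, chunk_size: int = 1 << 20) -> None:
    """Decompress a .bz2 file to dst_path, reading every concatenated stream.

    bz2.open handles multi-stream files transparently (Python 3.3+).
    """
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


def first_stream_only(data: bytes) -> bytes:
    """Decode only the first bz2 stream in data.

    Illustrates the assumed failure mode: BZ2Decompressor stops at the end
    of the first stream and leaves the remaining bytes in `unused_data`,
    so the decoded XML is cut off before its closing tags.
    """
    decompressor = bz2.BZ2Decompressor()
    return decompressor.decompress(data)
```

The decompressed file written by `decompress_bz2` can then be passed to `./stream2es wiki --source` in place of the .bz2 file, as described above.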

Anyway, thanks for the reply.

On Dec 16, 2015, at 4:57 PM, Drew Raines wrote:

Sorry for not responding! Hopefully you found that you needed the enwiki-20151201-pages-articles-xml.bz2 file (http://burnbit.com/torrent/427846/enwiki_20151201_pages_articles_xml_bz2) instead.
