adeydas / wikixmlj

Automatically exported from code.google.com/p/wikixmlj
cannot parse bzip2 file #17

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

Because I am short on disk space, I want to parse the .bz2 file directly, 
without decompressing it first.
I get the following error:

java -cp wikixmlj-r43.jar:.:bzip2.jar:.:xercesImpl-2.9.1.jar Test 
enwiki-20130503-pages-articles-multistream.xml.bz2 

[Fatal Error] :38:1: XML document structures must start and end within the same 
entity.
org.xml.sax.SAXParseException: XML document structures must start and end 
within the same entity.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at edu.jhu.nlp.wikipedia.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:58)
    at Test.main(Test.java:25)

Does wikixmlj parse .bz2 files, or does it work only on uncompressed XML files?

Original issue reported on code.google.com by hemantas...@gmail.com on 10 Jun 2013 at 7:02

GoogleCodeExporter commented 9 years ago
The latest dumps are multi-stream bz2 files, unlike earlier dumps, which were 
ordinary (single-stream) bz2 files. wikixmlj assumes an ordinary bz2 file, so 
it appears to read only the first stream and hands the XML parser a document 
that is cut off early, hence the error above. One solution that I found online 
is http://chaosinmotion.com/blog/?p=723

I have yet to try it. I will post soon about my success or failure.

Original comment by hemantas...@gmail.com on 10 Jun 2013 at 11:37
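
To illustrate the diagnosis above, here is a minimal sketch that uses only the JDK's built-in SAX parser. It assumes that a single-stream decoder returns just the start of the dump, with the closing tag missing. The class name, the sample XML, and the `parseError` helper are hypothetical and are not part of wikixmlj.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class TruncatedXmlDemo {

    // Hypothetical stand-in for what a single-stream decoder might return for a
    // multi-stream dump: the header only, without the closing </mediawiki> tag.
    static final String FIRST_STREAM_ONLY =
        "<mediawiki>\n"
        + "  <siteinfo>\n"
        + "    <sitename>Wikipedia</sitename>\n"
        + "  </siteinfo>\n";

    // The same header followed by one page and the closing root tag.
    static final String COMPLETE =
        FIRST_STREAM_ONLY
        + "  <page><title>A</title></page>\n"
        + "</mediawiki>\n";

    /** Returns null if the XML parses cleanly, otherwise a description of the error. */
    static String parseError(String xml) {
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        try {
            // DefaultHandler rethrows fatal errors, so a truncated document ends up here.
            SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println("complete:  " + parseError(COMPLETE));
        System.out.println("truncated: " + parseError(FIRST_STREAM_ONLY));
    }
}
```

If this is indeed the cause, the fix is to use a decoder that keeps reading past the end of the first stream. One option, which I have not tested with wikixmlj, is the `BZip2CompressorInputStream(InputStream, boolean)` constructor in Apache Commons Compress, whose second argument (`decompressConcatenated`) tells it to decode all concatenated streams.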