DataMachine : unexpected end of stream

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Download the wiki dump dated 2011-05-26 or 2011-05-04
2. Run JWPL_DATAMACHINE_0.6.0.jar with options english Categories Disambiguation_pages

What is the expected output? What do you see instead?
Expected: parsing completes and the output folder is filled with the parsed 
content. I tried using bunzip2 to decompress pages-articles.xml.bz2, and that 
worked fine, but running JWPL_DATAMACHINE_0.6.0 fails. The same thing happens 
with both wiki dumps, dated 2011-05-26 and 2011-05-04.

Here is the complete stack trace:

Loading XML bean definitions from class path resource [context/applicationContext.xml]
parse input dumps...
Discussions are available
unexpected end of stream

org.apache.tools.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:706)
org.apache.tools.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:289)
org.apache.tools.bzip2.CBZip2InputStream.setupNoRandPartA(CBZip2InputStream.java:846)
org.apache.tools.bzip2.CBZip2InputStream.setupNoRandPartB(CBZip2InputStream.java:902)
org.apache.tools.bzip2.CBZip2InputStream.read0(CBZip2InputStream.java:212)
org.apache.tools.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:180)
org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:207)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:47)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:616)
org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)

What version of the product are you using? On what operating system?
The OS is Ubuntu Linux 10 and the JWPL version is 0.6.

Please reply soon with any suggestion or fix; we are unable to proceed. Can I 
use JWPL with any wiki dump without any changes?

Thanks,
Shareeka

Original issue reported on code.google.com by ambha.ca...@gmail.com on 27 Jun 2011 at 7:47

GoogleCodeExporter commented 9 years ago

I just successfully converted the English dump from 2011-05-26 using 
JWPL_DATAMACHINE_0.6.0.jar with options english Categories Disambiguation_pages.

Two things to consider:
(i) Always verify the checksum after downloading a dump to avoid working with 
corrupt files.
(ii) The line "Discussions are available" in your output indicates that you 
are using an unnecessarily large dump (with unknown side effects). Make sure 
you use the file "pages-articles.xml.bz2" and not one of the larger dumps, 
unless you need their additional content.
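
For point (i), a minimal Python sketch of the check, as an alternative to running md5sum by hand. It assumes the dump directory publishes a checksum listing in the usual md5sum format ("<hex digest>  <file name>" per line); the function names and the idea of passing the listing as text are illustrative only and are not part of JWPL.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a multi-gigabyte dump never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse md5sum-style lines into a {file name: digest} mapping.
    Assumption: one "<hex digest>  <file name>" entry per line."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # md5sum marks binary-mode entries with a leading "*"
            name = parts[1].strip().lstrip("*")
            expected[name] = parts[0].lower()
    return expected


def verify(path, md5sums_text):
    """Return True if the file's MD5 matches its entry in the listing."""
    expected = parse_md5sums(md5sums_text)
    name = Path(path).name
    if name not in expected:
        raise KeyError(f"no checksum listed for {name}")
    return md5_of_file(path) == expected[name]
```

Calling verify() with the path to the downloaded pages-articles.xml.bz2 file and the contents of the checksum listing returns False for a corrupt or incomplete download.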

-Torsten

Original comment by torsten....@gmail.com on 28 Jun 2011 at 12:17

Changed state: Irreproducible

GoogleCodeExporter commented 9 years ago

Thanks, Torsten, for the quick reply.

I just want to verify the steps with you. Here is what I did:
I downloaded the files below from http://dumps.wikimedia.org/enwiki/20110526/ as 
mentioned in http://code.google.com/p/jwpl/wiki/DataMachine

enwiki-20110526-pages-articles.xml.bz2
enwiki-20110526-categorylinks.sql.gz
enwiki-20110526-pagelinks.sql.gz

The files downloaded without any network interruption, but their sizes did not 
exactly match the sizes listed on the download page.

I verified the checksums with the md5sum command available in Ubuntu 10.10, but 
they do not match. I downloaded the files twice (deleting the first copy once 
the second had finished) and got the same error. So I wonder how it can go 
wrong both times...
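
A truncated archive is one possible cause of an "unexpected end of stream" error, so it may help to test the .bz2 file independently of JWPL. The following is a minimal sketch using Python's standard bz2 module; the function name check_bz2 is made up for illustration, and a clean result here does not prove that the DataMachine's own bzip2 reader will accept the file.

```python
import bz2


def check_bz2(path, chunk_size=1024 * 1024):
    """Decompress the whole archive, discarding the output.
    Returns (ok, detail) instead of raising, so the result is easy to log."""
    total = 0
    try:
        with bz2.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except EOFError:
        # The compressed data stopped before the end-of-stream marker
        return False, f"truncated after {total} decompressed bytes"
    except OSError as error:
        # The data is not valid bzip2 at all
        return False, f"invalid bzip2 data: {error}"
    return True, f"{total} decompressed bytes"
```

If check_bz2 reports a truncated file, the download is incomplete and needs to be fetched again before running the DataMachine.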

Thanks

Original comment by ambha.ca...@gmail.com on 28 Jun 2011 at 6:36

GoogleCodeExporter commented 9 years ago

How are you downloading the files?
Try using wget. It's less likely to produce corrupt files.

-Oliver

Original comment by oliver.ferschke on 28 Jun 2011 at 10:45
