dnmilne / wikipediaminer

An open source toolkit for mining Wikipedia

Running the Dump Extractor fails #33

Closed QuytNguyen closed 7 years ago

QuytNguyen commented 7 years ago

I get the following error when running the Dump Extractor from the jar:

    hadoop jar wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor /input/viwiki-20170320-pages-articles-multistream.xml /input/languages.xml vi /input/en-sent.bin /output

How can I fix it?

    Exception in thread "main" javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
    Message: Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)
        at org.wikipedia.miner.extraction.LanguageConfiguration.init(Unknown Source)
        at org.wikipedia.miner.extraction.LanguageConfiguration.<init>(Unknown Source)
        at org.wikipedia.miner.extraction.DumpExtractor.configure(Unknown Source)
        at org.wikipedia.miner.extraction.DumpExtractor.<init>(Unknown Source)
        at org.wikipedia.miner.extraction.DumpExtractor.main(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

apohllo commented 7 years ago

@QuytNguyen have you managed to run the extractor with success?

QuytNguyen commented 7 years ago

This error was caused by a BOM (byte order mark) in languages.xml. I converted the file to UTF-8 and the error went away.
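
For anyone hitting the same "Content is not allowed in prolog" error at [1,1], here is a minimal sketch of one way to check for and remove a leading BOM. It is not part of Wikipedia Miner: the class name BomStripper and its methods are made up for illustration. It assumes the file is already UTF-8 and only carries the three-byte UTF-8 BOM (EF BB BF). A file in another encoding such as UTF-16 would need a real conversion instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Wikipedia Miner.
public class BomStripper {

    // The three bytes that encode U+FEFF in UTF-8.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the data begins with the UTF-8 byte order mark. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    /** Returns the data without a leading UTF-8 BOM; other input is returned unchanged. */
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data)
                ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length)
                : data;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomStripper <file>");
            System.exit(1);
        }
        Path path = Paths.get(args[0]);
        byte[] original = Files.readAllBytes(path);
        byte[] cleaned = stripUtf8Bom(original);
        if (cleaned.length != original.length) {
            // Overwrites the file in place; keep a backup if that matters.
            Files.write(path, cleaned);
            System.out.println("Removed UTF-8 BOM from " + path);
        } else {
            System.out.println("No UTF-8 BOM found in " + path);
        }
    }
}
```

After compiling, this could be run on a local copy as "java BomStripper languages.xml". The cleaned file would then need to be put back wherever the extractor reads it from.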