Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). It also lets you manipulate the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

[DOMSplitter] StackOverflow with norconex-importer 2.5.2 #24

Closed: sylvainroussy closed this issue 8 years ago

sylvainroussy commented 8 years ago

With the following configuration (crawling depth 0):

<importer>
    <preParseHandlers>
        <splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
                selector=".caption" sourceCharset="UTF-8"/>
[...]
</importer>

I get a StackOverflowError:

java.lang.StackOverflowError
    at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
    at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
    at java.io.File.exists(File.java:819)
    at sun.misc.FileURLMapper.exists(FileURLMapper.java:78)
    at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:890)
    at sun.misc.URLClassPath$JarLoader.access$700(URLClassPath.java:756)
    at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:838)
    at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:831)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:830)
    at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:803)
    at sun.misc.URLClassPath$JarLoader$3.run(URLClassPath.java:1057)
    at sun.misc.URLClassPath$JarLoader$3.run(URLClassPath.java:1054)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1053)
    at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1013)
    at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:983)
    at sun.misc.URLClassPath$1.next(URLClassPath.java:240)
    at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:250)
    at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
    at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
    at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
    at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
    at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
    at java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:354)
    at java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393)
    at java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474)
    at javax.xml.parsers.FactoryFinder$1.run(FactoryFinder.java:293)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.xml.parsers.FactoryFinder.findServiceProvider(FactoryFinder.java:289)
    at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:267)
    at javax.xml.parsers.SAXParserFactory.newInstance(SAXParserFactory.java:127)
    at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:51)
    at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:42)
    at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:206)
    at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:472)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at com.norconex.importer.doc.ContentTypeDetector.doDetect(ContentTypeDetector.java:111)
    at com.norconex.importer.doc.ContentTypeDetector.detect(ContentTypeDetector.java:75)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:233)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:280)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:280)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:280)
[...]

essiembre commented 8 years ago

This exception can sometimes be caused by too much recursion. It is likely related to your specific document and to what exactly your DOM selector matches. Can you attach a copy of the file that is causing this issue? Maybe there is a way to change your selector to avoid this (or I can otherwise provide a fix).
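
To illustrate the kind of recursion meant here, below is a toy model in plain Java. It is not the Norconex implementation: the class `SplitRecursionSketch`, its methods, the regular expression standing in for the `.caption` selector, and the `maxDepth` guard are all hypothetical. It assumes a splitter that hands each matched element back as a child document which is then imported again, which would be consistent with the alternating `importDocument` / `doImportDocument` frames in the trace. If the child still contains the matched element's own markup, it matches again and the recursion never ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitRecursionSketch {

    // Hypothetical stand-in for a ".caption" selector: matches each
    // <p class="caption">...</p> element and captures its inner text.
    private static final Pattern CAPTION =
            Pattern.compile("<p class=\"caption\">(.*?)</p>");

    // Returns one child document per match. With keepWrapper=true the child
    // keeps the matched element's own tag, so it matches the selector again.
    static List<String> split(String content, boolean keepWrapper) {
        List<String> children = new ArrayList<>();
        Matcher m = CAPTION.matcher(content);
        while (m.find()) {
            children.add(keepWrapper ? m.group() : m.group(1));
        }
        return children;
    }

    // Imports a document and, recursively, every child produced by split().
    // Returns the number of documents imported. Throws once maxDepth is
    // exceeded, instead of letting the call stack overflow.
    static int importDocument(String content, boolean keepWrapper,
                              int depth, int maxDepth) {
        if (depth > maxDepth) {
            throw new IllegalStateException(
                    "Split recursion deeper than " + maxDepth);
        }
        int imported = 1;
        for (String child : split(content, keepWrapper)) {
            imported += importDocument(child, keepWrapper, depth + 1, maxDepth);
        }
        return imported;
    }

    public static void main(String[] args) {
        String page = "<html><p class=\"caption\">A</p>"
                + "<p class=\"caption\">B</p></html>";

        // Children are inner text only: they no longer match, recursion ends.
        System.out.println(importDocument(page, false, 0, 50));

        // Children keep their wrapper: every child matches itself again.
        try {
            importDocument(page, true, 0, 50);
        } catch (IllegalStateException e) {
            System.out.println("Stopped: " + e.getMessage());
        }
    }
}
```

In this model the recursion only ends when the child documents stop matching the selector, which is why both the document and the selector matter when diagnosing the error.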

sylvainroussy commented 8 years ago

Hi, OK, my source page has changed since my last message; it was only a test of this component. I am closing this ticket and taking note of your explanation. Thanks.