Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

HTTP Collector doesn't crawl dynamically generated websites? #476

Closed tdrobcsak closed 6 years ago

tdrobcsak commented 6 years ago

Hi

I would like to extract experts' contact information from a site that dynamically generates a list of available experts.

I saved these dynamically created pages into a webpage-list file containing the following URLs:

https://loc.salon-expert.hu/#sal_1
https://loc.salon-expert.hu/#sal_2
https://loc.salon-expert.hu/#sal_3
https://loc.salon-expert.hu/#sal_4
https://loc.salon-expert.hu/#sal_5
https://loc.salon-expert.hu/#sal_6
https://loc.salon-expert.hu/#sal_7
https://loc.salon-expert.hu/#sal_8
https://loc.salon-expert.hu/#sal_9

Can you help me understand what I'm doing wrong? Below is my HTTP Collector config.xml. The collector doesn't walk through the list of URLs I collected in the file above: it stops after fetching the content of https://loc.salon-expert.hu/, so it never fetches the information I need.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="SzalonExpert.hu Collector">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlerDefaults>

    <robotsTxt ignore="true" />

    <startURLs stayOnDomain="true">
       <urlsFile>./examples/Loreal-Fodraszat/webpage-list</urlsFile> 
    </startURLs>
    <urlNormalizer class="$urlNormalizer" />
    <numThreads>1</numThreads>
    <maxDepth>4</maxDepth>
    <workDir>$workdir</workDir>
   <!-- <orphansStrategy>DELETE</orphansStrategy>-->

     <!--<sitemapResolverFactory ignore="false" />-->

    <referenceFilters>
        <!--<filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js</filter>-->
        <filter class="$filterRegexRef" onMatch="include">https://loc.salon-expert.hu/#sal_\d+</filter>
    </referenceFilters>

    <!--<documentFetcher detectContentType="true" detectCharset="true"/> -->
    <documentFilters>
            <filter class="$filterRegexRef" onMatch="include">https://loc.salon-expert.hu/#sal_\d+</filter> 
    </documentFilters> 

  </crawlerDefaults>

  <crawlers>

    <crawler id="Expert Page ">
      <robotsTxt ignore="true" />
      <keepDownloads>true</keepDownloads>
      #parse("shared/importer-config.xml")
      <committer class="com.norconex.committer.core.impl.MultiCommitter">
        <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
          <directory>${workdir}/crawledFilesMETA</directory>
        </committer>
        <committer class="com.norconex.committer.core.impl.XMLFileCommitter">
          <directory>${workdir}/crawledFilesXML</directory>
          <docsPerFile>1</docsPerFile>
          <pretty>true</pretty>
          <splitAddDelete>false</splitAddDelete>
        </committer>
      </committer>
    </crawler>

  </crawlers>

</httpcollector>

essiembre commented 6 years ago

At first glance, it seems to be the default behavior of stripping URL "fragments" (which are normally just anchors within the same page). In your case, if you need to preserve the # sign, have a look at GenericURLNormalizer.

You can overwrite the default behavior by taking out removeFragment from the default list of normalization rules, like this:

  <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
    <normalizations>
        lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort, 
        encodeNonURICharacters, addWWW 
    </normalizations>
  </urlNormalizer>
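To illustrate why fragment stripping makes the crawl stop after one page, here is a minimal sketch in Python (not Norconex code). It assumes the nine start URLs from the question and uses the standard library's `urldefrag` as a stand-in for the `removeFragment` rule; the helper name `strip_fragment` is made up for this example.

```python
from urllib.parse import urldefrag

# The nine start URLs from the question differ only in their fragment.
start_urls = [f"https://loc.salon-expert.hu/#sal_{i}" for i in range(1, 10)]


def strip_fragment(url: str) -> str:
    """Drop everything from '#' onward (stand-in for removeFragment)."""
    return urldefrag(url).url


# After stripping, every start URL collapses to the same reference.
unique_after_stripping = {strip_fragment(u) for u in start_urls}
print(len(start_urls), "start URLs ->", len(unique_after_stripping), "unique reference")
print(unique_after_stripping)
```

With the fragment removed, all nine entries become the same reference, which matches the observed behavior of only https://loc.salon-expert.hu/ being fetched.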

Note, though, that if your page is fully dynamic (JavaScript-driven), this will not solve all your problems. The HTTP Collector does not execute JavaScript. Luckily, for those sites you can integrate with PhantomJS. Have a look at PhantomJSDocumentFetcher.

tdrobcsak commented 6 years ago

Thanks for the normalizer tip. I was able to check each URL; however, the pages have dynamically created content that I would like to fetch, so I added the PhantomJSDocumentFetcher to my config with the following tag:

<documentFetcher class="${http}.fetch.impl.PhantomJSDocumentFetcher"
    detectContentType="false" detectCharset="false" screenshotEnabled="false">
  <exePath>phantomjs-2.1.1-macosx/bin/phantomjs</exePath>
  <scriptPath>scripts/phantom.js</scriptPath>
  <renderWaitTime>5000</renderWaitTime>
  <validStatusCodes>200</validStatusCodes>
  <notFoundStatusCodes>404</notFoundStatusCodes>
</documentFetcher>

Yet I'm getting the following error message:

ERROR [AbstractCrawler] Expert Page : Could not process document: https://loc.salon-expert.hu/#sal_1 (null)
java.lang.NullPointerException
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.createPhantomJSCommand(PhantomJSDocumentFetcher.java:1030)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchPhantomJSDocument(PhantomJSDocumentFetcher.java:799)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchDocument(PhantomJSDocumentFetcher.java:773)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.base/java.lang.Thread.run(Thread.java:844)

Can you help me understand what I'm not using correctly here?

essiembre commented 6 years ago

Which version are you using? If your log file is not too big, can you attach it? Also, can you try using absolute paths for the exePath and scriptPath tags?
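As a small sketch of the absolute-path suggestion (Python, for illustration only): a relative path is only meaningful against some base directory, so spelling the path out removes any dependence on where the process was started. The install directory below is hypothetical, and whether the collector resolves these tags against its working directory is an assumption, not something stated in this thread.

```python
from pathlib import PurePosixPath


def to_absolute(raw_path: str, base_dir: str) -> str:
    """Return raw_path unchanged if absolute, else join it onto base_dir."""
    path = PurePosixPath(raw_path)
    if path.is_absolute():
        return str(path)
    return str(PurePosixPath(base_dir) / path)


# Hypothetical install directory, used only for this example.
install_dir = "/opt/norconex-collector-http"

# The two relative values from the config above.
exe_path = to_absolute("phantomjs-2.1.1-macosx/bin/phantomjs", install_dir)
script_path = to_absolute("scripts/phantom.js", install_dir)
print(exe_path)
print(script_path)
```

The printed values are what would go into the exePath and scriptPath tags in place of the relative ones.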

tdrobcsak commented 6 years ago

Here is the attached log file.

Version: norconex-collector-http-2.8.0

I added the full paths, but the result is the same.

Expert_32_Page32.log

essiembre commented 6 years ago

Please try the snapshot version of HTTP Collector. It fixes that NullPointerException.

If you do not want to upgrade for some reason, a workaround is to specify <screenshotDimensions> under the PhantomJSDocumentFetcher with an arbitrary value. That should also get rid of the exception.

Please confirm.

tdrobcsak commented 6 years ago

Thanks Pascal

I added the <screenshotDimensions>"2560 x 1600"</screenshotDimensions> tag, and it did get rid of the exception; however, it wasn't able to fetch the dynamically generated data. It seems that the render wait time does not actually kick in.

See my config file

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="SzalonExpert.hu Collector">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlerDefaults>

    <robotsTxt ignore="true" />

    <startURLs stayOnDomain="true">
       <urlsFile>examples/Loreal-Fodraszat/webpage-list</urlsFile> 
       <!--<url>https://loc.salon-expert.hu/#adr_budapest;0,0,0,0,0,0,0,0 </url> -->
    </startURLs>
    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
      <normalizations>
        lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort,
        encodeNonURICharacters
      </normalizations>
    </urlNormalizer>
    <!--<urlNormalizer class="$urlNormalizer" />-->
    <numThreads>1</numThreads>
    <maxDepth>4</maxDepth>
    <workDir>$workdir</workDir>
   <!-- <orphansStrategy>DELETE</orphansStrategy>-->

     <!--<sitemapResolverFactory ignore="false" />-->

    <referenceFilters>
        <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js</filter>
        <!--<filter class="$filterRegexRef" onMatch="include">https://loc.salon-expert.hu/#sal_\d+</filter> -->
    </referenceFilters>

    <!--<documentFetcher detectContentType="true" detectCharset="true"/> -->
    <!--<documentFilters>
            <filter class="$filterRegexRef" onMatch="include">https://loc.salon-expert.hu/#sal_\d+</filter> 
    </documentFilters> --> 

  </crawlerDefaults>

  <crawlers>

    <crawler id="Expert Page ">
      <robotsTxt ignore="true" />
      <keepDownloads>true</keepDownloads>

      <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher"
          detectContentType="true" screenshotEnabled="true">
        <exePath>/Users/teoodordrobcsak/Downloads/norconex-collector-http-2.8.0/phantomjs-2.1.1-macosx/bin/phantomjs</exePath>
        <scriptPath>/Users/teoodordrobcsak/Downloads/norconex-collector-http-2.8.0/scripts/phantom.js</scriptPath>
        <renderWaitTime>5000</renderWaitTime>
        <referencePattern>^https://loc.salon-expert.hu/#sal.*</referencePattern>
        <screenshotDimensions>"2560 x 1600"</screenshotDimensions>
        <validStatusCodes>200</validStatusCodes>
        <notFoundStatusCodes>404</notFoundStatusCodes>
      </documentFetcher>
      #parse("shared/importer-config.xml")
      <committer class="com.norconex.committer.core.impl.MultiCommitter">
        <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
          <directory>${workdir}/crawledFilesMETA</directory>
        </committer>
        <committer class="com.norconex.committer.core.impl.XMLFileCommitter">
          <directory>${workdir}/crawledFilesXML</directory>
          <docsPerFile>1</docsPerFile>
          <pretty>true</pretty>
          <splitAddDelete>false</splitAddDelete>
        </committer>
      </committer>
    </crawler>

  </crawlers>

</httpcollector>

essiembre commented 6 years ago

Do you get any errors? There is at least one other fix to the PhantomJSDocumentFetcher in the snapshot release. Please give it a try.

tdrobcsak commented 6 years ago

Hi Pascal

Thanks for trying to help. I used the SNAPSHOT release and noticed two interesting things:

1) If I kept the <screenshotDimensions> tag, I had the same result as above: it doesn't throw any exceptions, but it still does not fetch any dynamically generated data.
2) When I removed it, since the bugfix was said to fix the NullPointerException, it did fetch the dynamically created data; however, at the beginning it throws some exceptions. Logs are also attached!

ERROR [PhantomJSDocumentFetcher] PhantomJS: https://stats.g.doubleclick.net/r/collect?v=1&aip=1&t=dc&_r=3&tid=UA-62480304-3&cid=561317613.1524634161&jid=1698485650&_gid=1280410387.1524773794&gjid=109555892&_v=j67&z=260275232: Operation canceled
Caught and handled this exception :
java.lang.NumberFormatException: For input string: ""
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:662)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at com.github.jaiimageio.impl.common.ImageUtil.processOnRegistration(ImageUtil.java:1401)
    at com.github.jaiimageio.impl.plugins.wbmp.WBMPImageReaderSpi.onRegistration(WBMPImageReaderSpi.java:96)
    at java.desktop/javax.imageio.spi.SubRegistry.registerServiceProvider(ServiceRegistry.java:788)
    at java.desktop/javax.imageio.spi.ServiceRegistry.registerServiceProvider(ServiceRegistry.java:330)
    at java.desktop/javax.imageio.spi.IIORegistry.registerApplicationClasspathSpis(IIORegistry.java:212)
    at java.desktop/javax.imageio.spi.IIORegistry.<init>(IIORegistry.java:136)
    at java.desktop/javax.imageio.spi.IIORegistry.getDefaultInstance(IIORegistry.java:157)
    at java.desktop/javax.imageio.ImageIO.<clinit>(ImageIO.java:66)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.handleScreenshot(PhantomJSDocumentFetcher.java:889)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchPhantomJSDocument(PhantomJSDocumentFetcher.java:842)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchDocument(PhantomJSDocumentFetcher.java:773)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:815)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caught and handled this exception :
java.lang.NumberFormatException: For input string: ""
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:662)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at com.github.jaiimageio.impl.common.ImageUtil.processOnRegistration(ImageUtil.java:1401)
    at com.github.jaiimageio.impl.plugins.bmp.BMPImageReaderSpi.onRegistration(BMPImageReaderSpi.java:97)
    at java.desktop/javax.imageio.spi.SubRegistry.registerServiceProvider(ServiceRegistry.java:788)
    at java.desktop/javax.imageio.spi.ServiceRegistry.registerServiceProvider(ServiceRegistry.java:330)
    at java.desktop/javax.imageio.spi.IIORegistry.registerApplicationClasspathSpis(IIORegistry.java:212)
    at java.desktop/javax.imageio.spi.IIORegistry.<init>(IIORegistry.java:136)
    at java.desktop/javax.imageio.spi.IIORegistry.getDefaultInstance(IIORegistry.java:157)
    at java.desktop/javax.imageio.ImageIO.<clinit>(ImageIO.java:66)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.handleScreenshot(PhantomJSDocumentFetcher.java:889)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchPhantomJSDocument(PhantomJSDocumentFetcher.java:842)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchDocument(PhantomJSDocumentFetcher.java:773)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:815)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caught and handled this exception :
java.lang.NumberFormatException: For input string: ""
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:662)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at com.github.jaiimageio.impl.common.ImageUtil.processOnRegistration(ImageUtil.java:1401)
    at com.github.jaiimageio.impl.plugins.wbmp.WBMPImageWriterSpi.onRegistration(WBMPImageWriterSpi.java:103)
    at java.desktop/javax.imageio.spi.SubRegistry.registerServiceProvider(ServiceRegistry.java:788)
    at java.desktop/javax.imageio.spi.ServiceRegistry.registerServiceProvider(ServiceRegistry.java:330)
    at java.desktop/javax.imageio.spi.IIORegistry.registerApplicationClasspathSpis(IIORegistry.java:212)
    at java.desktop/javax.imageio.spi.IIORegistry.<init>(IIORegistry.java:136)
    at java.desktop/javax.imageio.spi.IIORegistry.getDefaultInstance(IIORegistry.java:157)
    at java.desktop/javax.imageio.ImageIO.<clinit>(ImageIO.java:66)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.handleScreenshot(PhantomJSDocumentFetcher.java:889)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchPhantomJSDocument(PhantomJSDocumentFetcher.java:842)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchDocument(PhantomJSDocumentFetcher.java:773)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:815)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caught and handled this exception :
java.lang.NumberFormatException: For input string: ""
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:662)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at com.github.jaiimageio.impl.common.ImageUtil.processOnRegistration(ImageUtil.java:1401)
    at com.github.jaiimageio.impl.plugins.bmp.BMPImageWriterSpi.onRegistration(BMPImageWriterSpi.java:105)
    at java.desktop/javax.imageio.spi.SubRegistry.registerServiceProvider(ServiceRegistry.java:788)
    at java.desktop/javax.imageio.spi.ServiceRegistry.registerServiceProvider(ServiceRegistry.java:330)
    at java.desktop/javax.imageio.spi.IIORegistry.registerApplicationClasspathSpis(IIORegistry.java:212)
    at java.desktop/javax.imageio.spi.IIORegistry.<init>(IIORegistry.java:136)
    at java.desktop/javax.imageio.spi.IIORegistry.getDefaultInstance(IIORegistry.java:157)
    at java.desktop/javax.imageio.ImageIO.<clinit>(ImageIO.java:66)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.handleScreenshot(PhantomJSDocumentFetcher.java:889)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchPhantomJSDocument(PhantomJSDocumentFetcher.java:842)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchDocument(PhantomJSDocumentFetcher.java:773)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:815)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caught and handled this exception :
java.lang.NumberFormatException: For input string: ""
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:662)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at com.github.jaiimageio.impl.common.ImageUtil.processOnRegistration(ImageUtil.java:1401)
    at com.github.jaiimageio.impl.plugins.gif.GIFImageWriterSpi.onRegistration(GIFImageWriterSpi.java:140)
    at java.desktop/javax.imageio.spi.SubRegistry.registerServiceProvider(ServiceRegistry.java:788)
    at java.desktop/javax.imageio.spi.ServiceRegistry.registerServiceProvider(ServiceRegistry.java:330)
    at java.desktop/javax.imageio.spi.IIORegistry.registerApplicationClasspathSpis(IIORegistry.java:212)
    at java.desktop/javax.imageio.spi.IIORegistry.<init>(IIORegistry.java:136)
    at java.desktop/javax.imageio.spi.IIORegistry.getDefaultInstance(IIORegistry.java:157)
    at java.desktop/javax.imageio.ImageIO.<clinit>(ImageIO.java:66)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.handleScreenshot(PhantomJSDocumentFetcher.java:889)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchPhantomJSDocument(PhantomJSDocumentFetcher.java:842)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchDocument(PhantomJSDocumentFetcher.java:773)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:815)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.base/java.lang.Thread.run(Thread.java:844)

Expert_32_Page32.log

essiembre commented 6 years ago

I cannot reproduce. I tried with your config (the part you shared) and you can see from the attached file the dynamic content was extracted properly for https://loc.salon-expert.hu/#sal_1: 2018-04-26T08-55-43-149_1.zip

tdrobcsak commented 6 years ago

Interesting. For me the exception still kicks in. However, I also noticed that on my Mac, the first time I run it, a Java window pops up and shows this message:

/Library/Java/JavaVirtualMachines/jdk-9.0.4.jdk/Contents/Home/bin/java ; exit;
Teodors-MacBook-Pro:~ teoodordrobcsak$ /Library/Java/JavaVirtualMachines/jdk-9.0.4.jdk/Contents/Home/bin/java ; exit;
Usage: java [options] <mainclass> [args...]
           (to execute a class)
   or  java [options] -jar <jarfile> [args...]
           (to execute a jar file)
   or  java [options] -m <module>[/<mainclass>] [args...]
       java [options] --module <module>[/<mainclass>] [args...]
           (to execute the main class in a module)

 Arguments following the main class, -jar <jarfile>, -m or --module
 <module>/<mainclass> are passed as the arguments to main class.

 where options include:

    -d32      Deprecated, will be removed in a future release
    -d64      Deprecated, will be removed in a future release
    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
    --class-path <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives,
                  and ZIP archives to search for class files.
    -p <module path>
    --module-path <module path>...
                  A : separated list of directories, each directory
                  is a directory of modules.
    --upgrade-module-path <module path>...
                  A : separated list of directories, each directory
                  is a directory of modules that replace upgradeable
                  modules in the runtime image
    --add-modules <module name>[,<module name>...]
                  root modules to resolve in addition to the initial module.
                  <module name> can also be ALL-DEFAULT, ALL-SYSTEM,
                  ALL-MODULE-PATH.
    --list-modules
                  list observable modules and exit
    -d <module name>
    --describe-module <module name>
                  describe a module and exit
    --dry-run     create VM and load main class but do not execute main method.
                  The --dry-run option may be useful for validating the
                  command-line options such as the module system configuration.
    --validate-modules
                  validate all modules and exit
                  The --validate-modules option may be useful for finding
                  conflicts and other errors with modules on the module path.
    -D<name>=<value>
                  set a system property
    -verbose:[class|module|gc|jni]
                  enable verbose output
    -version      print product version to the error stream and exit
    --version     print product version to the output stream and exit
    -showversion  print product version to the error stream and continue
    --show-version
                  print product version to the output stream and continue
    --show-module-resolution
                  show module resolution output during startup
    -? -h -help
                  print this help message to the error stream
    --help        print this help message to the output stream
    -X            print help on extra options to the error stream
    --help-extra  print help on extra options to the output stream
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:jdwp
                  see also -agentlib:jdwp=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image
                  HiDPI scaled images are automatically supported and used
                  if available. The unscaled image filename, e.g. image.ext,
                  should always be passed as the argument to the -splash option.
                  The most appropriate scaled image provided will be picked up
                  automatically.
                  See the SplashScreen API documentation for more information
    @argument files
                  one or more argument files containing options
    -disable-@files
                  prevent further argument file expansion
To specify an argument for a long option, you can use --<name>=<value> or
--<name> <value>.

logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.
Deleting expired sessions...59 completed.

tdrobcsak commented 6 years ago

I was wondering whether this is some kind of Java compatibility issue?

tdrobcsak commented 6 years ago

BTW, thanks. I was able to capture the dynamically created content with this SNAPSHOT release, despite this exception being thrown.

essiembre commented 6 years ago

Given I cannot reproduce but you get the content now, shall we close?

Does the Java error you are getting occur when starting the HTTP Collector with the launch shell script? If so, maybe the script needs to be modified to run on your Mac? Have you tried with Java 8? I wonder if something behaves differently with Java 9.
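One way a Java 8 versus Java 9 difference can surface, sketched in Python for illustration: Java 8 reports its `java.specification.version` as `1.8`, while Java 9 reports `9`. Code that assumes the older `1.x` form can end up parsing an empty string, which would look like the `NumberFormatException: For input string: ""` shown earlier in the thread. The function below is hypothetical; that the image I/O library in those traces parses the version this way is an assumption, not something confirmed here.

```python
def legacy_minor_version(spec_version: str) -> int:
    """Hypothetical parser that assumes a "1.x" style version string."""
    # str.find returns -1 when "1." is absent, so start becomes 1 and the
    # slice below is empty for a single-character version such as "9".
    start = spec_version.find("1.") + 2
    return int(spec_version[start:])


print(legacy_minor_version("1.8"))  # Java 8 style version string

try:
    legacy_minor_version("9")  # Java 9 style version string
except ValueError as err:
    print("parse failed:", err)
```

If this is what is happening, the messages would be noise from library registration rather than a failure of the fetch itself, which is consistent with the content being captured despite them.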

essiembre commented 6 years ago

Closing for lack of feedback. Feel free to reopen if the problem persists and you have more details.