Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Version 3: Error: HTTP content length exceeded 2097152 bytes #751

Closed phpsyscoder closed 3 years ago

phpsyscoder commented 3 years ago

Hi There

I am using Version 3 (for testing) with Google Chrome as the HTTP fetcher.

Everything works fine, but when a document exceeds 2097152 bytes I get errors (full log below): io.netty.handler.codec.TooLongFrameException: HTTP content length exceeded 2097152 bytes.

Any idea how to increase the allowed content length?

Errors:

Starting ChromeDriver 90.0.4430.24 (4c6d850f087da467d926e8eddb76550aed655991-refs/branch-heads/4430@{#429}) on port 22159
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
ChromeDriver was started successfully.
Apr 21, 2021 6:43:11 AM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: W3C
06:43:11.955 [LittleProxy-0-ProxyToServerWorker-1] ERROR ProxyToServerConnection - (AWAITING_INITIAL) [id: 0x8b7c8fae, L:/172.17.0.12:35154 - R:www.example.com/xxx.xxx.30.114:443]: Caught an exception on ProxyToServerConnection
io.netty.handler.codec.TooLongFrameException: HTTP content length exceeded 2097152 bytes.
        at io.netty.handler.codec.http.HttpObjectAggregator.decode(HttpObjectAggregator.java:241) ~[netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.http.HttpObjectAggregator.decode(HttpObjectAggregator.java:89) ~[netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88) ~[netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at org.littleshoot.proxy.impl.ProxyConnection$BytesReadMonitor.channelRead(ProxyConnection.java:692) [littleproxy-1.1.0-beta-bmp-17.jar:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1296) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1087) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1122) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:491) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:430) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1302) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
06:43:11.959 [LittleProxy-0-ProxyToServerWorker-1] INFO  ProxyToServerConnection - (AWAITING_INITIAL) [id: 0x8b7c8fae, L:/172.17.0.12:35154 - R:www.example.com/xxx.xxx.30.114:443]: Disconnecting open connection to server

essiembre commented 3 years ago

Can you please provide a config file that reproduces the issue?

phpsyscoder commented 3 years ago

Hi there, thank you for your reply.

This is my config. The error occurs with HTML pages of roughly 2MB or more.

Please note that I'm using Java 8 because with Java 11 the chromedriver always failed; I don't know why.

<?xml version="1.0" encoding="UTF-8"?>

#set($Domain = "www.sirup.com")
#set($dataDirRoot  = "/home/spider/data-link-checker/v3")
#set($maxDepth  = "-1")
#set($maxDocuments  = "70000")
#set($delay  = "0")
#set($numThreads  = 5)
#set($ignoreRobotsCrawlDelay  = "true")
#set($ignoreRobotsTXT  = "false")

<httpcollector id="config-id">
   <workDir>${dataDirRoot}</workDir>
   <crawlers>
      <crawler id="crawler-id">
         <httpFetchers>
            <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
               <browser>chrome</browser>
               <httpSniffer>
                  <userAgent>spider</userAgent>
               </httpSniffer>
            </fetcher>
         </httpFetchers>
         <eventListeners>
            <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
               <statusCodes>104-199,201-299,306,309-999</statusCodes>
               <outputDir>${dataDirRoot}/${Domain}/</outputDir>
               <fileNamePrefix>brokenLinks</fileNamePrefix>
            </listener>
            <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
               <statusCodes>100-103,200,300-305,307,308</statusCodes>
               <outputDir>${dataDirRoot}/${Domain}/</outputDir>
               <fileNamePrefix>successLinks</fileNamePrefix>
            </listener>
         </eventListeners>
         <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true" includeSubdomains="false">
            <url>https://${Domain}/</url>
         </startURLs>
         <urlNormalizer class="GenericURLNormalizer">
            <normalizations>removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
            decodeUnreservedCharacters, removeDefaultPort,
            encodeNonURICharacters</normalizations>
         </urlNormalizer>
         <delay default="${delay}" ignoreRobotsCrawlDelay="${ignoreRobotsCrawlDelay}" />
         <numThreads>${numThreads}</numThreads>
         <maxDepth>${maxDepth}</maxDepth>
         <maxDocuments>${maxDocuments}</maxDocuments>
         <orphansStrategy>PROCESS</orphansStrategy>
         <robotsTxt ignore="${ignoreRobotsTXT}" />
         <sitemapResolver ignore="false" />
         <canonicalLinkDetector ignore="false" />
         <referenceFilters>
            <filter class="RegexReferenceFilter" onMatch="exclude">(.*/login/.*|.*\.(?i:jpe?g|png|ico|gif|webp|bmp|svg)([\?\#].*)?$)</filter>
         </referenceFilters>
         <importer>
            <postParseHandlers />
         </importer>
         <committers />
      </crawler>
   </crawlers>
</httpcollector>

essiembre commented 3 years ago

I was able to reproduce and I made a new HTTP Collector snapshot release with a solution.

It turns out it only happens when <httpSniffer> is used. The third-party implementation used needs to buffer the content in memory and has a default of 2MB.

The new snapshot increases that default to 10MB and adds a new <maxBufferSize> configuration option to specify a custom maximum size. You can set it up like this:

         <httpFetchers>
            <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
               <browser>chrome</browser>
               <httpSniffer>
                  <userAgent>spider</userAgent>
                  <maxBufferSize>100MB</maxBufferSize>
               </httpSniffer>
            </fetcher>
         </httpFetchers>

Please give it a try and confirm.
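
Since the error only occurs when <httpSniffer> is used, one possible workaround on a release without this fix might be to leave the sniffer out. The fragment below is only a sketch derived from the config posted earlier in this thread, with the <httpSniffer> block removed. The trade-off is an assumption worth checking for your setup: without the sniffer, the custom userAgent set inside it is no longer applied.

```xml
<httpFetchers>
   <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
      <browser>chrome</browser>
      <!-- No <httpSniffer> here: responses are then not buffered by the
           third-party proxy, so its 2MB default should not apply.
           The "spider" user agent from the original config is lost. -->
   </fetcher>
</httpFetchers>
```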

phpsyscoder commented 3 years ago

Dear Essiembre, thank you for your response. I tested it and got this error:

2 XML configuration errors detected:
[XML] StartCommand: cvc-datatype-valid.1.2.1: '100MB' is not a valid value for 'integer'.
[XML] StartCommand: cvc-type.3.1.3: The value '100MB' of element 'maxBufferSize' is not valid.

OK, so I tested it again, but with the value in bytes, i.e. <maxBufferSize>209715200</maxBufferSize> (200MB in bytes).

This no longer triggers a config error, but the value in bytes (200MB) seems to be ignored, because I still get the following error:

04:32:30.362 [LittleProxy-0-ProxyToServerWorker-5] ERROR ProxyToServerConnection - (AWAITING_INITIAL) [id: 0x89c23f10, L:/172.17.0.10:39044 - R:www.sirup.com/178.63.60.198:443]: Caught an exception on ProxyToServerConnection
io.netty.handler.codec.TooLongFrameException: HTTP content length exceeded 2097152 bytes.

OK, then I did not set <maxBufferSize> at all, and in that case it seems the default value of 10MB is being used.

Any idea?

phpsyscoder commented 3 years ago

@Essiembre ...

Hmm, it works fine now. I don't know why the errors above no longer happen; I think I made a mistake somewhere.

So... Thank you very much. Works as expected.
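
For reference, the setup this thread converges on can be sketched as follows. It assumes, based on the validation error above, that the snapshot's schema accepts only a plain integer number of bytes for <maxBufferSize>. Whether other releases accept a unit suffix such as "100MB" is not established here. The value shown is just an example (100 * 1024 * 1024 bytes).

```xml
<httpFetchers>
   <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
      <browser>chrome</browser>
      <httpSniffer>
         <userAgent>spider</userAgent>
         <!-- Assumption: integer byte count, no unit suffix.
              104857600 = 100 * 1024 * 1024 (100MB). -->
         <maxBufferSize>104857600</maxBufferSize>
      </httpSniffer>
   </fetcher>
</httpFetchers>
```

If the "HTTP content length exceeded 2097152 bytes" message still appears after setting a value, it may be worth confirming that the crawler is actually running the new snapshot, since 2097152 bytes is the old 2MB default mentioned earlier in the thread.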