Version 3: Error: HTTP content length exceeded 2097152 bytes

phpsyscoder commented 3 years ago

Hi There

I'm using Version 3 (for testing) with Google Chrome as the HTTP fetcher.

Everything works fine, but when documents exceed 2097152 bytes I get errors (full log below): io.netty.handler.codec.TooLongFrameException: HTTP content length exceeded 2097152 bytes.

Any idea how to increase the allowed content length?

Errors:

Starting ChromeDriver 90.0.4430.24 (4c6d850f087da467d926e8eddb76550aed655991-refs/branch-heads/4430@{#429}) on port 22159
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
ChromeDriver was started successfully.
Apr 21, 2021 6:43:11 AM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: W3C
06:43:11.955 [LittleProxy-0-ProxyToServerWorker-1] ERROR ProxyToServerConnection - (AWAITING_INITIAL) [id: 0x8b7c8fae, L:/172.17.0.12:35154 - R:www.example.com/xxx.xxx.30.114:443]: Caught an exception on ProxyToServerConnection
io.netty.handler.codec.TooLongFrameException: HTTP content length exceeded 2097152 bytes.
        at io.netty.handler.codec.http.HttpObjectAggregator.decode(HttpObjectAggregator.java:241) ~[netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.http.HttpObjectAggregator.decode(HttpObjectAggregator.java:89) ~[netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88) ~[netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at org.littleshoot.proxy.impl.ProxyConnection$BytesReadMonitor.channelRead(ProxyConnection.java:692) [littleproxy-1.1.0-beta-bmp-17.jar:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1296) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1087) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1122) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:491) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:430) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1302) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [netty-all-4.0.51.Final.jar:4.0.51.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
06:43:11.959 [LittleProxy-0-ProxyToServerWorker-1] INFO  ProxyToServerConnection - (AWAITING_INITIAL) [id: 0x8b7c8fae, L:/172.17.0.12:35154 - R:www.example.com/xxx.xxx.30.114:443]: Disconnecting open connection to server

essiembre commented 3 years ago

Can you please provide a config file that reproduces the issue?

phpsyscoder commented 3 years ago

Hi there... thank you for your reply.

This is my config. The error occurs with HTML pages of roughly 2MB or more.

Please note that I'm using Java 8 because with Java 11 the chromedriver always failed. I don't know why.

<?xml version="1.0" encoding="UTF-8"?>

#set($Domain = "www.sirup.com")
#set($dataDirRoot  = "/home/spider/data-link-checker/v3")
#set($maxDepth  = "-1")
#set($maxDocuments  = "70000")
#set($delay  = "0")
#set($numThreads  = 5)
#set($ignoreRobotsCrawlDelay  = "true")
#set($ignoreRobotsTXT  = "false")

<httpcollector id="config-id">
   <workDir>${dataDirRoot}</workDir>
   <crawlers>
      <crawler id="crawler-id">
         <httpFetchers>
            <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
               <browser>chrome</browser>
               <httpSniffer>
                  <userAgent>spider</userAgent>
               </httpSniffer>
            </fetcher>
         </httpFetchers>
         <eventListeners>
            <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
               <statusCodes>104-199,201-299,306,309-999</statusCodes>
               <outputDir>${dataDirRoot}/${Domain}/</outputDir>
               <fileNamePrefix>brokenLinks</fileNamePrefix>
            </listener>
            <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
               <statusCodes>100-103,200,300-305,307,308</statusCodes>
               <outputDir>${dataDirRoot}/${Domain}/</outputDir>
               <fileNamePrefix>successLinks</fileNamePrefix>
            </listener>
         </eventListeners>
         <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true" includeSubdomains="false">
            <url>https://${Domain}/</url>
         </startURLs>
         <urlNormalizer class="GenericURLNormalizer">
            <normalizations>removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
            decodeUnreservedCharacters, removeDefaultPort,
            encodeNonURICharacters</normalizations>
         </urlNormalizer>
         <delay default="${delay}" ignoreRobotsCrawlDelay="${ignoreRobotsCrawlDelay}" />
         <numThreads>${numThreads}</numThreads>
         <maxDepth>${maxDepth}</maxDepth>
         <maxDocuments>${maxDocuments}</maxDocuments>
         <orphansStrategy>PROCESS</orphansStrategy>
         <robotsTxt ignore="${ignoreRobotsTXT}" />
         <sitemapResolver ignore="false" />
         <canonicalLinkDetector ignore="false" />
         <referenceFilters>
            <filter class="RegexReferenceFilter" onMatch="exclude">(.*/login/.*|.*\.(?i:jpe?g|png|ico|gif|webp|bmp|svg)([\?\#].*)?$)</filter>
         </referenceFilters>
         <importer>
            <postParseHandlers />
         </importer>
         <committers />
      </crawler>
   </crawlers>
</httpcollector>

essiembre commented 3 years ago

I was able to reproduce the issue and made a new HTTP Collector snapshot release with a fix.

It turns out it only happens when <httpSniffer> is used. The third-party implementation it relies on needs to buffer the content in memory and has a default limit of 2MB.

The new snapshot increases that default to 10MB and adds a new <maxBufferSize> configuration option to specify a custom maximum size. You can set it up like this:

         <httpFetchers>
            <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
               <browser>chrome</browser>
               <httpSniffer>
                  <userAgent>spider</userAgent>
                  <maxBufferSize>100MB</maxBufferSize>
               </httpSniffer>
            </fetcher>
         </httpFetchers>

Please give it a try and confirm.
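
As a rough illustration of the buffering limit described above, here is a minimal plain-Java sketch. It is not the LittleProxy or Netty code from the stack trace: `CappedBuffer` and `fits` are hypothetical names, and the sketch only assumes that content is accumulated in memory and rejected once it passes a fixed cap.

```java
import java.io.ByteArrayOutputStream;

class BufferLimitSketch {

    /** Hypothetical stand-in for an in-memory aggregating buffer with a hard cap. */
    static final class CappedBuffer {
        private final int maxBytes;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        CappedBuffer(int maxBytes) {
            this.maxBytes = maxBytes;
        }

        /** Appends the first {@code length} bytes of the chunk, or fails if the cap would be exceeded. */
        void append(byte[] chunk, int length) {
            if ((long) buffer.size() + length > maxBytes) {
                throw new IllegalStateException(
                        "HTTP content length exceeded " + maxBytes + " bytes.");
            }
            buffer.write(chunk, 0, length);
        }
    }

    /** Returns true if a body of contentBytes can be buffered under the given cap. */
    static boolean fits(int maxBytes, int contentBytes) {
        CappedBuffer capped = new CappedBuffer(maxBytes);
        byte[] chunk = new byte[64 * 1024]; // content arrives in chunks, as on a network socket
        int remaining = contentBytes;
        try {
            while (remaining > 0) {
                int n = Math.min(chunk.length, remaining);
                capped.append(chunk, n);
                remaining -= n;
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int twoMiB = 2 * 1024 * 1024;   // 2097152, the limit seen in the log
        int tenMiB = 10 * 1024 * 1024;  // the new default mentioned above
        int body = 3 * 1024 * 1024;     // an example page larger than 2MB

        System.out.println(fits(twoMiB, body)); // prints "false"
        System.out.println(fits(tenMiB, body)); // prints "true"
    }
}
```

With a 2MB cap the 3MB body is rejected partway through, which matches the shape of the reported error; raising the cap lets the same body through at the cost of holding more content in memory.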

phpsyscoder commented 3 years ago

Dear Essiembre, thank you for your response. I tested it and got this error:

2 XML configuration errors detected:
[XML] StartCommand: cvc-datatype-valid.1.2.1: '100MB' is not a valid value for 'integer'.
[XML] StartCommand: cvc-type.3.1.3: The value '100MB' of element 'maxBufferSize' is not valid.

OK, so I tested it again, but with the value in bytes, i.e. <maxBufferSize>209715200</maxBufferSize> (200MB in bytes).

This no longer triggers a config error, but the value in bytes (200MB) seems to be ignored, because I still get the following error:

04:32:30.362 [LittleProxy-0-ProxyToServerWorker-5] ERROR ProxyToServerConnection - (AWAITING_INITIAL) [id: 0x89c23f10, L:/172.17.0.10:39044 - R:www.sirup.com/178.63.60.198:443]: Caught an exception on ProxyToServerConnection
io.netty.handler.codec.TooLongFrameException: HTTP content length exceeded 2097152 bytes.

OK... then I did not set <maxBufferSize> at all, and in that case it seems the default value of 10MB is being used.

Any idea?
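
For reference, the byte values quoted in this thread can be derived with a small helper. This is only a sketch: `mebibytesToBytes` is a hypothetical name, and it assumes that <maxBufferSize> is read as a plain byte count with 1MB = 1024 * 1024 bytes, which is consistent with the 2097152 figure in the log and with the schema rejecting '100MB' as a non-integer.

```java
class ByteSizes {

    /** Converts a size in MB (taken as 1024 * 1024 bytes) to a byte count. */
    static long mebibytesToBytes(long megabytes) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(megabytes, 1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println(mebibytesToBytes(2));   // prints 2097152   (limit in the error message)
        System.out.println(mebibytesToBytes(10));  // prints 10485760  (new default)
        System.out.println(mebibytesToBytes(100)); // prints 104857600
        System.out.println(mebibytesToBytes(200)); // prints 209715200 (value tried above)
    }
}
```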

phpsyscoder commented 3 years ago

@Essiembre ...

Hmm... it works fine now. I don't know why the errors above no longer occur; I think I made a mistake somewhere.

So... Thank you very much. Works as expected.

Norconex / crawlers

Version 3: Error: HTTP content length exceeded 2097152 bytes #751