Can you please provide a config file that reproduces the issue?
Hi there, thank you for your reply.
This is my config. The error occurs with HTML pages of roughly 2MB of content or more.
Please note that I'm using Java 8, because with Java 11 the chromedriver always failed; I don't know why.
<?xml version="1.0" encoding="UTF-8"?>
#set($Domain = "www.sirup.com")
#set($dataDirRoot = "/home/spider/data-link-checker/v3")
#set($maxDepth = "-1")
#set($maxDocuments = "70000")
#set($delay = "0")
#set($numThreads = 5)
#set($ignoreRobotsCrawlDelay = "true")
#set($ignoreRobotsTXT = "false")
<httpcollector id="config-id">
  <workDir>${dataDirRoot}</workDir>
  <crawlers>
    <crawler id="crawler-id">
      <httpFetchers>
        <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
          <browser>chrome</browser>
          <httpSniffer>
            <userAgent>spider</userAgent>
          </httpSniffer>
        </fetcher>
      </httpFetchers>
      <eventListeners>
        <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
          <statusCodes>104-199,201-299,306,309-999</statusCodes>
          <outputDir>${dataDirRoot}/${Domain}/</outputDir>
          <fileNamePrefix>brokenLinks</fileNamePrefix>
        </listener>
        <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
          <statusCodes>100-103,200,300-305,307,308</statusCodes>
          <outputDir>${dataDirRoot}/${Domain}/</outputDir>
          <fileNamePrefix>successLinks</fileNamePrefix>
        </listener>
      </eventListeners>
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true" includeSubdomains="false">
        <url>https://${Domain}/</url>
      </startURLs>
      <urlNormalizer class="GenericURLNormalizer">
        <normalizations>removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
            decodeUnreservedCharacters, removeDefaultPort,
            encodeNonURICharacters</normalizations>
      </urlNormalizer>
      <delay default="${delay}" ignoreRobotsCrawlDelay="${ignoreRobotsCrawlDelay}" />
      <numThreads>${numThreads}</numThreads>
      <maxDepth>${maxDepth}</maxDepth>
      <maxDocuments>${maxDocuments}</maxDocuments>
      <orphansStrategy>PROCESS</orphansStrategy>
      <robotsTxt ignore="${ignoreRobotsTXT}" />
      <sitemapResolver ignore="false" />
      <canonicalLinkDetector ignore="false" />
      <referenceFilters>
        <filter class="RegexReferenceFilter" onMatch="exclude">(.*/login/.*|.*\.(?i:jpe?g|png|ico|gif|webp|bmp|svg)([\?\#].*)?$)</filter>
      </referenceFilters>
      <importer>
        <postParseHandlers />
      </importer>
      <committers />
    </crawler>
  </crawlers>
</httpcollector>
I was able to reproduce it, and I made a new HTTP Collector snapshot release with a solution. It turns out it only happens when <httpSniffer> is used. The third-party implementation behind it needs to buffer the content in memory and has a default limit of 2MB. The new snapshot increases that default to 10MB and adds a new <maxBufferSize> configuration option to specify a custom maximum size. You can set it up like this:
<httpFetchers>
  <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
    <browser>chrome</browser>
    <httpSniffer>
      <userAgent>spider</userAgent>
      <maxBufferSize>100MB</maxBufferSize>
    </httpSniffer>
  </fetcher>
</httpFetchers>
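In case you configure the collector in Java rather than XML, a minimal sketch of the same setup might look like this. Note that this is an assumption on my part: the class and setter names (WebDriverHttpFetcherConfig, HttpSnifferConfig, setMaxBufferSize, and the bytes-based int argument) are inferred from how the XML options are usually mirrored in the v3 API, so verify them against the snapshot's Javadoc:

import com.norconex.collector.http.fetch.impl.webdriver.Browser;
import com.norconex.collector.http.fetch.impl.webdriver.HttpSnifferConfig;
import com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher;
import com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcherConfig;

public class FetcherSetupSketch {
    public static WebDriverHttpFetcher buildFetcher() {
        // Assumed equivalents of <browser> and <httpSniffer> in the XML above.
        WebDriverHttpFetcherConfig cfg = new WebDriverHttpFetcherConfig();
        cfg.setBrowser(Browser.CHROME);

        HttpSnifferConfig sniffer = new HttpSnifferConfig();
        sniffer.setUserAgent("spider");
        // Assumed setter for the new <maxBufferSize> option; value in bytes.
        sniffer.setMaxBufferSize(100 * 1024 * 1024); // 100 MB
        cfg.setHttpSnifferConfig(sniffer);

        return new WebDriverHttpFetcher(cfg);
    }
}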
Please give it a try and confirm.
Dear Essiembre, thank you for your response. I tested it and got this error:
2 XML configuration errors detected:
[XML] StartCommand: cvc-datatype-valid.1.2.1: '100MB' is not a valid value for 'integer'.
[XML] StartCommand: cvc-type.3.1.3: The value '100MB' of element 'maxBufferSize' is not valid.
OK, so I tested it again, but with a byte value, i.e. <maxBufferSize>209715200</maxBufferSize> (200MB in bytes).
This triggers no config error, but the byte value (200MB) seems to be ignored, because I still get the following error:
04:32:30.362 [LittleProxy-0-ProxyToServerWorker-5] ERROR ProxyToServerConnection - (AWAITING_INITIAL) [id: 0x89c23f10, L:/172.17.0.10:39044 - R:www.sirup.com/178.63.60.198:443]: Caught an exception on ProxyToServerConnection
io.netty.handler.codec.TooLongFrameException: HTTP content length exceeded 2097152 bytes.
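For context on where that 2097152 comes from: the sniffer embeds LittleProxy, which aggregates HTTP messages via Netty's HttpObjectAggregator, and that class throws exactly this TooLongFrameException when a message exceeds its maxContentLength. A minimal sketch follows; the class name and constants are mine for illustration, and the LittleProxy wiring is an assumption based on the exception message:

import io.netty.handler.codec.http.HttpObjectAggregator;

public class SnifferBufferSketch {
    public static void main(String[] args) {
        // 2 MiB: the old default cap, matching "2097152 bytes" in the log.
        int oldDefault = 2 * 1024 * 1024;
        // 10 MiB: the new snapshot default mentioned above.
        int newDefault = 10 * 1024 * 1024;

        // Netty aggregates an HTTP message up to maxContentLength; anything
        // larger makes it throw io.netty.handler.codec.TooLongFrameException
        // ("HTTP content length exceeded ... bytes").
        HttpObjectAggregator aggregator = new HttpObjectAggregator(newDefault);
        System.out.println("old cap: " + oldDefault + ", new cap: " + newDefault);
    }
}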
OK, then I did not set <maxBufferSize> at all, and it seems the default value of 10MB is being used.
Any idea?
@Essiembre ...
Hmm, it works fine now. I don't know why the errors above no longer happen; I think I made a mistake somewhere.
So, thank you very much. It works as expected.
Hi there,
I'm using version 3 (for testing) with Google Chrome as the HTTP fetcher.
Everything works fine, but when documents exceed 2097152 bytes I get errors (full log below):
io.netty.handler.codec.TooLongFrameException: HTTP content length exceeded 2097152 bytes.
Any idea how to increase the allowed content length?
Errors: