Does it appear as rejected in your logs? I just tried to reproduce it with the Filesystem Committer and the file was committed. Could it be your target repository that rejects blank documents? A copy of your config may help.
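For reference, a minimal Filesystem Committer entry for such a reproduction could look like the sketch below (the output directory is only an illustrative path):
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <!-- Committed additions/deletions are written as files under this directory. -->
  <directory>./committed-docs</directory>
</committer>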
<crawlerDefaults>
<httpClientFactory>
<headers>
<header name="Accept">*/*</header>
</headers>
</httpClientFactory>
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
default="1500" scope="site" />
<urlNormalizer
class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
<normalizations>
removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
decodeUnreservedCharacters, removeDefaultPort,
encodeNonURICharacters, removeTrailingSlash
</normalizations>
</urlNormalizer>
<numThreads>10</numThreads>
<maxDepth>3</maxDepth>
<workDir>$workdir</workDir>
<robotsTxt ignore="true" />
<robotsMeta ignore="true" />
<orphansStrategy>DELETE</orphansStrategy>
<!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
<!-- Before 2.3.0: -->
<sitemap ignore="true" />
<!-- Since 2.3.0: -->
<sitemapResolverFactory ignore="true" />
<referenceFilters>
<filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,jpeg</filter>
</referenceFilters>
<recrawlableResolver
class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
sitemapSupport="never">
<minFrequency applyTo="reference" value="900000">.*</minFrequency>
</recrawlableResolver>
<linkExtractors>
<extractor
class="com.norconex.collector.http.url.impl.GenericLinkExtractor"
keepReferrerData="true">
<tags>
<tag name="a" attribute="href" />
<tag name="frame" attribute="src" />
<tag name="script" attribute="src" />
<tag name="link" attribute="href" />
<tag name="iframe" attribute="src" />
<tag name="meta" attribute="http-equiv" />
<tag name="embed" attribute="src" />
<tag name="object" attribute="classid" />
<tag name="object" attribute="codebase" />
<tag name="applet" attribute="code" />
<tag name="applet" attribute="classid" />
<tag name="base" attribute="href" />
<tag name="body" attribute="background" />
<tag name="area" attribute="href" />
</tags>
</extractor>
</linkExtractors>
<redirectURLProvider
class="com.norconex.collector.http.redirect.impl.GenericRedirectURLProvider"
fallbackCharset="utf-8" />
<metadataFilters>
<filter class="$filterRegexMeta" onMatch="exclude"
caseSensitive="false" field="Content-Type">.*css.*</filter>
</metadataFilters>
<importer>
<preParseHandlers>
<tagger
class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
<pattern field="html_source">.*</pattern>
<restrictTo field="document.contentType">text/html</restrictTo>
</tagger>
<filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
<script><![CDATA[
isFromIFrame = metadata.getString('collector.referrer-link-tag') == 'iframe.src';
depth = metadata.getInt('collector.depth');
/*return*/ (depth < 5 || isFromIFrame);
]]></script>
</filter>
</preParseHandlers>
</importer>
</crawlerDefaults>
Page Structure:
<main> =iframe|script==> <subpage>
Response:
Connection: close
Content-Length: 0
Content-Type: text/html
Date: Fri, 25 Nov 2016 00:21:35 GMT
Server: Apache
Log:
2016-10-10 13:25:55,504 [pool-16-thread-9] INFO CrawlerEvent.REJECTED_IMPORT - REJECTED_IMPORT: http://log.tryweb.kr/PV_file/pv_insert.asp?site_name=test.com
In the log4j.properties file, if you change the log level for REJECTED_IMPORT to DEBUG, what do you get? It should tell you what rejected it.
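For example, assuming the stock log4j.properties that ships with the collector, the change would look something like this:
# Turn on detailed logging for import rejections.
log4j.logger.CrawlerEvent.REJECTED_IMPORT=DEBUG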
Deleted code:
<filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
<script><![CDATA[
isFromIFrame = metadata.getString('collector.referrer-link-tag') == 'iframe.src';
depth = metadata.getInt('collector.depth');
/*return*/ (depth < 5 || isFromIFrame);
]]></script>
</filter>
And... it is working fine now! My guess is that the script was causing some errors.
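For reference, a more defensive variant of the deleted filter (a sketch, assuming the default JavaScript engine and the same collector fields) would guard against fields that can be absent, such as the referrer tag on start URLs:
<filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
  <script><![CDATA[
      // Documents without a referrer (e.g. start URLs) may have no
      // 'collector.referrer-link-tag' value, so guard against null.
      var refTag = metadata.getString('collector.referrer-link-tag');
      var isFromIFrame = refTag != null && refTag == 'iframe.src';
      var depthStr = metadata.getString('collector.depth');
      var depth = depthStr != null ? parseInt(depthStr, 10) : 0;
      /*return*/ (depth < 5 || isFromIFrame);
  ]]></script>
</filter>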
Glad you have it working. I am closing but please re-open if you do see issues with zero-length documents.
I suggest you add a DebugTagger before your script to print the field values you are using in your script. That may help you troubleshoot why your document is rejected.
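For instance, something along these lines could be placed just before the ScriptFilter in the preParseHandlers (the logFields and logLevel attributes are based on Norconex Importer 2.x):
<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
    logFields="collector.referrer-link-tag,collector.depth,Content-Length"
    logLevel="INFO" />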
Hello :). Thank you.
It seems my crawler rejects zero-length pages at the import stage, but I want to collect them and have them show up in the committed results.
How can I handle this?
By using NumericMetadataFilter and EmptyMetadataFilter, with onMatch="include", field="Content-Length", equal to 0 or empty (see the sketch below)?
I tried that, but it didn't work.
Thank you :)
Best Regards.
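For reference, the kind of importer filtering described above might look like this sketch (class names and attributes are assumptions based on Norconex Importer 2.x; keep in mind that once any onMatch="include" filter is declared, documents must match at least one include filter to be kept, which may not be the intended effect here):
<importer>
  <preParseHandlers>
    <!-- Keep documents whose Content-Length is exactly 0... -->
    <filter class="com.norconex.importer.handler.filter.impl.NumericMetadataFilter"
        onMatch="include" field="Content-Length">
      <condition operator="eq" number="0" />
    </filter>
    <!-- ...or whose Content-Length field is missing or empty. -->
    <filter class="com.norconex.importer.handler.filter.impl.EmptyMetadataFilter"
        onMatch="include" fields="Content-Length" />
  </preParseHandlers>
</importer>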