Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Handling 0-length pages #313

Closed — popthink closed this issue 7 years ago

popthink commented 7 years ago

Hello :). Thank you.

It seems my crawler is rejecting 0-length pages at the import stage.

But I want to collect them and include them in the commit result.

How can I handle this?

By using NumericMetadataFilter with onMatch="include", field = Content-Length, and a condition matching 0 or empty?

I tried that, but it didn't work.
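For reference, the kind of rule described above could be sketched as an Importer pre-parse filter (a sketch only; the exact condition syntax may vary by version, and note two caveats: NumericMetadataFilter matches numeric values, so an absent or empty Content-Length field would likely not match an eq-0 condition, and onMatch="include" keeps only matching documents, which may not be the intent):

    <importer>
        <preParseHandlers>
            <filter class="com.norconex.importer.handler.filter.impl.NumericMetadataFilter"
                    onMatch="include" field="Content-Length">
                <condition operator="eq" number="0" />
            </filter>
        </preParseHandlers>
    </importer>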

Thank you :)

Best Regards.

essiembre commented 7 years ago

Does it appear rejected in your logs? Because I just tried to reproduce with the Filesystem Committer and the file was committed. Can it be your target repository that rejects blank docs? A copy of your config may help.

popthink commented 7 years ago
<crawlerDefaults>
        <httpClientFactory>
            <headers>
                <header name="Accept">*/*</header>
            </headers>
        </httpClientFactory>

        <delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
            default="1500" scope="site" />
        <urlNormalizer
            class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
            <normalizations>
                removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
                decodeUnreservedCharacters, removeDefaultPort,
                encodeNonURICharacters, removeTrailingSlash
            </normalizations>
        </urlNormalizer>
        <numThreads>10</numThreads>
        <maxDepth>3</maxDepth>
        <workDir>$workdir</workDir>
        <robotsTxt ignore="true" />
        <robotsMeta ignore="true" />
        <orphansStrategy>DELETE</orphansStrategy>

        <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
        <!-- Before 2.3.0: -->
        <sitemap ignore="true" />
        <!-- Since 2.3.0: -->
        <sitemapResolverFactory ignore="true" />

        <referenceFilters>
            <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,jpeg
            </filter>
        </referenceFilters>

        <recrawlableResolver
            class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
            sitemapSupport="never">
            <minFrequency applyTo="reference" value="900000">.*
            </minFrequency>
        </recrawlableResolver>

        <linkExtractors>
            <extractor
                class="com.norconex.collector.http.url.impl.GenericLinkExtractor"
                keepReferrerData="true">
                <tags>
                    <tag name="a" attribute="href" />
                    <tag name="frame" attribute="src" />
                    <tag name="script" attribute="src" />
                    <tag name="link" attribute="href" />
                    <tag name="iframe" attribute="src" />
                    <tag name="meta" attribute="http-equiv" />
                    <tag name="embed" attribute="src" />
                    <tag name="object" attribute="classid" />
                    <tag name="object" attribute="codebase" />
                    <tag name="applet" attribute="code" />
                    <tag name="applet" attribute="classid" />
                    <tag name="base" attribute="href" />
                    <tag name="body" attribute="background" />
                    <tag name="area" attribute="href" />
                </tags>
            </extractor>
        </linkExtractors>

        <redirectURLProvider
            class="com.norconex.collector.http.redirect.impl.GenericRedirectURLProvider"
            fallbackCharset="utf-8" />

        <metadataFilters>
            <filter class="$filterRegexMeta" onMatch="exclude"
                caseSensitive="false" field="Content-Type">.*css.*</filter>
        </metadataFilters>
        <importer>
            <preParseHandlers>
                <tagger
                    class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
                    <pattern field="html_source">.*</pattern>
                    <restrictTo field="document.contentType">text/html</restrictTo>
                </tagger>
                <filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
                    <script><![CDATA[
            isFromIFrame = metadata.getString('collector.referrer-link-tag') == 'iframe.src';
            depth = metadata.getInt('collector.depth');
            /*return*/ (depth < 5 || isFromIFrame);
      ]]></script>
                </filter>
            </preParseHandlers>
        </importer>
    </crawlerDefaults>
Page structure:
 <main> ==iframe|script==> <subpage>

Response headers:
 Connection: close
 Content-Length: 0
 Content-Type: text/html
 Date: Fri, 25 Nov 2016 00:21:35 GMT
 Server: Apache

Log:
2016-10-10 13:25:55,504 [pool-16-thread-9] INFO  CrawlerEvent.REJECTED_IMPORT -           REJECTED_IMPORT: http://log.tryweb.kr/PV_file/pv_insert.asp?site_name=test.com
essiembre commented 7 years ago

In the log4j.properties file, if you change the log level for REJECTED_IMPORT to DEBUG, what do you get? It should tell you what rejected it.
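The suggested change is a one-line edit (assuming the logger name matches the REJECTED_IMPORT entry already present in the distribution's log4j.properties):

    log4j.logger.CrawlerEvent.REJECTED_IMPORT=DEBUG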

popthink commented 7 years ago

Deleted code:

<filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
                    <script><![CDATA[
            isFromIFrame = metadata.getString('collector.referrer-link-tag') == 'iframe.src';
            depth = metadata.getInt('collector.depth');
            /*return*/ (depth < 5 || isFromIFrame);
      ]]></script>
                </filter>

And... it is working fine now! My guess is that the script was causing errors.
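One possible failure mode, offered as a guess: if collector.depth is missing on some documents, metadata.getInt(...) could fail and abort the script. A more defensive variant of the removed filter (a sketch, untested) would fetch the values as strings and guard against null:

    <filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
        <script><![CDATA[
            var tag = metadata.getString('collector.referrer-link-tag');
            var isFromIFrame = (tag != null && tag == 'iframe.src');
            var depthStr = metadata.getString('collector.depth');
            // Treat a missing depth as 0 rather than failing the script.
            var depth = (depthStr != null) ? parseInt(depthStr, 10) : 0;
            /*return*/ (depth < 5 || isFromIFrame);
        ]]></script>
    </filter>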

essiembre commented 7 years ago

Glad you have it working. I am closing but please re-open if you do see issues with zero-length documents.

I suggest you add a DebugTagger before your script to print the field values you are using in your script. That may help you troubleshoot why your document is rejected.
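A DebugTagger placed before the ScriptFilter could look like this (a sketch; the logFields/logLevel attribute names are assumed from the Importer 2.x documentation):

    <preParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
                logFields="collector.referrer-link-tag,collector.depth"
                logLevel="INFO" />
        <!-- ScriptFilter and other handlers follow -->
    </preParseHandlers>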