Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Handling 0-length pages #313

Closed — popthink closed this issue 7 years ago

popthink commented 7 years ago

Hello :). Thank you.

It seems my crawler is rejecting 0-length pages at the import stage.

But I want to collect them and include them in the commit result.

How can I handle this?

By using NumericMetadataFilter with onMatch="include", field = Content-Length, and a condition matching 0 or empty?

I tried that, but it didn't work.
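For reference, the kind of rule described above could be sketched as an Importer pre-parse filter (a sketch only; the exact condition syntax may vary by version, and note two caveats: NumericMetadataFilter matches numeric values, so an absent or empty Content-Length field would likely not match an eq-0 condition, and onMatch="include" keeps only matching documents, which may not be the intent):

    <importer>
        <preParseHandlers>
            <filter class="com.norconex.importer.handler.filter.impl.NumericMetadataFilter"
                    onMatch="include" field="Content-Length">
                <condition operator="eq" number="0" />
            </filter>
        </preParseHandlers>
    </importer>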

Thank you :)

Best Regards.

essiembre commented 7 years ago

Does it appear rejected in your logs? Because I just tried to reproduce with the Filesystem Committer and the file was committed. Can it be your target repository that rejects blank docs? A copy of your config may help.

popthink commented 7 years ago
<crawlerDefaults>
        <httpClientFactory>
            <headers>
                <header name="Accept">*/*</header>
            </headers>
        </httpClientFactory>

        <delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
            default="1500" scope="site" />
        <urlNormalizer
            class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
            <normalizations>
                removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
                decodeUnreservedCharacters, removeDefaultPort,
                encodeNonURICharacters, removeTrailingSlash
            </normalizations>
        </urlNormalizer>
        <numThreads>10</numThreads>
        <maxDepth>3</maxDepth>
        <workDir>$workdir</workDir>
        <robotsTxt ignore="true" />
        <robotsMeta ignore="true" />
        <orphansStrategy>DELETE</orphansStrategy>

        <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
        <!-- Before 2.3.0: -->
        <sitemap ignore="true" />
        <!-- Since 2.3.0: -->
        <sitemapResolverFactory ignore="true" />

        <referenceFilters>
            <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,jpeg
            </filter>
        </referenceFilters>

        <recrawlableResolver
            class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
            sitemapSupport="never">
            <minFrequency applyTo="reference" value="900000">.*
            </minFrequency>
        </recrawlableResolver>

        <linkExtractors>
            <extractor
                class="com.norconex.collector.http.url.impl.GenericLinkExtractor"
                keepReferrerData="true">
                <tags>
                    <tag name="a" attribute="href" />
                    <tag name="frame" attribute="src" />
                    <tag name="script" attribute="src" />
                    <tag name="link" attribute="href" />
                    <tag name="iframe" attribute="src" />
                    <tag name="meta" attribute="http-equiv" />
                    <tag name="embed" attribute="src" />
                    <tag name="object" attribute="classid" />
                    <tag name="object" attribute="codebase" />
                    <tag name="applet" attribute="code" />
                    <tag name="applet" attribute="classid" />
                    <tag name="base" attribute="href" />
                    <tag name="body" attribute="background" />
                    <tag name="area" attribute="href" />
                </tags>
            </extractor>
        </linkExtractors>

        <redirectURLProvider
            class="com.norconex.collector.http.redirect.impl.GenericRedirectURLProvider"
            fallbackCharset="utf-8" />

        <metadataFilters>
            <filter class="$filterRegexMeta" onMatch="exclude"
                caseSensitive="false" field="Content-Type">.*css.*</filter>
        </metadataFilters>
        <importer>
            <preParseHandlers>
                <tagger
                    class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
                    <pattern field="html_source">.*</pattern>
                    <restrictTo field="document.contentType">text/html</restrictTo>
                </tagger>
                <filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
                    <script><![CDATA[
            isFromIFrame = metadata.getString('collector.referrer-link-tag') == 'iframe.src';
            depth = metadata.getInt('collector.depth');
            /*return*/ (depth < 5 || isFromIFrame);
      ]]></script>
                </filter>
            </preParseHandlers>
        </importer>
    </crawlerDefaults>
Page structure:
 <main> ==iframe|script==> <subpage>

Response headers:
 Connection: close
 Content-Length: 0
 Content-Type: text/html
 Date: Fri, 25 Nov 2016 00:21:35 GMT
 Server: Apache

Log:
2016-10-10 13:25:55,504 [pool-16-thread-9] INFO  CrawlerEvent.REJECTED_IMPORT -           REJECTED_IMPORT: http://log.tryweb.kr/PV_file/pv_insert.asp?site_name=test.com
essiembre commented 7 years ago

In the log4j.properties file, if you change the log level for REJECTED_IMPORT to DEBUG, what do you get? It should tell you what rejected it.
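The suggested change is a one-line edit (assuming the logger name matches the REJECTED_IMPORT entry already present in the distribution's log4j.properties):

    log4j.logger.CrawlerEvent.REJECTED_IMPORT=DEBUG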

popthink commented 7 years ago

Deleted code:

<filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
                    <script><![CDATA[
            isFromIFrame = metadata.getString('collector.referrer-link-tag') == 'iframe.src';
            depth = metadata.getInt('collector.depth');
            /*return*/ (depth < 5 || isFromIFrame);
      ]]></script>
                </filter>

And... it is working fine now! My guess is that the script was causing errors.
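One possible failure mode, offered as a guess: if collector.depth is missing on some documents, metadata.getInt(...) could fail and abort the script. A more defensive variant of the removed filter (a sketch, untested) would fetch the values as strings and guard against null:

    <filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
        <script><![CDATA[
            var tag = metadata.getString('collector.referrer-link-tag');
            var isFromIFrame = (tag != null && tag == 'iframe.src');
            var depthStr = metadata.getString('collector.depth');
            // Treat a missing depth as 0 rather than failing the script.
            var depth = (depthStr != null) ? parseInt(depthStr, 10) : 0;
            /*return*/ (depth < 5 || isFromIFrame);
        ]]></script>
    </filter>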

essiembre commented 7 years ago

Glad you have it working. I am closing but please re-open if you do see issues with zero-length documents.

I suggest you add a DebugTagger before your script to print the field values you are using in your script. That may help you troubleshoot why your document is rejected.
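A DebugTagger placed before the ScriptFilter could look like this (a sketch; the logFields/logLevel attribute names are assumed from the Importer 2.x documentation):

    <preParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
                logFields="collector.referrer-link-tag,collector.depth"
                logLevel="INFO" />
        <!-- ScriptFilter and other handlers follow -->
    </preParseHandlers>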