Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystems and storing it in various data repositories, such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Problems assigning correct "content" metadata from different file types. #630

Closed · douglasharley-westat closed this issue 5 years ago

douglasharley-westat commented 5 years ago

Hello,

Recently, one of the internal sites I crawl and index to Solr changed its implementation, and I cannot get my Norconex stack configuration working as desired. Specifically, the site's pages are now .aspx, with a bunch of garbage header and footer content on every page. That garbage is getting indexed and showing up in search results because the content metadata field contains everything, not just the contents of the div I am targeting. The site also has a bunch of other file types (.pdf, .docx, etc.) that need to be crawled, with the content extracted as per normal (i.e., everything in the doc); that part works fine. It's just the damned .aspx pages I cannot get right, and I've been banging my head against this for a few days without resolution: the .aspx docs always end up with the full page content rather than the targeted div's content, even though the "pageContentDiv" metadata field I am capturing has the desired correct data. So I thought I'd ask the master. :)

Here's what my current Norconex stack configuration looks like, conceptually (let me know if you need more details):

<httpcollector id="${projectName} HTTP Collector">
    <crawlerDefaults>
        ...
        <referenceFilters>
            ... avoiding some infinite-loop parameter traps, etc.
        </referenceFilters>
    </crawlerDefaults>
    <crawlers>
        <crawler id="${projectName} aspx files crawler">
            <documentFilters>
                <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="include">aspx</filter>
            </documentFilters>
            <importer>
                <preParseHandlers>
                    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
                        <dom selector="div.page_content" toField="pageContentDiv" />
                        <dom selector="div#page_content" toField="pageContentDiv" />
                    </tagger>
                </preParseHandlers>
                <postParseHandlers>
                    ... handling a bunch of junk pages that kill Solr
                </postParseHandlers>
            </importer>
            <committer class="com.norconex.committer.solr.SolrCommitter">
                <solrURL>${solrURL}</solrURL>
                <sourceContentField keep="true">pageContentDiv</sourceContentField>
                <queueDir>${workDir}/committer-queue</queueDir>
                <commitBatchSize>1</commitBatchSize>
                <maxRetries>5</maxRetries>
            </committer>
            <httpClientFactory>
                ... corporate NTLM stuff
            </httpClientFactory>
            <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
                <exePath>${phantomjsExecutablePath}</exePath>
            </documentFetcher>
        </crawler>
        <crawler id="${projectName} non-aspx files crawler">
            <documentFilters>
                <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">aspx</filter>
            </documentFilters>
            <importer>
                <postParseHandlers>
                    ... same garbage-handling
                </postParseHandlers>
            </importer>
            <committer class="com.norconex.committer.solr.SolrCommitter">
                <solrURL>${solrURL}</solrURL>
                <queueDir>${workDir}/committer-queue</queueDir>
                <commitBatchSize>1</commitBatchSize>
                <maxRetries>5</maxRetries>
            </committer>
            <httpClientFactory>
                ... corporate NTLM stuff
            </httpClientFactory>
            <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
                <exePath>${phantomjsExecutablePath}</exePath>
            </documentFetcher>
        </crawler>
    </crawlers>
</httpcollector>

How can I get the targeted div to become the "content" for .aspx docs, while keeping the normal full-text content for all other file types?

Thanks in advance for any guidance you might provide!

essiembre commented 5 years ago

By default, the document's extracted body text is sent as the "content" field by your Solr Committer. Since you say your pageContentDiv field is OK, you can tell the Solr Committer to use that field instead of the extracted text as the "content". Add this to your Solr Committer:

 <sourceContentField keep="false">pageContentDiv</sourceContentField>
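(Note: the config snippet above already has this, but with keep="true". The keep attribute only controls whether the pageContentDiv field itself is kept as a separate field after its value has been used as the content, or deleted once copied.)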

Alternatively, if you want to modify the extracted content itself, you can use a mix of StripBeforeTransformer and StripAfterTransformer on your aspx/HTML pages as pre-parse handlers.
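For example, a rough sketch of what that could look like in the aspx crawler's pre-parse handlers; the regex markers below are hypothetical placeholders you would replace with patterns matching the site's actual header and footer markup:

<preParseHandlers>
    <!-- Hypothetical sketch: strip everything up to (and including) the end-of-header marker. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBeforeTransformer" inclusive="true">
        <stripBeforeRegex><![CDATA[<!-- end of header -->]]></stripBeforeRegex>
    </transformer>
    <!-- Hypothetical sketch: strip everything from the start-of-footer marker onward. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer" inclusive="true">
        <stripAfterRegex><![CDATA[<!-- start of footer -->]]></stripAfterRegex>
    </transformer>
</preParseHandlers>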

Another option is to keep relying on your pageContentDiv field as you have it, and prevent the extracted content from being sent by truncating it to nothing (as a post-parse handler):

  <transformer class="com.norconex.importer.handler.transformer.impl.SubstringTransformer" end="0"/>
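With end="0", the transformer keeps an empty substring of the parsed text, i.e., it blanks the extracted body after parsing. The targeted text survives in the pageContentDiv field captured pre-parse, which your committer already maps to "content" via sourceContentField.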

Let me know if one of these options works for you.

douglasharley-westat commented 5 years ago

As always, you save the day; SubstringTransformer did the trick. Many thanks! Doug

essiembre commented 5 years ago

Great! Thanks for confirming.