Closed douglasharley-westat closed 5 years ago
The extracted body text of a document is sent as the "content" field by default by your Solr Committer. Since you say your pageContentDiv
field appears to be OK, you can tell the Solr Committer to use that field instead of the extracted text for "content". Add this to your Solr Committer:
<sourceContentField keep="false">pageContentDiv</sourceContentField>
Alternatively, if you want to modify the extracted content itself, you can use a mix of StripBeforeTransformer and StripAfterTransformer on your ASP/HTML pages as pre-parse handlers.
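As a rough sketch of that pre-parse approach (not tested against your site): the two HTML comment markers below are hypothetical placeholders for whatever reliably surrounds your target div in the raw page source, and the restrictTo on document.contentType assumes your .aspx pages are detected as text/html.

```xml
<importer>
  <preParseHandlers>
    <!-- Drop everything up to and including the (hypothetical) start marker. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBeforeTransformer" inclusive="true">
      <!-- Assumption: only HTML pages should be stripped; PDFs, DOCX, etc. are left alone. -->
      <restrictTo field="document.contentType">text/html</restrictTo>
      <stripBeforeRegex><![CDATA[<!-- CONTENT_START -->]]></stripBeforeRegex>
    </transformer>
    <!-- Drop everything from the (hypothetical) end marker onward. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer" inclusive="true">
      <restrictTo field="document.contentType">text/html</restrictTo>
      <stripAfterRegex><![CDATA[<!-- CONTENT_END -->]]></stripAfterRegex>
    </transformer>
  </preParseHandlers>
</importer>
```

Because these run before parsing, the regular expressions are matched against the raw HTML rather than the extracted text.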
Another option is to rely on your pageContentDiv field as you have it, and prevent content from being sent by truncating it (as a post-parse handler):
<transformer class="com.norconex.importer.handler.transformer.impl.SubstringTransformer" end="0"/>
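Since you want to keep the normal extracted content for the other file types, one way this might look in context is with the truncation limited to HTML pages. This is only a sketch: the restrictTo on document.contentType is my assumption about how your .aspx pages are detected, so adjust it to whatever distinguishes them in your crawl.

```xml
<importer>
  <postParseHandlers>
    <!-- end="0" truncates the extracted content to nothing. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.SubstringTransformer" end="0">
      <!-- Assumption: .aspx pages come back as text/html; other types keep their full content. -->
      <restrictTo field="document.contentType">text/html</restrictTo>
    </transformer>
  </postParseHandlers>
</importer>
```

Your pageContentDiv field is untouched by this, so it remains available for indexing.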
Let me know if one of these options works for you.
As always, you save the day. SubstringTransformer did the trick. Much thanks! Doug
Great! Thanks for confirming.
Hello,
Recently one of the internal sites I crawl and index to Solr changed its implementation, and I cannot seem to get the Norconex stack cfg working as desired. Specifically, the site's pages are now .aspx, with a bunch of garbage header and footer content on every page that is getting indexed and showing up in search results, because the content metadata field contains everything and not just the div contents I am targeting. The site also has a bunch of other file types (.pdf, .docx, etc.) that also need to be crawled, with content extracted as per normal (i.e., everything in the doc), and that is all working fine. It's just the damned .aspx pages I cannot get right, and I've been banging my head against it for a few days without resolution: the .aspx files always have the full page content, not the targeted div's content, even though the "pageContentDiv" metadata field I am capturing has the desired correct data. So I thought I'd ask the master. :)
Here's what my current attempted Norconex stack cfg is like, conceptually (lemme know if need more details):
How can I get the .aspx docs' targeted div to be "content" while keeping normal content for all other types?
Thanks in advance for any guidance you might provide!