Closed FestPlatin closed 2 years ago
With version 3, field mapping in the committer is supported, rendering the source content field option obsolete.
The source "content" field is always the document content (input stream). If you are keeping that content, it will be stored in the target content field (defaults to "content"). If you are ditching the parsed document content and want to store another field's value under that field name instead, you can do it by adding this to your committer:
<fieldMappings>
<mapping fromField="SOME_SOURCE_FIELD" toField="content"/>
</fieldMappings>
Hi @essiembre, thank you for the hint. We tried it once again, but it didn't work when we wrote to the field content. Writing to another field (content_cleaned) seems to work. In the output at the end you can see unnecessary line breaks and content that is outside the scope of what we want indexed. It seems that the content is always the stream content and cannot be overwritten.
<committer class="SolrCommitter">
<solrURL>http://localhost:8983/solr/websearch</solrURL>
<fieldMappings>
<!-- didn't work -->
<mapping fromField="my_content" toField="content"/>
<!-- works -->
<mapping fromField="my_content" toField="content_cleaned"/>
</fieldMappings>
</committer>
{
"id":"XXXXXXXXXX",
"portal":"slt-relaunch",
"title":"Einblicke in die Arbeit",
"content_cleaned":"Einblicke in die Arbeit Ein Tag mit ... Die Arbeit der ...",
"content":" \n \n \tHauptnavigation\n\tHauptinhalt\n\tServi....",
"_version_":1723818927316795392
},
If you do not explicitly take it out, a document "body" (i.e., content) will always be sent, even if you also map another field to the same target location. In your case, it seems you had both the metadata field and the content sent to the same target field (until you gave them different names).
To ditch the content entirely and rely 100% on your fields instead, you can add the following to your importer module (typically in the postParseHandlers section, once your fields have been extracted).
<handler class="NoContentTransformer"/>
Thanks for your feedback. But somehow we can't get it to work. We have already tried the NoContentTransformer handler, but without success. We've attached a complete mini example:
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="config-id">
<workDir>./work</workDir>
<crawlers>
<crawler id="crawler-id">
<startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="true" stayOnProtocol="true">
<!-- Random example url for my showcase -->
<url>https://iana.org/</url>
</startURLs>
<!-- Normalizes incoming URLs. -->
<urlNormalizer class="GenericURLNormalizer">
<normalizations>
removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
decodeUnreservedCharacters, removeDefaultPort,
encodeNonURICharacters
</normalizations>
</urlNormalizer>
<delay default="3000"/>
<numThreads>2</numThreads>
<maxDepth>10</maxDepth>
<maxDocuments>-1</maxDocuments>
<orphansStrategy>PROCESS</orphansStrategy>
<robotsTxt ignore="true"/>
<robotsMeta ignore="true"/>
<sitemapResolver ignore="true"/>
<canonicalLinkDetector ignore="false"/>
<importer>
<preParseHandlers>
<!-- Remove navigation elements from HTML pages. -->
<handler class="DOMDeleteTransformer">
<dom selector="header"/>
<dom selector="footer"/>
<dom selector="nav"/>
<dom selector="noindex"/>
</handler>
<!-- We only need the content from the main field -->
<handler class="DOMTagger">
<dom selector="main" toField="my_content" onSet="replace"/>
</handler>
</preParseHandlers>
<postParseHandlers>
<handler class="NoContentTransformer"/>
<handler class="ReplaceTransformer">
<replace>
<valueMatcher method="regex" replaceAll="true">^\s*[\r\n]</valueMatcher>
<toValue/>
</replace>
</handler>
<!-- Make sure we are sending only one value per field. -->
<handler class="ForceSingleValueTagger" action="keepFirst">
<fieldMatcher method="csv">my_content,title</fieldMatcher>
</handler>
<!-- Keep only those fields and discard the rest. -->
<handler class="KeepOnlyTagger">
<fieldMatcher method="csv">my_content,title</fieldMatcher>
</handler>
</postParseHandlers>
</importer>
<committers>
<committer class="SolrCommitter">
<solrURL>http://localhost:8983/solr/websearch</solrURL>
<fieldMappings>
<mapping fromField="my_content" toField="content"/>
<mapping fromField="my_content" toField="content_cleaned"/>
</fieldMappings>
</committer>
</committers>
</crawler>
</crawlers>
</httpcollector>
Thanks to your file I was able to reproduce the problem, and I found a few issues, with solutions for you.
Issue 1:
The field you want for your content is named "content". It happens that this is also the default name of the document body target field. So the mapping is done as expected, but when the time comes to set the body, it stores empty content since you cleared the body content. Since the body content replaces whatever metadata value of the same name you may have, it replaces it with an empty string. Solution: tell the committer you do not want the body at all by setting the target content field to null. It can be done like this: <targetContentField/>.
Issue 2:
You map my_content twice. Effectively, the second entry overwrites the first, so only content_cleaned will get through. Remove the second mapping. If you want to map the same field to multiple ones in Solr, copy it upfront, in the importer.
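For the one-to-many case, a rough sketch of copying the field in the importer could look like the following. It assumes the Importer's CopyTagger handler with its version 3 syntax (a copy element with toField, onSet, and a nested fieldMatcher); the exact attributes should be checked against the handler documentation for your version. The field names are the ones from the example config above, and the KeepOnlyTagger list is extended so the copy is not discarded.

```xml
<!-- Sketch only: goes in <postParseHandlers>, before KeepOnlyTagger. -->
<handler class="CopyTagger">
  <!-- Copy my_content into a second field, replacing any existing value. -->
  <copy toField="content_cleaned" onSet="replace">
    <fieldMatcher>my_content</fieldMatcher>
  </copy>
</handler>
<!-- The copied field must also survive the KeepOnlyTagger filter. -->
<handler class="KeepOnlyTagger">
  <fieldMatcher method="csv">my_content,content_cleaned,title</fieldMatcher>
</handler>
```

With something like this in place, the committer would keep a single mapping (my_content to content), and content_cleaned would be sent under its own name.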
In the end, this is what worked for me:
<committer class="SolrCommitter">
<solrURL>http://localhost:8983/solr/websearch</solrURL>
<targetContentField/>
<fieldMappings>
<mapping fromField="my_content" toField="content"/>
</fieldMappings>
</committer>
Maybe: There are cases to be made about one-to-many mappings in committers and treating a field as a multi-value field when the content gets added over an already existing field. If you think those would be useful to have, we can make it a feature request.
Many thanks for your support. This solution finally solved our problem!
The one-to-many mapping in the committer is not needed. It was just a leftover from earlier tests. But thank you anyway for that hint as well.
Hi all, in version 2 of the Norconex Solr Committer there is a configuration option "sourceContentField". This option allows us to use a metadata field as the document content. In version 3 this option doesn't seem to exist anymore. Is there any way to use our content field the way we did in the previous version?