Norconex / committer-solr

Solr implementation of Norconex Committer. Should also work with any Solr-based products, such as LucidWorks.
https://opensource.norconex.com/committers/solr/
Apache License 2.0

Use metadata field for content in v3 #23

Closed FestPlatin closed 2 years ago

FestPlatin commented 2 years ago

Hi all, version 2 of the Norconex Solr Committer has a configuration option, "sourceContentField", that lets us use a metadata field as the document content. In version 3 this option no longer seems to exist. Is there any way to use our content field the way we did in the previous version?

essiembre commented 2 years ago

With version 3, field mapping in the committer is supported, rendering the source content field option obsolete.

The source "content" field is always the document content (input stream). If you are keeping that content, it will be stored in the target content field (defaults to "content"). If you are ditching the parsed document content and want to store another field's value under that field name instead, you can do so by adding this to your committer:

<fieldMappings>
    <mapping fromField="SOME_SOURCE_FIELD" toField="content"/>
</fieldMappings>
FestPlatin commented 2 years ago

Hi @essiembre, thank you for the hint. We tried it again, but it did not work when writing to the field "content". Writing to another field (content_cleaned) seems to work. In the output shown after the configuration below, the content field still contains unnecessary line breaks and text that is outside our intended index scope. It seems that the content is always the stream content and cannot be overwritten.

<committer class="SolrCommitter">
    <solrURL>http://localhost:8983/solr/websearch</solrURL>
    <fieldMappings>
        <!-- did not work -->
        <mapping fromField="my_content" toField="content"/>
        <!-- works -->
        <mapping fromField="my_content" toField="content_cleaned"/>
    </fieldMappings>
</committer>
{
   "id":"XXXXXXXXXX",
   "portal":"slt-relaunch",
   "title":"Einblicke in die Arbeit",
   "content_cleaned":"Einblicke in die Arbeit Ein Tag mit ... Die Arbeit der ...",
   "content":" \n \n \tHauptnavigation\n\tHauptinhalt\n\tServi....",
   "_version_":1723818927316795392
},
essiembre commented 2 years ago

If you do not explicitly take it out, a document "body" (i.e., content) will always be sent, even if you also map another field to the same target location. In your case, it seems you had both the metadata field and the content sent to the same target field (until you gave them different names).

To ditch the content entirely and rely 100% on your fields instead, add the following to your importer module (typically in the postParseHandlers section, once your fields have been extracted).

<handler class="NoContentTransformer"/>
FestPlatin commented 2 years ago

Thanks for your feedback. Somehow we still cannot get it to work. We had already tried the NoContentTransformer handler, without success. We have attached a complete minimal example:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="config-id">
    <workDir>./work</workDir>

    <crawlers>
        <crawler id="crawler-id">
            <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="true" stayOnProtocol="true">
                <!-- Random example url for my showcase  -->
                <url>https://iana.org/</url>
            </startURLs>

            <!-- Normalizes incoming URLs. -->
            <urlNormalizer class="GenericURLNormalizer">
                <normalizations>
                    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
                    decodeUnreservedCharacters, removeDefaultPort,
                    encodeNonURICharacters
                </normalizations>
            </urlNormalizer>

            <delay default="3000"/>
            <numThreads>2</numThreads>
            <maxDepth>10</maxDepth>
            <maxDocuments>-1</maxDocuments>
            <orphansStrategy>PROCESS</orphansStrategy>
            <robotsTxt ignore="true"/>
            <robotsMeta ignore="true"/>
            <sitemapResolver ignore="true"/>
            <canonicalLinkDetector ignore="false"/>

            <importer>
                <preParseHandlers>
                    <!-- Remove navigation elements from HTML pages. -->
                    <handler class="DOMDeleteTransformer">
                        <dom selector="header"/>
                        <dom selector="footer"/>
                        <dom selector="nav"/>
                        <dom selector="noindex"/>
                    </handler>

                    <!-- We only need the content from the main field -->
                    <handler class="DOMTagger">
                        <dom selector="main" toField="my_content" onSet="replace"/>
                    </handler>
                </preParseHandlers>

                <postParseHandlers>
                    <handler class="NoContentTransformer"/>

                    <handler class="ReplaceTransformer">
                        <replace>
                            <valueMatcher method="regex" replaceAll="true">^\s*[\r\n]</valueMatcher>
                            <toValue/>
                        </replace>
                    </handler>

                    <!-- Make sure we are sending only one value per field. -->
                    <handler class="ForceSingleValueTagger" action="keepFirst">
                        <fieldMatcher method="csv">my_content,title</fieldMatcher>
                    </handler>

                    <!-- Keep only those fields and discard the rest. -->
                    <handler class="KeepOnlyTagger">
                        <fieldMatcher method="csv">my_content,title</fieldMatcher>
                    </handler>
                </postParseHandlers>
            </importer>

            <committers>
                <committer class="SolrCommitter">
                    <solrURL>http://localhost:8983/solr/websearch</solrURL>
                    <fieldMappings>
                        <mapping fromField="my_content" toField="content"/>
                        <mapping fromField="my_content" toField="content_cleaned"/>
                    </fieldMappings>
                </committer>
            </committers>
        </crawler>
    </crawlers>
</httpcollector>
essiembre commented 2 years ago

Thanks to your file, I was able to reproduce the problem and found a few issues, with solutions for you.

Issue 1: The field you want for your content is named "content". It happens that this is also the default name of the document body target field. So the mapping is done as expected, but when the time comes to set the body, the committer stores empty content, since you cleared the body content. Because the body content replaces whatever metadata value of the same name you may have, your mapped value is replaced with an empty string. Solution: tell the committer you do not want the body at all by setting the target content field to null. It can be done like this: <targetContentField/>.

Issue 2: You map my_content twice. Effectively, the second entry overwrites the first, so only content_cleaned will get through. Remove the second mapping. If you want to map the same field to multiple ones in Solr, copy it upfront, in the importer.

In the end, this is what worked for me:

<committer class="SolrCommitter">
    <solrURL>http://localhost:8983/solr/websearch</solrURL>
    <targetContentField/>
    <fieldMappings>
        <mapping fromField="my_content" toField="content"/>
    </fieldMappings>
</committer>
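
For the one-to-many case mentioned under Issue 2 (copying the field upfront in the importer), a minimal sketch could look like the following. It assumes the Importer v3 CopyTagger handler with its copy/fieldMatcher syntax, and it assumes that fields without an explicit committer mapping are sent to Solr under their own name. The field name content_cleaned is simply the one used earlier in this thread. Check the handler documentation for your exact version before relying on it.

```xml
<!-- Sketch only: place in <postParseHandlers>, before KeepOnlyTagger. -->
<handler class="CopyTagger">
    <!-- Copy my_content into a second field (assumed CopyTagger v3 syntax). -->
    <copy toField="content_cleaned" onSet="replace">
        <fieldMatcher>my_content</fieldMatcher>
    </copy>
</handler>

<!-- The copied field must be listed here, or KeepOnlyTagger will discard it. -->
<handler class="KeepOnlyTagger">
    <fieldMatcher method="csv">my_content,content_cleaned,title</fieldMatcher>
</handler>
```

With this in place, the committer keeps its single my_content to content mapping, and content_cleaned should reach Solr as a separate field without needing a second mapping entry.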

Maybe: There are cases to be made about one-to-many mappings in committers and treating a field as a multi-value field when the content gets added over an already existing field. If you think those would be useful to have, we can make it a feature request.

FestPlatin commented 2 years ago

Many thanks for your support. This solution finally solved our problem!

The one-to-many mapping in the committer is not needed. It was just a leftover from earlier testing. But thank you anyway for that hint as well.