Allow changing character case of field names

essiembre commented 8 years ago

Request created from @comschmid comment in issue #163.

Allow changing the character case of field names, as CharacterCaseTagger already does for field values.

essiembre commented 8 years ago

This feature is included in the latest snapshot. The CharacterCaseTagger now supports a new attribute called applyTo, which takes one of the following values: value (default), field, or both. An example:

<tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
    <characterCase type="upper" fieldName="myField" applyTo="field" />
</tagger>

This will result in every case permutation of "myField" being stored internally (and committed) as "MYFIELD". Let me know if that works for you.
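For illustration only, here is a minimal sketch of the behaviour described above, written against a plain Map rather than the importer's own metadata class. The class and method names (FieldCaseSketch, upperCaseFieldName) are hypothetical and not part of the Norconex API, and merging the values of fields that end up with the same name is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class FieldCaseSketch {

    // Hypothetical helper (not Norconex API): upper-cases the name of every
    // field whose name matches fieldName regardless of case. Values of fields
    // that collapse to the same name are merged (an assumption).
    static Map<String, List<String>> upperCaseFieldName(
            Map<String, List<String>> metadata, String fieldName) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            String name = entry.getKey();
            if (name.equalsIgnoreCase(fieldName)) {
                name = name.toUpperCase(Locale.ROOT);
            }
            result.computeIfAbsent(name, k -> new ArrayList<>())
                  .addAll(entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("myField", List.of("a"));
        metadata.put("MyField", List.of("b"));
        metadata.put("title", List.of("Some Title"));
        // prints {MYFIELD=[a, b], title=[Some Title]}
        System.out.println(upperCaseFieldName(metadata, "myField"));
    }
}
```

Field values are left untouched here, which corresponds to applyTo="field" as described above.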

comschmid commented 8 years ago

Unfortunately, it does not change the case of the field name with the following configuration just before the KeepOnlyTagger:

<!-- Lower-case some document meta fields -->
<tagger class="${importer}.handler.tagger.impl.CharacterCaseTagger">
    <characterCase fieldName="keywords" type="lower" applyTo="field" />
    <characterCase fieldName="description" type="lower" applyTo="field" />
</tagger>

I double-checked with the DebugTagger, which still shows the field name 'Keywords' with a corresponding value.

essiembre commented 8 years ago

Did you add that as a pre-handler or post-handler? I suspect "Keywords" is added for you AFTER this tagger did its case conversion. Make sure you put it last.

comschmid commented 8 years ago

Yes, I did, but it does not work even when it is at the very end, and I think I know why. When I set fieldName="Keywords" (matching the actual field case), it works:

In CharacterCaseTagger.java on line 101, you iterate over the field names using the exact case configured in the fieldName attribute. Therefore, "keywords" will never return a CaseChangeDetails one line further, because no field with that exact name exists in this document.

Just an idea: probably the easiest fix would be to set a case flag in the meta field extractors/parsers (where the field names get created), but you see the big picture better than I do.
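The lookup mismatch described in this comment can be illustrated with a small, self-contained sketch. This is not the actual CharacterCaseTagger code; the class and method names below are hypothetical, and a plain Map stands in for the document metadata.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ExactCaseLookupSketch {

    // Exact-case lookup: succeeds only if the configured name matches the
    // stored field name character for character.
    static Optional<String> findExact(
            Map<String, List<String>> metadata, String configuredName) {
        return metadata.containsKey(configuredName)
                ? Optional.of(configuredName)
                : Optional.empty();
    }

    // Case-insensitive lookup: scans the names actually present in the
    // document and returns the first one that matches ignoring case.
    static Optional<String> findIgnoreCase(
            Map<String, List<String>> metadata, String configuredName) {
        return metadata.keySet().stream()
                .filter(name -> name.equalsIgnoreCase(configuredName))
                .findFirst();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        metadata.put("Keywords", List.of("crawler, importer"));

        // The configuration says "keywords", the document has "Keywords".
        System.out.println(findExact(metadata, "keywords").isPresent());      // prints false
        System.out.println(findIgnoreCase(metadata, "keywords").orElse("?")); // prints Keywords
    }
}
```

Matching against the names actually present in the document is one way to make the configured case irrelevant; whether the eventual fix took this approach is not stated in this thread.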

essiembre commented 8 years ago

Can you attach the importer portion of your configuration? I will look into this.

comschmid commented 8 years ago

Thanks, here is the importer part:

    <importer>
      <!--
      <tempDir>defined/in/code</tempDir>
      <maxFileCacheSize></maxFileCacheSize>
      <maxFilePoolCacheSize></maxFilePoolCacheSize>
      <parseErrorsSaveDir>defined/in/code</parseErrorsSaveDir>
      -->

      <preParseHandlers>
        <tagger class="${importer}.handler.tagger.impl.DocumentLengthTagger" field="document.size.preparse" />

        <!-- These tags can be mixed, in the desired order of execution. -->
        <!--
        <tagger class="..." />
        <transformer class="..." />
        <filter class="..." />
        <splitter class="..." />
        -->
      </preParseHandlers>

      <!-- <documentParserFactory class="..." /> -->
      <postParseHandlers>
        <!-- These tags can be mixed, in the desired order of execution. -->

        <!-- follow HTML meta-equiv redirects without indexing original page -->
        <filter class="${importer}.handler.filter.impl.RegexMetadataFilter" onMatch="exclude" property="refresh">.*</filter>

        <!-- Collapse spaces and line feeds -->
        <transformer class="${importer}.handler.transformer.impl.ReduceConsecutivesTransformer" caseSensitive="true">
          <reduce>\s</reduce>
          <reduce>\n</reduce>
          <reduce>\r</reduce>
          <reduce>\t</reduce>
          <reduce>\n\r</reduce>
          <reduce>\r\n</reduce>
          <reduce>\s\n</reduce>
          <reduce>\s\r</reduce>
          <reduce>\s\r\n</reduce>
          <reduce>\s\n\r</reduce>
        </transformer>

        <!-- Remove CSS -->
        <transformer class="${importer}.handler.transformer.impl.ReplaceTransformer" caseSensitive="false">
          <replace>
            <fromValue>class=".*?"</fromValue>
            <toValue></toValue>
          </replace>
        </transformer>
        <transformer class="${importer}.handler.transformer.impl.StripBetweenTransformer" inclusive="true" >
          <stripBetween>
            <start>&lt;style.*?&gt;</start>
            <end>&lt;/style&gt;</end>
          </stripBetween>
          <stripBetween>
            <start>&lt;script.*?&gt;</start>
            <end>&lt;/script&gt;</end>
          </stripBetween>
        </transformer>

        <tagger class="${importer}.handler.tagger.impl.DocumentLengthTagger" field="document.size.postparse" />

        <!-- Reject small documents (<100 Bytes) -->
        <filter class="${importer}.handler.filter.impl.NumericMetadataFilter" onMatch="exclude" field="document.size.postparse">
          <condition operator="lt" number="100" />
        </filter>

        <!-- Detect the language and tag it -->
        <tagger class="${importer}.handler.tagger.impl.LanguageTagger" shortText="false" keepProbabilities="false" fallbackLanguage="" />

        <!-- Lower-case some document meta fields -->
        <tagger class="${importer}.handler.tagger.impl.CharacterCaseTagger">
          <characterCase fieldName="keywords" type="lower" applyTo="field" />
          <characterCase fieldName="description" type="lower" applyTo="field" />
        </tagger>

        <!-- Unless you configured Solr to accept ANY fields, it will fail
             when you try to add documents. Keep only the metadata fields provided, delete all other ones. -->
        <tagger class="${importer}.handler.tagger.impl.KeepOnlyTagger">
          <fields>content, title, keywords, description, tags, collector.referrer-reference, collector.depth, document.reference, document.language, document.size.preparse, document.size.postparse</fields>
        </tagger>

        <!-- Log fields for debugging -->
        <!-- <tagger class="${importer}.handler.tagger.impl.DebugTagger" logFields="Keywords, description" logContent="true" logLevel="WARN" /> -->

      </postParseHandlers>
    </importer>

essiembre commented 8 years ago

I was able to reproduce your issue with the config and fix it. Please try this new snapshot release. Hopefully that's the real deal this time. :-)

comschmid commented 8 years ago

Thanks, now it works as intended!

essiembre commented 8 years ago

Great!

Norconex / crawlers

Allow changing character case of field names #166