Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Allow changing character case of field names #166

Closed: essiembre closed this issue 8 years ago

essiembre commented 8 years ago

Request created from @comschmid's comment in issue #163.

Allow changing the character case of field names, as CharacterCaseTagger already does for field values.

essiembre commented 8 years ago

This feature is included in the latest snapshot. The CharacterCaseTagger now supports a new attribute called applyTo which takes one of the following values: value (default), field, or both. An example:

<tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
    <characterCase type="upper" fieldName="myField" applyTo="field" />
</tagger>

This will cause every case permutation of "myField" to be stored internally (and committed) as "MYFIELD". Let me know if that works for you.
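To make the intended effect of applyTo="field" concrete, here is a minimal Java sketch of the renaming described above. It is not Norconex code: the class FieldNameCaseSketch, the method upperCaseFieldName, and the use of a plain Map in place of the importer's metadata object are all assumptions made for illustration. Merging the values of field names that collapse to the same upper-case name is also an assumption about how duplicates would be handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not part of the Norconex Importer. */
final class FieldNameCaseSketch {

    /**
     * Returns a copy of the metadata in which every key equal to fieldName,
     * ignoring case, is renamed to its upper-case form. Values of keys that
     * collapse to the same name are merged in encounter order (an assumption).
     */
    static Map<String, List<String>> upperCaseFieldName(
            Map<String, List<String>> metadata, String fieldName) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            String key = entry.getKey();
            if (key.equalsIgnoreCase(fieldName)) {
                key = key.toUpperCase(Locale.ROOT);
            }
            result.computeIfAbsent(key, k -> new ArrayList<>())
                    .addAll(entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("myfield", List.of("a"));
        metadata.put("MyField", List.of("b"));
        metadata.put("title", List.of("Some title"));
        // Prints {MYFIELD=[a, b], title=[Some title]}
        System.out.println(upperCaseFieldName(metadata, "myField"));
    }
}
```

Field values are left untouched here, which corresponds to applyTo="field"; changing the values as well would correspond to applyTo="both".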

comschmid commented 8 years ago

Unfortunately, it does not change the case of the field name with the following configuration just before the KeepOnlyTagger:

<!-- Lower-case some document meta fields -->
<tagger class="${importer}.handler.tagger.impl.CharacterCaseTagger">
    <characterCase fieldName="keywords" type="lower" applyTo="field" />
    <characterCase fieldName="description" type="lower" applyTo="field" />
</tagger>

I double-checked with the DebugTagger, which still shows the field name 'Keywords' with a corresponding value.

essiembre commented 8 years ago

Did you add that as a pre-handler or a post-handler? I suspect "Keywords" is added for you AFTER this tagger has done its case conversion. Make sure you put it last.

comschmid commented 8 years ago

Yes, I did, but even when it is at the very end it does not work. I think I know why, because it does work when I set fieldName="Keywords" (matching the actual case of the field):

In CharacterCaseTagger.java on line 101 you iterate over the field names with the exact case as configured with the attribute fieldName. Therefore, "keywords" will never return a CaseChangeDetails one line further, because it doesn't exist in this document.

Just an idea: probably the easiest fix would be to set a case flag in the meta field extractors/parsers (where the field names get created), but you see the big picture better than I do.
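The mismatch described in this comment can be sketched in Java as follows. This is one reading of the report, not the actual source of CharacterCaseTagger.java nor the fix that was later released: the class FieldLookupSketch, both method names, and the plain Map standing in for the document metadata are hypothetical. It only shows why a configured name of "keywords" misses a field stored as "Keywords" under exact-case matching, and how a case-insensitive match would find it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration only; not part of the Norconex Importer. */
final class FieldLookupSketch {

    /** Exact-case lookup: finds the field only if the case matches exactly. */
    static Optional<String> findExact(
            Map<String, List<String>> metadata, String configuredName) {
        return metadata.containsKey(configuredName)
                ? Optional.of(configuredName)
                : Optional.empty();
    }

    /** Case-insensitive lookup: returns the field name as actually stored. */
    static Optional<String> findIgnoreCase(
            Map<String, List<String>> metadata, String configuredName) {
        return metadata.keySet().stream()
                .filter(key -> key.equalsIgnoreCase(configuredName))
                .findFirst();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("Keywords", List.of("crawler, search"));

        // Configured as fieldName="keywords", stored as "Keywords".
        System.out.println(findExact(metadata, "keywords").isPresent());     // false
        System.out.println(findIgnoreCase(metadata, "keywords").orElse("")); // Keywords
    }
}
```

Under exact-case matching, the only way to hit the field is to configure its current case (fieldName="Keywords"), which matches the workaround reported above.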

essiembre commented 8 years ago

Can you attach the importer portion of your configuration? I will look into this.

comschmid commented 8 years ago

Thanks, here is the importer part:

     <importer>
        <!-- 
        <tempDir>defined/in/code</tempDir>
        <maxFileCacheSize></maxFileCacheSize>
        <maxFilePoolCacheSize></maxFilePoolCacheSize>
        <parseErrorsSaveDir>defined/in/code</parseErrorsSaveDir>
         -->

        <preParseHandlers>
            <tagger class="${importer}.handler.tagger.impl.DocumentLengthTagger" field="document.size.preparse" />

            <!-- These tags can be mixed, in the desired order of execution. -->
            <!--
            <tagger class="..." />
            <transformer class="..." />
            <filter class="..." />
            <splitter class="..." />
            -->
        </preParseHandlers>

        <!-- <documentParserFactory class="..." /> -->
      <postParseHandlers>
        <!-- These tags can be mixed, in the desired order of execution. -->

        <!-- follow HTML meta-equiv redirects without indexing original page -->
        <filter class="${importer}.handler.filter.impl.RegexMetadataFilter" onMatch="exclude" property="refresh">.*</filter>

        <!-- Collapse spaces and line feeds -->
        <transformer class="${importer}.handler.transformer.impl.ReduceConsecutivesTransformer" caseSensitive="true">
          <reduce>\s</reduce>
          <reduce>\n</reduce>
          <reduce>\r</reduce>
          <reduce>\t</reduce>
          <reduce>\n\r</reduce>
          <reduce>\r\n</reduce>
          <reduce>\s\n</reduce>
          <reduce>\s\r</reduce>
          <reduce>\s\r\n</reduce>
          <reduce>\s\n\r</reduce>
        </transformer>

        <!-- Remove CSS -->
        <transformer class="${importer}.handler.transformer.impl.ReplaceTransformer" caseSensitive="false">
          <replace>
            <fromValue>class=".*?"</fromValue>
             <toValue></toValue>
          </replace>
        </transformer>
        <transformer class="${importer}.handler.transformer.impl.StripBetweenTransformer" inclusive="true" >
          <stripBetween>
            <start>&lt;style.*?&gt;</start>
            <end>&lt;/style&gt;</end>
          </stripBetween>
          <stripBetween>
            <start>&lt;script.*?&gt;</start>
            <end>&lt;/script&gt;</end>
          </stripBetween>
        </transformer>

        <tagger class="${importer}.handler.tagger.impl.DocumentLengthTagger" field="document.size.postparse" />

        <!-- Reject small documents (<100 Bytes) -->
        <filter class="${importer}.handler.filter.impl.NumericMetadataFilter" onMatch="exclude" field="document.size.postparse">
          <condition operator="lt" number="100" />
        </filter>

        <!-- Detect the language and tag it -->
        <tagger class="${importer}.handler.tagger.impl.LanguageTagger" shortText="false" keepProbabilities="false" fallbackLanguage="" />

        <!-- Lower-case some document meta fields -->
        <tagger class="${importer}.handler.tagger.impl.CharacterCaseTagger">
          <characterCase fieldName="keywords" type="lower" applyTo="field" />
          <characterCase fieldName="description" type="lower" applyTo="field" />
        </tagger>

        <!-- Unless you configured Solr to accept ANY fields, it will fail
             when you try to add documents. Keep only the metadata fields provided, delete all other ones. -->
        <tagger class="${importer}.handler.tagger.impl.KeepOnlyTagger">
          <fields>content, title, keywords, description, tags, collector.referrer-reference, collector.depth, document.reference, document.language, document.size.preparse, document.size.postparse</fields>
        </tagger>

        <!-- Log fields for debugging -->
        <!-- <tagger class="${importer}.handler.tagger.impl.DebugTagger" logFields="Keywords, description" logContent="true" logLevel="WARN" /> -->

      </postParseHandlers>
     </importer>

essiembre commented 8 years ago

I was able to reproduce your issue with the config and fix it. Please try this new snapshot release. Hopefully that's the real deal this time. :-)

comschmid commented 8 years ago

Thanks, now it works as intended!

essiembre commented 8 years ago

Great!