Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Extract url extension #328

Closed aleha84 closed 7 years ago

aleha84 commented 7 years ago

I need an additional field in the resulting object that contains the URL extension, for example ".Html", ".Php", ".aspx", etc. How can I do this using only configuration options, without additional programming?

essiembre commented 7 years ago

Do you always have an extension in your URLs? If so, you can use the ReplaceTagger, added to the <importer> section as a pre-parse or post-parse handler. Example:

  <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
      <replace fromField="document.reference" toField="extension" regex="true">
          <fromValue><![CDATA[.*\.([^?#]*)(\?|\#|$).*]]></fromValue>
          <toValue>$1</toValue>
      </replace>
  </tagger>

You may have to adjust the regular expression so that it matches your URLs the way you want.

If not every URL has an extension, you can instead rely on document.contentType, which should always be added.

Please confirm.

aleha84 commented 7 years ago

Actually, I need both. If the URL has no extension, I need a stub value such as 'NO_EXTENSION'. document.contentType is helpful, but it doesn't let me distinguish between extensions. ReplaceTagger is helpful too, but it fails when there is no extension.

essiembre commented 7 years ago

If you want to have a default value, I can think of one way to do it without programming: configure a ConstantTagger before the ReplaceTagger, setting your default value. Then add your ReplaceTagger, modified to match only when there is an extension. When it matches, the extension value gets added as a new value to the extension field. You can then keep only the last value with ForceSingleValueTagger. Example:

  <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <constant name="extension">NO_EXTENSION</constant>
  </tagger>
  <!-- Replace tagger here -->
  <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
      <singleValue field="extension" action="keepLast"/>
  </tagger>
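
Written out in full, the chain described above could look like the sketch below. The regular expression in the ReplaceTagger is an illustrative assumption, not something given in this thread: it tries to match only when the last path segment of the URL contains a dot, and it should be tested against your own URLs. The sketch also assumes the ReplaceTagger leaves the extension field untouched when nothing matches.

```xml
  <!-- 1. Always start with the default value. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <constant name="extension">NO_EXTENSION</constant>
  </tagger>
  <!-- 2. Add the real extension as a second value, only when the URL has one.
       Assumed regex: scheme://host/, optional directories, then "name.ext",
       optionally followed by a query string or fragment. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
      <replace fromField="document.reference" toField="extension" regex="true">
          <fromValue><![CDATA[^\w+://[^/?#]+/(?:[^?#]*/)?[^/?#]*\.([^./?#]+)(?:[?#].*)?$]]></fromValue>
          <toValue>$1</toValue>
      </replace>
  </tagger>
  <!-- 3. Keep only the last value: the extension if one was added, else the default. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
      <singleValue field="extension" action="keepLast"/>
  </tagger>
```

With this pattern, a URL such as https://example.com/docs/page.aspx#top would be expected to yield "aspx", while https://example.com/docs/page would not match and would keep NO_EXTENSION.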

If you want something simpler, we can turn this into a feature request for a tagger that sets a default value when none is found, or for the ReplaceTagger to support a default value when there is no match. The ScriptTagger may also be helpful. It is coding, but no compiling is required (it can be done in the config).

aleha84 commented 7 years ago

For now, I'm using an Elasticsearch script field to compute the needed data at runtime with a regex. It is really fast. Thanks.

essiembre commented 7 years ago

Glad you managed to get what you want.

essiembre commented 7 years ago

FYI, the latest snapshot now has a new attribute called onConflict added to the ConstantTagger. It allows you to specify what to do if there is already a value. Possible options are: add (default), replace, and noop (do nothing). So to add NO_EXTENSION only if a field does not already have a value, you can now do this:

<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger"
        onConflict="noop">
    <constant name="extension">NO_EXTENSION</constant>
</tagger>
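
Since noop only adds the constant when the field is still empty, this ConstantTagger presumably needs to run after the tagger that extracts the extension, not before it as in the earlier keepLast approach. The ordering below is a sketch under that assumption; the comment stands in for a ReplaceTagger that matches only URLs with an extension.

```xml
<!-- ReplaceTagger goes here first, so "extension" is already set when the URL has one -->
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger"
        onConflict="noop">
    <!-- Added only when no "extension" value exists at this point -->
    <constant name="extension">NO_EXTENSION</constant>
</tagger>
```

In this arrangement the ForceSingleValueTagger step should no longer be needed, because the field never receives more than one value.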