Closed aleha84 closed 7 years ago
Do you always have an extension in your URLs? If so, you can use the ReplaceTagger, added to the <importer> section as either a pre-parse or a post-parse handler. Example:
<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
<replace fromField="document.reference" toField="extension" regex="true">
<fromValue><![CDATA[.*\.([^?#]*)(\?|\#|$).*]]></fromValue>
<toValue>$1</toValue>
</replace>
</tagger>
You may have to adjust the regular expression so it matches your URLs the way you want.
If not every URL has an extension, you may instead rely on document.contentType, which should always be added.
Please confirm.
Actually, I need both. If no extension is present in the URL, I need some stub, like 'NO_EXTENSION' or something else. document.contentType is helpful, but it does not let me distinguish between extensions. ReplaceTagger is helpful too, but it fails when there is no extension.
If you want to have a default value, I can think of one way to do it without programming. It would be to configure a ConstantTagger before the ReplaceTagger, setting your default value. Then you have your replace tagger (modify it to only match when there is an extension). What will happen is the extension value will get added as a new value to the extension field. Then you can keep the last value only with ForceSingleValueTagger. Example:
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
<constant name="extension">NO_EXTENSION</constant>
</tagger>
<!-- Replace tagger here -->
<tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
<singleValue field="extension" action="keepLast"/>
</tagger>
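Putting the three steps together might look like the sketch below. The tagger classes and attributes are taken from the examples above. The stricter regular expression is an untested assumption: it is written to match only when the last path segment contains a dot, and it relies on ReplaceTagger leaving the extension field untouched when the pattern does not match. Try it against your own URLs before relying on it.

```xml
<!-- 1. Set the default value first. -->
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
  <constant name="extension">NO_EXTENSION</constant>
</tagger>

<!-- 2. Add the real extension as a second value, but only when the last
        path segment has one. The pattern below is a hypothetical example:
        scheme://host/ then optional directories, then "name.ext",
        then an optional query string or fragment. -->
<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
  <replace fromField="document.reference" toField="extension" regex="true">
    <fromValue><![CDATA[^[^:]+://[^/?#]+/(?:[^?#]*/)?[^/?#]*\.([A-Za-z0-9]+)(?:[?#].*)?$]]></fromValue>
    <toValue>$1</toValue>
  </replace>
</tagger>

<!-- 3. Keep only the last value: the extension if one was added,
        otherwise the default. -->
<tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
  <singleValue field="extension" action="keepLast"/>
</tagger>
```

Under these assumptions, http://example.com/page.html?x=1 should end up with html in the extension field, while http://example.com/path should keep NO_EXTENSION.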
If you wish something simpler, we can turn this into a feature request to have a tagger that sets a default value when none is found, or have the ReplaceTagger support a default value when there is no match. The ScriptTagger may be helpful also. It is coding, but no compiling required (can be done in the config).
For now, I'm using an Elasticsearch script field to compute the data I need at runtime with a regex. It is really fast. Thanks.
Glad you managed to get what you want.
FYI, the latest snapshot now has a new attribute called onConflict added to the ConstantTagger. It allows you to specify what to do if there is already a value. Possible options are: add (default), replace, and noop (do nothing). So to add NO_EXTENSION only if a field does not already have a value, you can now do this:
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger"
onConflict="noop">
<constant name="extension">NO_EXTENSION</constant>
</tagger>
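With onConflict="noop", the ForceSingleValueTagger step should no longer be needed. The sketch below assumes the ConstantTagger is placed after the ReplaceTagger, so that the extension field is already populated whenever a match occurred. The regular expression is a hypothetical stricter pattern, not something taken from the Norconex documentation.

```xml
<!-- Extract the extension first, only when the last path segment has one
     (hypothetical pattern; adjust it to your URLs). -->
<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
  <replace fromField="document.reference" toField="extension" regex="true">
    <fromValue><![CDATA[^[^:]+://[^/?#]+/(?:[^?#]*/)?[^/?#]*\.([A-Za-z0-9]+)(?:[?#].*)?$]]></fromValue>
    <toValue>$1</toValue>
  </replace>
</tagger>

<!-- Then fall back to the default only if the field is still empty. -->
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger"
    onConflict="noop">
  <constant name="extension">NO_EXTENSION</constant>
</tagger>
```

The order matters here: if the ConstantTagger ran first, the field would already hold NO_EXTENSION and the ReplaceTagger would add the real extension as a second value.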
I need an additional field in the resulting object that contains the URL extension, for example ".Html", ".Php", ".aspx", etc. How can I do this using only configuration options, without additional programming?