Closed kodo651 closed 6 years ago
My solution with ScriptTagger in JavaScript:
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger" >
<script><![CDATA[
var URI = Java.type("java.net.URI");
var host = (new URI(reference)).getHost();
/* getHost() returns null when the reference has no host component. */
var parts = host ? host.split(".") : [];
/* If the last portion of the host begins with a number,
it's an IPv4 address and there is no domain to be found.
Single-label hosts (e.g. localhost) are skipped as well. */
if(parts.length >= 2 && !parts[parts.length-1].match(/^[0-9]/)) {
var domain = parts[parts.length-2] + "." + parts[parts.length-1];
metadata.addString("domain", domain);
}
]]></script>
</tagger>
However, this solution will not work for hosts whose public suffix has more than one part, such as example.co.uk (it would return co.uk). To really tackle this properly, your solution would have to be aware of all public suffixes.
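A rough sketch of what suffix awareness could look like, in plain JavaScript so it can be tried outside the crawler. The `MULTI_LABEL_SUFFIXES` array and the `registrableDomain` function are names made up for this example, and the three suffixes are only a sample; a real solution would need the full Public Suffix List.

```javascript
// Hypothetical, deliberately tiny sample of multi-part public suffixes.
// This is NOT the real Public Suffix List; it only illustrates the idea.
var MULTI_LABEL_SUFFIXES = ["co.uk", "org.uk", "com.au"];

// Returns the domain plus one label below the public suffix,
// or null when no domain can be determined.
function registrableDomain(host) {
  if (!host) {
    return null;
  }
  var parts = host.toLowerCase().split(".");
  // Single-label hosts and IPv4 addresses have no domain to extract.
  if (parts.length < 2 || /^[0-9]/.test(parts[parts.length - 1])) {
    return null;
  }
  var lastTwo = parts.slice(-2).join(".");
  // When the last two labels are themselves a suffix, keep one more label.
  var keep = MULTI_LABEL_SUFFIXES.indexOf(lastTwo) >= 0 ? 3 : 2;
  if (parts.length < keep) {
    return null;
  }
  return parts.slice(-keep).join(".");
}

console.log(registrableDomain("www.example.co.uk")); // example.co.uk
console.log(registrableDomain("www.flashback.se"));  // flashback.se
console.log(registrableDomain("192.168.0.1"));       // null
```

Inside the ScriptTagger, the host would still come from `new URI(reference).getHost()` as in the snippet above; only the part that picks the labels changes.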
@kodo651, its name does not give it away, but TextPatternTagger is applied on the "body" of a document. Besides the useful scripting suggestion by @ronjakoi, you can have a look at ReplaceTagger. Something like this should work:
<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
<replace fromField="uri" toField="domain" regex="true">
<fromValue>your (regex)</fromValue>
<toValue>$1</toValue>
</replace>
</tagger>
Please confirm.
A big thank you to both of you. Right now I'm leaning towards the "ReplaceTagger"; however, I can't get it working. This is what I'm trying to apply:
<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
<replace fromField="uri" toField="domain" regex="true">
<fromValue>^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)</fromValue>
<toValue>$1</toValue>
</replace>
</tagger>
I've made some adjustments and have had some progress. However, the ReplaceTagger behaves a little unexpectedly: in some cases the regex "almost works" and in others it doesn't work at all. I've tested some of the URIs with the online tool at https://regex101.com/, and with the regex from my configuration the correct domain is extracted from every one of them. Let me provide some examples:
uri: http://www.flashback.se/media/ domain: (empty!) regex101-site: flashback.se (correct!)
uri: https://polisen.se/rom/ domain: (empty!) regex101-site: polisen.se (correct!)
uri: http://www.flashback.se/artikel/3905/saudiarabien-mutar-utlandska-medier domain: flashback.se/artikel/3905/saudiarabien-mutar-utlandska-medier (error!) regex101-site: flashback.se (correct!)
uri: https://polisen.se/om-polisen/polisens-arbete/demonstrationer/ domain: (empty!) regex101-site: polisen.se (correct!)
From what I've seen, none of the "https" URIs work, although they check out in the regex tool.
Seems to be a problem with the regex-interpretation by the component. Do I have to escape the regex in any way?
Not sure if it should be considered an issue, but try adding .* at the end of your regex. Works for me when I do so.
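To illustrate why the trailing .* can matter, here is a small sketch in plain JavaScript rather than Norconex code. It assumes the tagger substitutes only the portion of the value that the regex matched, the way an ordinary regex replace does; I have not verified that against the ReplaceTagger source. The `extract` helper is a name invented for this example.

```javascript
// Same pattern as in the configuration, without and with a trailing ".*".
var withoutTail = /^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/;
var withTail = /^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+).*/;

// Hypothetical helper: replace whatever the regex matched with group 1.
function extract(uri, regex) {
  return uri.replace(regex, "$1");
}

var uri = "https://polisen.se/om-polisen/polisens-arbete/demonstrationer/";

// Only the scheme and host are matched, so the path is left in place.
console.log(extract(uri, withoutTail));
// polisen.se/om-polisen/polisens-arbete/demonstrationer/

// The whole value is matched, so only the captured host remains.
console.log(extract(uri, withTail));
// polisen.se
```

An online tester reports the captured group, not the result of a replacement, which would explain why the pattern looks correct there even without the .* at the end.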
Hello!
I really appreciate your efforts here in helping me out :) I tried your suggestion but it didn't help... Would you mind posting your working code fragment here as it looks in your config-file? Many thanks!
The following works for me:
<importer>
<postParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
<replace fromField="uri" toField="domain" regex="true">
<fromValue>^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+).*</fromValue>
<toValue>$1</toValue>
</replace>
</tagger>
</postParseHandlers>
</importer>
I'm so sorry! The configuration worked as expected. I had forgotten to do some "housekeeping" (emptying intermediate results before re-crawling), but now that I've done that it looks perfect!
Many, many thanks for your help!
Cheers
No problem. Glad it works for you now.
Hi!
I've been struggling to use the TextPatternTagger to extract the domain+subdomain (x.y.z -> y.z). I have a field, uri, which essentially is equivalent to "document.reference". I would like to apply a regex in order to extract y.z from the uri-field and then put it into a new meta-data field, "domain". Is there an easy way to accomplish this?
Many, many thanks in advance!
Ps No need to provide a regex as I have it already - it's the settings I'm looking for :) Ds