Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How to use TextPatternTagger to extract domain.subdomain into new field #493

Closed kodo651 closed 6 years ago

kodo651 commented 6 years ago

Hi!

I've been struggling to use the TextPatternTagger to extract the domain part of a host name, dropping the leading subdomain (x.y.z -> y.z). I have a field, uri, which is essentially equivalent to "document.reference". I would like to apply a regex to extract y.z from the uri field and put it into a new metadata field, "domain". Is there an easy way to accomplish this?

Many, many thanks in advance!

P.S. No need to provide a regex, as I already have one. It's the settings I'm looking for :)

ronjakoi commented 6 years ago

My solution with ScriptTagger in JavaScript:

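<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
    <script><![CDATA[
        var URI = Java.type("java.net.URI");
        var host = (new URI(reference)).getHost();

        /* getHost() returns null when the reference has no host part. */
        if (host !== null) {
            /* Convert to a JavaScript string so the JavaScript split() applies. */
            var parts = String(host).split(".");
            /* If the last portion of the host begins with a number,
            it's an IPv4 address and there is no domain to be found.
            Single-label hosts (e.g. localhost) are skipped as well. */
            if (parts.length >= 2 && !parts[parts.length-1].match(/^[0-9]/)) {
                var domain = parts[parts.length-2] + "." + parts[parts.length-1];
                metadata.addString("domain", domain);
            }
        }
    ]]></script>
</tagger>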

However, this solution will not work for hosts whose public suffix has more than one part: for example.co.uk it would return co.uk rather than example.co.uk. To really tackle this properly, your solution would have to be aware of all public suffixes.
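
A minimal sketch of what suffix-aware extraction could look like, written as plain Java rather than as a tagger. The class name `RegistrableDomain` and the three-entry suffix set are hypothetical and for illustration only; a real solution would need the full Public Suffix List (or a library that bundles it).

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class RegistrableDomain {

    // Hypothetical, tiny sample of multi-label public suffixes.
    // The real Public Suffix List contains thousands of entries.
    private static final Set<String> MULTI_LABEL_SUFFIXES =
            new HashSet<>(Arrays.asList("co.uk", "org.uk", "com.au"));

    /** Returns the domain for the given URL, or null if none can be determined. */
    public static String of(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null; // no host part in the URL
        }
        String[] parts = host.toLowerCase(Locale.ROOT).split("\\.");
        int n = parts.length;
        if (n < 2) {
            return null; // single-label host such as "localhost"
        }
        // Same heuristic as the script above: a last label starting
        // with a digit is treated as an IPv4 address.
        if (Character.isDigit(parts[n - 1].charAt(0))) {
            return null;
        }
        String lastTwo = parts[n - 2] + "." + parts[n - 1];
        if (MULTI_LABEL_SUFFIXES.contains(lastTwo)) {
            // The last two labels are a public suffix: one more label is needed.
            return n >= 3 ? parts[n - 3] + "." + lastTwo : null;
        }
        return lastTwo;
    }
}
```

With this sketch, `RegistrableDomain.of("http://www.example.co.uk/path")` yields `example.co.uk`, while hosts with a single-label suffix such as `www.flashback.se` still yield the last two labels.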

essiembre commented 6 years ago

@kodo651, its name does not give it away, but TextPatternTagger is applied on the "body" of a document. Besides the useful scripting suggestion by @ronjakoi, you can have a look at ReplaceTagger. Something like this should work:

  <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
      <replace fromField="uri" toField="domain" regex="true">
          <fromValue>your (regex)</fromValue>
          <toValue>$1</toValue>
      </replace>
  </tagger>

Please confirm.

kodo651 commented 6 years ago

A big thank you to both of you. Right now I'm leaning towards ReplaceTagger, but I can't get it working. This is what I'm trying to apply:

          <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
            <replace fromField="uri" toField="domain" regex="true">
              <fromValue>^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)</fromValue>
              <toValue>$1</toValue>
            </replace>
          </tagger>

I've made some adjustments and had some progress, but ReplaceTagger behaves a little unexpectedly. In some cases the regex "almost works" and in others it doesn't. I've tested some of the URIs with the online tool at https://regex101.com/, and with the regex from the configuration above the tool extracts the correct domain from every one of them. Here are some examples:

uri: http://www.flashback.se/media/ domain: (empty!) regex101-site: flashback.se (correct!)

uri: https://polisen.se/rom/ domain: (empty!) regex101-site: polisen.se (correct!)

uri: http://www.flashback.se/artikel/3905/saudiarabien-mutar-utlandska-medier domain: flashback.se/artikel/3905/saudiarabien-mutar-utlandska-medier (error!) regex101-site: flashback.se (correct!)

uri: https://polisen.se/om-polisen/polisens-arbete/demonstrationer/ domain: (empty!) regex101-site: polisen.se (correct!)

From what I've seen, none of the "https" URIs work, yet they all check out in the regex tool.

It seems to be a problem with how the component interprets the regex. Do I have to escape the regex in any way?

essiembre commented 6 years ago

Not sure if it should be considered an issue, but try adding .* at the end of your regex. Works for me when I do so.
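
One possible explanation for why the trailing `.*` matters, sketched in plain Java. This assumes ReplaceTagger applies the pattern with replace-all semantics similar to `String.replaceAll` (an assumption, not something confirmed in this thread): only the matched portion of the value is replaced, so any unmatched remainder of the URI is kept unless the pattern consumes it. The class name `ReplaceRegexDemo` is hypothetical.

```java
public class ReplaceRegexDemo {

    // The regex from the configuration above, escaped for a Java string literal.
    static final String PATTERN =
            "^(?:https?:\\/\\/)?(?:[^@\\n]+@)?(?:www\\.)?([^:\\/\\n]+)";

    /** Replaces every match of the regex in the URI with capture group 1. */
    static String replace(String uri, String regex) {
        return uri.replaceAll(regex, "$1");
    }

    public static void main(String[] args) {
        String uri =
            "http://www.flashback.se/artikel/3905/saudiarabien-mutar-utlandska-medier";

        // Without ".*": only "http://www.flashback.se" is matched and replaced,
        // so the path that follows it is left in place.
        System.out.println(replace(uri, PATTERN));
        // prints flashback.se/artikel/3905/saudiarabien-mutar-utlandska-medier

        // With ".*": the whole value is matched, so only group 1 remains.
        System.out.println(replace(uri, PATTERN + ".*"));
        // prints flashback.se
    }
}
```

The first output matches the "error" case reported above, which is consistent with the assumption. An online regex tester shows the captured group rather than the result of a replacement, which could explain why the same pattern looked correct there.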

kodo651 commented 6 years ago

Hello!

I really appreciate your efforts in helping me out :) I tried your suggestion, but it didn't help... Would you mind posting your working fragment here exactly as it appears in your config file? Many thanks!

essiembre commented 6 years ago

The following works for me:

<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
      <replace fromField="uri" toField="domain" regex="true">
        <fromValue>^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+).*</fromValue>
        <toValue>$1</toValue>
      </replace>
    </tagger>    
  </postParseHandlers>
</importer>

kodo651 commented 6 years ago

I'm so sorry! The configuration worked as expected. I had forgotten to do some "housekeeping" (emptying intermediate results before re-crawling), but now that I've done that, it looks perfect!

Many, many thanks for your help!

Cheers

essiembre commented 6 years ago

No problem. Glad it works for you now.