Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Handling depth restriction depends on type #280

Closed · popthink closed this issue 8 years ago

popthink commented 8 years ago

Hello :) I'm sorry for my poor English skills.

Thank you for your kindness.

I'm trying to make a crawler that handles depth based on URL type.

For example:

'iframe url' => handle at the same depth as its parent (referrer)
'a link' => handle at +1 depth

That way, the depth restriction will not stop the crawler from collecting iframe URLs.

I read the APIs and found some possible extension points (the fetcher, the link extractor), but it's not clear how to do this.

Could you give me some advice?

Thank you :)

essiembre commented 8 years ago

The depth represents the number of "jumps" from one URL to another. Frame URLs are distinct documents, so they will always be one level deeper than their containing parent page (unless they are also linked from somewhere less deep). So if they are skipped because of your maxDepth setting, increase it. If that causes documents you do not want to be crawled, I suggest you rely less on maxDepth and instead add other filtering rules to exclude the pages you do not want. Does that make sense?
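
For example, raising the depth while excluding unwanted sections could look something like this in the crawler config (a minimal sketch; RegexReferenceFilter is from the collector-core module and the URL pattern is only a placeholder for whatever you want to exclude):

      <maxDepth>3</maxDepth>
      <referenceFilters>
        <!-- reject any URL matching this (placeholder) pattern -->
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">.*/archive/.*</filter>
      </referenceFilters>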

popthink commented 8 years ago

Okay... Then I should make a class for this and increase maxDepth, something like this:

    class MyClass {
        // sketch: accept documents whose referring link was an <iframe src="...">,
        // and apply the regular logic otherwise
        boolean someFilterMethod(Doc doc, Properties metadata) {
            if ("iframe.src".equals(metadata.getString("referrer.tag"))) {
                return true;  // do something (e.g., always accept)
            } else {
                return false; // do something else (e.g., normal depth check)
            }
        }
    }

Right? Thank you for your advice :)

essiembre commented 8 years ago

Why don't you just increase the maxDepth? Are you worried you will get too many documents you do not want? Do you have a specific example of documents you do not want? Maybe you can share your config and explain what you want to happen. Chances are a regular filter could do the trick (whether a reference filter, metadata filter, importer filter, etc.).
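
For instance, a metadata filter in the crawler config could look something like this (a sketch only; RegexMetadataFilter is from the collector-core module, and the field and pattern are placeholders for whatever you need to match):

      <metadataFilters>
        <!-- reject documents whose Content-Type header matches the (placeholder) pattern -->
        <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter"
            onMatch="exclude" field="Content-Type">application/pdf.*</filter>
      </metadataFilters>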

popthink commented 8 years ago

Yes, I'm worried about getting too many documents.

So I just want to handle iframe URLs at the leaf level, plus URLs within maxDepth.

Max depth: 2 or 3 (to crawl)

    Home (Depth 0)
     -- a href URL (Depth 1)
         -- a href URL (Depth 2)
             -- a href URL (Depth 3, Rejected/Filtered)
             -- iframe URL (Depth 3, Accepted)

This is what I want to do.

So I thought I should set maxDepth to 3 and add a filter that accepts iframe.src (referrer tag) documents.

essiembre commented 8 years ago

Short of writing your own filter class, you can try using a ScriptFilter. The following should work (change "1" to whatever depth you want):

      <importer>
        <preParseHandlers>
          <filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
            <script><![CDATA[
              isFromIFrame = metadata.getString('collector.referrer-link-tag') == 'iframe.src';
              depth = metadata.getInt('collector.depth');
              /*return*/ (depth < 1 || isFromIFrame);
            ]]></script>
          </filter>
        </preParseHandlers>
      </importer>

For it to work, you have to keep referrer data when you crawl. If you are using the latest snapshot those are always stored so it should just work. Otherwise, you may have to add this first to your crawler config:

      <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor"
          keepReferrerData="true" />
      </linkExtractors>   

In any case, this solution will only work if your iframes are always encountered at the same depth. If the depth may vary, I would suggest you rely on other patterns to exclude/include just what you want, along the lines of the sketch below.
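
For illustration, one variation could look like this (a sketch building on the script above; I am assuming the document URL is exposed to the script as a reference variable, and the URL pattern is only a placeholder):

      <importer>
        <preParseHandlers>
          <filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
            <script><![CDATA[
              // accept documents coming from an iframe at any depth, plus
              // documents whose URL matches a wanted (placeholder) pattern
              isFromIFrame = metadata.getString('collector.referrer-link-tag') == 'iframe.src';
              isWantedUrl = /^https?:\/\/example\.com\/articles\//.test(reference);
              /*return*/ (isFromIFrame || isWantedUrl);
            ]]></script>
          </filter>
        </preParseHandlers>
      </importer>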

essiembre commented 8 years ago

Did the suggested approach work for you? Can we close this issue?

popthink commented 8 years ago

Oh, sorry for the late reply.

I tried it and verified that it works.

Now it can crawl a-links (< maxDepth) and iframes (<= maxDepth) :+1:

Thank you very much, really.

essiembre commented 8 years ago

Glad you have a working solution. Thanks for confirming.