Closed popthink closed 8 years ago
The depth represents the number of "jumps" from one URL to another. Frame URLs are distinct documents so they will always be one level deeper than their containing parent page (unless they are also linked from elsewhere less deep). So if they are skipped because of your maxDepth setting, increase it. If that causes other documents to be crawled you do not want, I suggest you rely less on the maxDepth and add other filtering rules instead to exclude pages you do not want. Make sense?
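To illustrate the depth rule just described, here is a minimal plain-Java sketch (the class and method names are invented for illustration; this is not Norconex code): every discovered URL, whether from an href or an iframe src, is one jump deeper than the page it was found on, and documents past maxDepth are skipped.

```java
// Illustrative sketch of how crawl depth propagates (invented names, not the Norconex API).
public class DepthPropagation {

    /** Any discovered URL (href or iframe src) is one jump deeper than its parent page. */
    static int childDepth(int parentDepth) {
        return parentDepth + 1;
    }

    /** A document is crawled only while its depth does not exceed maxDepth. */
    static boolean crawlable(int depth, int maxDepth) {
        return depth <= maxDepth;
    }

    public static void main(String[] args) {
        int home = 0;                             // the start URL
        int page = childDepth(home);              // 1: linked from home
        int frame = childDepth(page);             // 2: iframe embedded in that page
        System.out.println(crawlable(frame, 1));  // false: increase maxDepth to reach it
        System.out.println(crawlable(frame, 2));  // true
    }
}
```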
Okay. Then I should make a class for this and increase maxDepth. Something like this:

class MyClass {
    void someFilterMethod(Doc doc, Properties metadata) {
        if ("iframe.src".equals(metadata.getString("collector.referrer-link-tag"))) {
            // do something
        } else {
            // do something else
        }
    }
}
Right? Thank you for your advice :)
Why don't you just increase the maxDepth? Are you worried you will get too many documents you do not want? Do you have a specific example of documents you do not want? Maybe you can share your config and explain what you want to happen. Chances are a regular filter could do the trick (whether a reference filter, metadata filter, importer filter, etc.).
Yes, I'm worried about getting too many documents.
So I just want to handle iframe URLs at the leaf level, plus URLs within maxDepth.
Max depth: 2 or 3 (to crawl)
Home (Depth 0) -- a href URL (Depth 1) -- a href URL (Depth 2) -- a href URL (Depth 3, Rejected/Filtered)
                                                               |-- iframe URL (Depth 3, Accepted)
This is what I want to do.
So I thought I should set maxDepth to 3 and add a filter that accepts iframe.src (the referrer tag).
Short of writing your own filter class, you can try using a ScriptFilter. The following should work (change "1" to whatever depth you want):
<importer>
  <preParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
      <script><![CDATA[
        isFromIFrame = metadata.getString('collector.referrer-link-tag') == 'iframe.src';
        depth = metadata.getInt('collector.depth');
        /*return*/ (depth < 1 || isFromIFrame);
      ]]></script>
    </filter>
  </preParseHandlers>
</importer>
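For clarity, the filter's accept rule can be restated as plain Java (illustrative only; the class and method names are invented and this is not the Norconex API, though the metadata values match those used in the script above): a document is accepted when it is shallower than the depth limit, or when it was reached through an iframe regardless of depth.

```java
// Illustrative restatement of the ScriptFilter logic above.
// Not Norconex API code; class and method names are invented.
public class IFrameDepthRule {

    /**
     * Accepts a document when it is shallower than maxDepth, or when it was
     * reached via an iframe (referrer link tag "iframe.src"), at any depth.
     */
    static boolean accept(int depth, String referrerLinkTag, int maxDepth) {
        boolean isFromIFrame = "iframe.src".equals(referrerLinkTag);
        return depth < maxDepth || isFromIFrame;
    }

    public static void main(String[] args) {
        System.out.println(accept(0, "a.href", 1));      // true: shallow enough
        System.out.println(accept(1, "a.href", 1));      // false: too deep
        System.out.println(accept(1, "iframe.src", 1));  // true: iframe exception
    }
}
```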
For it to work, you have to keep referrer data when you crawl. If you are using the latest snapshot those are always stored so it should just work. Otherwise, you may have to add this first to your crawler config:
<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor"
      keepReferrerData="true" />
</linkExtractors>
In any case, this solution will only work if your iframes are always encountered at the same depth. If the depth may vary, I would suggest you try to rely on other patterns to exclude/include just what you want.
Did the suggested approach work for you? Can we close this issue?
Oh, sorry for the late reply.
I tried it and verified the behavior.
Now it can crawl a-links (< maxDepth) and iframes (<= maxDepth) :+1:
Thank you very much, really.
Glad you have a working solution. Thanks for confirming.
Hello :) I'm sorry for my poor English.
Thank you for your kindness.
I'm trying to make a crawler that handles depth based on the URL type.
For example:
'iframe url' => handled at the same depth as its parent (referrer)
'a link' => handled at +1 depth
That way, the crawler would not stop collecting iframe URLs because of the depth restriction.
I read the APIs and found some likely entry points (the fetcher and link extractor), but it is not clear how to do this.
Could you give me some advice?
Thank you :)
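The behavior asked for above could be sketched as an "effective depth" computation (plain Java with invented names; this is not how the crawler's API works, just an illustration of the requested rule): an iframe URL inherits its parent's depth, while an a-link goes one level deeper.

```java
// Illustrative sketch of the requested depth rule (invented names, not the Norconex API).
public class EffectiveDepth {

    /** iframe URLs keep the parent's depth; regular links go one level deeper. */
    static int effectiveDepth(int parentDepth, boolean fromIFrame) {
        return fromIFrame ? parentDepth : parentDepth + 1;
    }

    public static void main(String[] args) {
        System.out.println(effectiveDepth(2, false)); // 3: an a-link adds a level
        System.out.println(effectiveDepth(2, true));  // 2: an iframe stays with its parent
    }
}
```

Under this rule, an iframe found on a page at the depth limit would still be collected, because it never exceeds its parent's depth.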