LyesHocine closed this issue 9 years ago
Quick question: Did you copy the .jar files from the IDOL Committer as described in the installation documentation?
This committer is a library that you must include in another product's classpath (along with required dependencies). For use with a Norconex Collector, follow these simple steps:
Hi, there is no problem with IDOL. Everything works fine with HTML web pages. My problem is with RSS feeds. Is there any special configuration for them?
Thanks.
I reformatted your message so we can clearly see the XML tags now.
I was able to try your config. The RSS feed gets parsed and the text is committed. Am I assuming right that when you say "it is not working", you mean you would like individual URLs in the <link> tags to be followed and crawled? Or would you want to split the <item> tags in the RSS feed and create a new document for each one? Both are possible. I'll play with it when I have a chance and get back to you.
If you get the latest snapshot, you can tell the HtmlLinkExtractor which tags hold the URLs (previously you had to specify both a tag and an attribute).
So one way to crawl all pages in an RSS feed is to first register the HtmlLinkExtractor with its default settings (to handle HTML pages), and add one specific to your RSS feed, like this (in your <crawler ...>):
<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor" />
  <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor">
    <contentTypes>application/xml</contentTypes>
    <tags>
      <tag name="link" />
    </tags>
  </extractor>
</linkExtractors>
The above will extract URLs out of <link> tags and crawl them. If you do not want to store the RSS page itself, you can filter it out in the <importer> module.
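As a sketch of that importer filtering, something like the following could exclude the feed document itself so only the linked pages are kept. This assumes the RegexReferenceFilter handler from the Importer module, and the feed URL pattern is a placeholder; verify the class name and syntax against the Importer documentation for your version:

```xml
<importer>
  <preParseHandlers>
    <!-- Exclude the feed document itself; keep only the pages it links to.
         The URL pattern below is a placeholder for your actual feed URL. -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexReferenceFilter"
        onMatch="exclude">
      .*/feed\.xml
    </filter>
  </preParseHandlers>
</importer>
```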
If it is something else you are after, please elaborate.
Thanks a lot for your help,
I tried what you gave me, but the result is that the page is crawled but not the links. What should I do to get the <link> tags crawled? This is my crawler config. Thanks again.
Check http://www.norconex.com/how-to-crawl-facebook/ and create a LinkExtractor for your RSS.
Hi OkkeKlein, thanks for responding, but as essiembre suggested, there is already an HtmlLinkExtractor that should do the job, as explained here: http://www.norconex.com/collectors/collector-http/latest/apidocs/com/norconex/collector/http/url/impl/HtmlLinkExtractor.html
Sorry, I closed this by mistake.
Ah yes, a new feature. But this one is looking for the <link> tag, not the <url> tag that your example is using.
But if I look at the RSS file (XML type), there is a tag called "link" where I can find the link that I want to follow.
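For reference, a minimal RSS 2.0 item does carry its URL in a <link> tag, which is the tag the HtmlLinkExtractor configuration earlier in this thread targets (all values below are placeholder examples):

```xml
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>Example article</title>
      <!-- The URL to follow lives in this <link> tag -->
      <link>http://example.com/article-1.html</link>
    </item>
  </channel>
</rss>
```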
@LyesHocine, link extraction is a task performed while crawling by the HTTP Collector (not the Importer). Are you sure you are using the latest snapshot release of the HTTP Collector? The code sample I provided in this thread was implemented recently. Try with the latest, and if you still have issues with link extraction, please open a new issue in the HTTP Collector project.
Hi, I want to collect pages from an RSS feed. This is my crawler config, but I get no results. Please help me.