Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Import RSS Feed #14

Closed LyesHocine closed 9 years ago

LyesHocine commented 9 years ago

Hi, I want to collect pages from an RSS feed. This is my crawler config, but I get no results. Please help me.

<httpcollector id="IDOL HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlerDefaults>  
    <numThreads>4</numThreads>
    <maxDepth>1</maxDepth>
    <maxDocuments>-1</maxDocuments>
    <keepDownloads>false</keepDownloads>
    <orphansStrategy>IGNORE</orphansStrategy>    
    <referenceFilters>
      <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude" caseSensitive="false" >
        jpg,gif,png,ico,css,js</filter>    
    </referenceFilters>          
    <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
     </importer> 

    <committer class="com.norconex.committer.idol.IdolCommitter">

        <!-- To commit documents to IDOL or DIH: -->
        <databaseName>Webcontent</databaseName>

        <!-- To commit documents to CFS: -->
        <host>127.0.0.1</host>
        <indexPort>9001</indexPort>
        <dreAddDataParams>
            <param name="Job">Norconex Job</param>
        </dreAddDataParams>
    </committer>
  </crawlerDefaults>

  <crawlers>    
    <crawler id="Rss Ou Va Algerie">
      <startURLs>
        <url>http://www.lefigaro.fr/rss/figaro_politique.xml</url>
      </startURLs>
      <workDir>./examples-output/Ou_Va_Algerie</workDir>     
      <sitemap ignore="true" /> 
      <delay default="5000" />    
      <referenceFilters>
        <filter 
            class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include" >
        http://www.lefigaro.fr/.*
        </filter>
      </referenceFilters>
    </crawler>
  </crawlers>

</httpcollector>
martinfou commented 9 years ago

Quick question: did you copy the .jar files from the IDOL committer as described in the installation documentation?

Installation

This committer is a library that you must include in another product's classpath (along with required dependencies). For use with a Norconex Collector, follow these simple steps:

LyesHocine commented 9 years ago

Hi, there is no problem with IDOL; everything works fine with HTML web pages. My problem is with RSS feeds. Is there any special configuration for them?

Thanks.

essiembre commented 9 years ago

I reformatted your message so we can clearly see the XML tags now.

I was able to try your config. The RSS feed gets parsed and the text is committed. Am I right in assuming that when you say "it is not working", you mean you would like the individual URLs in the <link> tags to be followed and crawled? Or would you want to split the <item> tags in the RSS feed and create a new document for each one? Both are possible. I'll play with it when I have a chance and get back to you.
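
For the second option (one document per <item>), a splitter in the Importer module is one way to go. A minimal sketch, assuming the DOMSplitter handler is available in your Importer version and that each feed entry is wrapped in an <item> element:

<importer>
    <preParseHandlers>
        <!-- Split the feed so every <item> element becomes its own document
             (DOMSplitter availability and the "item" selector are assumptions). -->
        <splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
                  selector="item" />
    </preParseHandlers>
</importer>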

essiembre commented 9 years ago

If you get the latest snapshot, you can tell the HtmlLinkExtractor which tags are holding the URLs (before you had to specify both a tag and attribute).

So one way to crawl all pages in an RSS feed is to first register the HtmlLinkExtractor with its default settings (to handle HTML pages), and then add one specific to your RSS feed, like this (in your <crawler ...>):

<linkExtractors>
    <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor" />
    <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor">
        <contentTypes>
         application/xml
        </contentTypes>
        <tags>
          <tag name="link" />
        </tags>
    </extractor>      
</linkExtractors>

The above will extract URLs out of <link> tags and crawl them. If you do not want to store the RSS page itself, you can filter it out in the <importer> module.
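
For that last point, one way to drop the feed document itself is to reject it based on its content type in the Importer. A minimal sketch, assuming the RegexMetadataFilter handler and that the feed is detected as application/xml or application/rss+xml:

<importer>
    <postParseHandlers>
        <!-- Reject the RSS feed document itself and keep only the pages
             crawled from it (the content type values are assumptions). -->
        <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                onMatch="exclude" field="document.contentType">
            application/(rss\+)?xml
        </filter>
    </postParseHandlers>
</importer>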

If it is something else you are after, please elaborate.

LyesHocine commented 9 years ago

Thanks a lot for your help,

I tried what you gave me, but the result is that the page itself is crawled, not the links.

What should I do to get the <link> tags crawled?

This is my crawler config.

Thanks again.

(The crawler config's XML tags were lost when it was posted; only the values remain: ./examples-output/RSS/progress, ./examples-output/RSS/logs, http://www.elwatan.com/actualite/rss.xml, ./examples-output/RSS, 1, application/xml, title,keywords,description,document.reference, ./examples-output/RSS/crawledFiles.)
OkkeKlein commented 9 years ago

Check http://www.norconex.com/how-to-crawl-facebook/ and create a LinkExtractor for your RSS.

LyesHocine commented 9 years ago

Hi OkkeKlein, thanks for responding, but as essiembre suggested, there is already an HtmlLinkExtractor that should do the job, as explained here: http://www.norconex.com/collectors/collector-http/latest/apidocs/com/norconex/collector/http/url/impl/HtmlLinkExtractor.html

LyesHocine commented 9 years ago

Sorry, I closed this by mistake.

OkkeKlein commented 9 years ago

Ah yes, a new feature. But this one looks for the <link> tag, not the <url> tag your example is using.
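
If the URLs really do sit directly inside <url> elements rather than <link> elements, the same extractor can simply be pointed at that tag as well. A minimal sketch, assuming the URL is the plain text content of the <url> element:

<linkExtractors>
    <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor">
        <contentTypes>application/xml</contentTypes>
        <tags>
            <!-- Which tag applies depends on the actual feed structure;
                 listing both is harmless. -->
            <tag name="link" />
            <tag name="url" />
        </tags>
    </extractor>
</linkExtractors>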

LyesHocine commented 9 years ago

But if I look at the RSS file (which is XML), there is a tag called "link" containing the link I want to follow.

essiembre commented 9 years ago

@LyesHocine, link extraction is a task performed while crawling by the HTTP Collector (not the Importer). Are you sure you are using the latest snapshot release of the HTTP Collector? The code sample I provided in this thread was implemented recently. Try with the latest, and if you still have issues with link extraction, please open a new issue in the HTTP Collector project.