Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Repeatable crawler runs #557

Closed. ciroppina closed this issue 5 years ago.

ciroppina commented 5 years ago

Dear Sirs,

I want to configure my crawler so that it crawls the start URLs every 30 minutes. I tried using both the <schedule> tags and the <recrawlableResolver> tag, but when the crawler job ends, the connector terminates. I would expect it to remain in a "waiting to restart..." state, but maybe my expectations are wrong. What is the right way to configure the behavior I need?

Here is my crawler configuration:

<crawler id="canale_astralspa">
<!-- UNCOMMENT START-URLS TO MAKE THIS CRAWLER WORK -->

    <!-- Requires at least one start URL (or urlsFile). 
    Optionally limit crawling to same protocol/domain/port as 
    start URLs. 
    -->
    <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <url>http://www.astralspa.it/?page_id=1787</url>
    </startURLs>

    <userAgent>Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0</userAgent>

    <!-- Specify a crawler default directory where to generate files. -->
    <workDir>./tasks-output/astralspa</workDir>

    <!-- Put a maximum depth to avoid infinite crawling (default: -1). -->
    <maxDepth>2</maxDepth>

    <robotsTxt ignore="true"/>
    <robotsMeta ignore="true"/>

    <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
    <sitemapResolverFactory ignore="true" />

    <!-- Be as nice as you can with sites you crawl. -->
    <delay default="2000" ignoreRobotsCrawlDelay="true" class="$delayResolver"
           scope="crawler" >
        <schedule dayOfWeek="from Monday to Sunday" 
                  dayOfMonth="from 1 to 31"
                  time="from 08:30 to 20:30">1800</schedule>
    </delay>

    <!-- recrawlableResolver class="$recrawlResolver" sitemapSupport="never" >
        <minFrequency applyTo="reference" value="1800000">.*astralspa.*</minFrequency>
    </recrawlableResolver -->

    <!-- keep downloaded pages/files in the './astralspa/downloads/' folder on your filesystem -->
    <keepDownloads>false</keepDownloads>

    <!-- Optionally filter URLs BEFORE any download. Classes must implement
     com.norconex.collector.core.filter.IReferenceFilter, like the following examples.
    -->
    <referenceFilters>
        <!-- exclude extension filter -->
        <filter class="$filterExtension" onMatch="exclude" >
            jpg,gif,png,ico,bmp,tiff,svg,jpeg,css,js,less,json</filter>
        <!-- regex filters -->
        <filter class="$filterRegexRef">.*astralspa.*</filter>
        <filter class="$filterRegexRef">.*regione.lazio.it/binary/.*</filter>
        <filter class="$filterRegexRef" onMatch="exclude">.*image.*|.*gallery.*|.*json.*|.*ical=.*|.*/css/.*</filter>
    </referenceFilters>

    <!-- Document importing -->
    <importer>
        <postParseHandlers>

          <!-- levels: FATAL|ERROR|WARN|INFO|DEBUG|TRACE -->
          <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="DEBUG"/>

            <!-- If your target repository does not support arbitrary fields,
            make sure you only keep the fields you need. -->
            <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
              <fields>title,keywords,description,document.reference,document.contentType,collector.referenced-urls</fields>
            </tagger>
            <!-- adds a constant metadata field: FromTask -->
            <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
                <constant name="FromTask">astralspa_task</constant>
            </tagger>
        </postParseHandlers>
    </importer>
</crawler>
essiembre commented 5 years ago

The HTTP Collector does not implement its own scheduler. Instead, it relies on an operating-system scheduler such as crontab on Linux/Unix, Task Scheduler on Windows, or any other scheduling process your organization may have adopted.
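
For example, a minimal sketch of such an external schedule, assuming the stock collector-http.sh / collector-http.bat launchers shipped with the HTTP Collector distribution (the install paths, config file name, and task name below are illustrative assumptions, not taken from this thread):

    # Linux/Unix: crontab entry (edit with `crontab -e`) that starts the
    # collector every 30 minutes; paths are assumptions
    */30 * * * * /opt/norconex/collector-http/collector-http.sh -a start -c /opt/norconex/configs/astralspa-config.xml

    # Windows: equivalent one-time registration with Task Scheduler;
    # paths and task name are assumptions
    schtasks /Create /SC MINUTE /MO 30 /TN "NorconexAstralspa" /TR "C:\norconex\collector-http.bat -a start -c C:\norconex\configs\astralspa-config.xml"

With either approach the collector starts, runs the crawl to completion, and exits; the "waiting to restart" state described above is simply the time between scheduled launches.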

ciroppina commented 5 years ago

Many thanks, Mr. Pascal Essiembre. Issue closed.