Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Repeatable crawler runs #557

Closed. ciroppina closed this issue 5 years ago.

ciroppina commented 5 years ago

Dear Sirs,

I want to configure my crawler so that it crawls the start URLs every 30 minutes. I tried using both the <schedule> tags and the <recrawlableResolver> tag, but when the crawler job ends, the connector terminates. I would expect it to remain in a "waiting to restart..." state, but maybe my expectations are wrong. What is the right way to configure the behavior I need?

Here is my crawler configuration:

<crawler id="canale_astralspa">
<!-- UNCOMMENT START-URLS TO MAKE THIS CRAWLER WORK -->

    <!-- Requires at least one start URL (or urlsFile). 
    Optionally limit crawling to same protocol/domain/port as 
    start URLs. 
    -->
    <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <url>http://www.astralspa.it/?page_id=1787</url>
    </startURLs>

    <userAgent>Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0</userAgent>

    <!-- Specify a crawler default directory where to generate files. -->
    <workDir>./tasks-output/astralspa</workDir>

    <!-- Put a maximum depth to avoid infinite crawling (default: -1). -->
    <maxDepth>2</maxDepth>

    <robotsTxt ignore="true"/>
    <robotsMeta ignore="true"/>

    <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
    <sitemapResolverFactory ignore="true" />

    <!-- Be as nice as you can with sites you crawl. -->
    <delay default="2000" ignoreRobotsCrawlDelay="true" class="$delayResolver"
           scope="crawler" >
        <schedule dayOfWeek="from Monday to Sunday" 
                  dayOfMonth="from 1 to 31"
                  time="from 08:30 to 20:30">1800</schedule>
    </delay>

    <!-- recrawlableResolver class="$recrawlResolver" sitemapSupport="never" >
        <minFrequency applyTo="reference" value="1800000">.*astralspa.*</minFrequency>
    </recrawlableResolver -->

    <!-- keep downloaded pages/files in the './astralspa/downloads/' folder on your filesystem -->
    <keepDownloads>false</keepDownloads>

    <!-- Optionally filter URLs BEFORE any download. Classes must implement
     com.norconex.collector.core.filter.IReferenceFilter, like the following examples.
    -->
    <referenceFilters>
        <!-- exclude extension filter -->
        <filter class="$filterExtension" onMatch="exclude" >
            jpg,gif,png,ico,bmp,tiff,svg,jpeg,css,js,less,json</filter>
        <!-- regex filters -->
        <filter class="$filterRegexRef">.*astralspa.*</filter>
        <filter class="$filterRegexRef">.*regione.lazio.it/binary/.*</filter>
        <filter class="$filterRegexRef" onMatch="exclude">.*image.*|.*gallery.*|.*json.*|.*ical=.*|.*/css/.*</filter>
    </referenceFilters>

    <!-- Document importing -->
    <importer>
        <postParseHandlers>

          <!-- levels: FATAL|ERROR|WARN|INFO|DEBUG|TRACE -->
          <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="DEBUG"/>

            <!-- If your target repository does not support arbitrary fields,
            make sure you only keep the fields you need. -->
            <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
              <fields>title,keywords,description,document.reference,document.contentType,collector.referenced-urls</fields>
            </tagger>
            <!-- adds a constant metadata field: FromTask -->
            <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
                <constant name="FromTask">astralspa_task</constant>
            </tagger>
        </postParseHandlers>
    </importer>
</crawler>
essiembre commented 5 years ago

The HTTP Collector does not implement its own scheduler. Instead, it relies on an operating-system scheduler such as crontab on Linux/Unix, Task Scheduler on Windows, or any other scheduling process your organization may have adopted.
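
For example, a minimal sketch of such an external schedule, assuming the stock collector-http.sh / collector-http.bat launchers shipped with the HTTP Collector distribution (the install paths, config file name, and task name below are illustrative assumptions, not taken from this thread):

    # Linux/Unix: crontab entry (edit with `crontab -e`) that starts the
    # collector every 30 minutes; paths are assumptions
    */30 * * * * /opt/norconex/collector-http/collector-http.sh -a start -c /opt/norconex/configs/astralspa-config.xml

    # Windows: equivalent one-time registration with Task Scheduler;
    # paths and task name are assumptions
    schtasks /Create /SC MINUTE /MO 30 /TN "NorconexAstralspa" /TR "C:\norconex\collector-http.bat -a start -c C:\norconex\configs\astralspa-config.xml"

With either approach the collector starts, runs the crawl to completion, and exits; the "waiting to restart" state described above is simply the time between scheduled launches.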

ciroppina commented 5 years ago

Many thanks, Mr. Pascal Essiembre. Issue closed.