Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Relative URLs Not being correctly resolved #788

Closed sylvainroussy closed 2 years ago

sylvainroussy commented 2 years ago

Hello!

From this start URL: https://eur-lex.europa.eu/search.html?textScope0=ti&lang=en&SUBDOM_INIT=ALL_ALL&DTS_DOM=ALL&type=advanced&DTS_SUBDOM=ALL_ALL&qid=1653030108454&andText0=plastic%3F&sortOne=DD&sortOneOrder=desc

We are not able to download relative URLs such as the following (for PDF docs): CrawlerEvent.REJECTED_FILTER - ./legal-content/EN/TXT/HTML/?uri=CELEX:32021D1752 - No "include" reference filters matched.

With this filter pattern: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=.*

Of course, the removeDotSegments normalizer is used.

Is something wrong here? Thanks.

UtsavVanodiya7 commented 2 years ago

Make sure the regex is correct and that regex metacharacters like '?' are escaped, for example: ^https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/\?uri=.*
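To illustrate the escaping advice above, here is a minimal sketch using only java.util.regex. The class and method names (RegexEscapeDemo, fullMatch) are made up for this example, and it assumes the filter does a full-string match, as partial="false" suggests. It only shows why an unescaped '?' cannot match the literal '?' in the URL; it does not claim this was the cause of the rejected links.

```java
import java.util.regex.Pattern;

class RegexEscapeDemo {
    // Sample PDF URL taken from this thread.
    static final String URL =
            "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52022M10723";

    // Full-string match (assumed to mirror a filter with partial="false").
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped '?' makes the preceding '/' optional instead of matching a literal '?'.
        String unescaped = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=.*";
        // Escaped '?' (and '.') are matched literally.
        String escaped = "https://eur-lex\\.europa\\.eu/legal-content/EN/TXT/PDF/\\?uri=.*";

        System.out.println("unescaped matches: " + fullMatch(unescaped, URL));
        System.out.println("escaped matches:   " + fullMatch(escaped, URL));
    }
}
```

With the unescaped pattern, the regex engine expects "uri=" right after "PDF" or "PDF/", so the literal '?' in the URL breaks the match.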

essiembre commented 2 years ago

Bonjour Sylvain,

Have you resolved your issue? If not, can you share a config to reproduce it?

Also, make sure your start URL is not excluded by your filter rules.

sylvainroussy commented 2 years ago

Hi Pascal,

The real pattern is: https://eur-lex\.europa\.eu/legal-content/EN/TXT/PDF/\?uri=.* (we have a simplified form for users, which is transformed into the real regular expression). This pattern matches the following URL: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52022M10723 (checked on https://www.freeformatter.com/java-regex-tester.html).

What is strange is that not all the URLs seem to be rebuilt in their complete form. Some examples:

1) ./legal-content/AUTO/?uri=CELEX:72019L0904IRL_202202086&qid=1653030108454&rid=2
2) https://eur-lex.europa.eu/./images/n/eurlex-logo.jpg
3) https://eur-lex.europa.eu/./browse/institutions/auditors.html

Source code for point 1: <a id="cellar_e00cf74f-b3f8-11ec-9d96-01aa75ed71a1" href="./legal-content/AUTO/?uri=CELEX:72019L0904IRL_202202086&amp;qid=1653030108454&amp;rid=2" class="title" name="https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:72019L0904IRL_202202086">European Union (Single Use Plastics) (Amendment) Regulations 2022</a>

Source code for point 2: <img class="pdf-logo-img" src="./images/n/eurlex-logo.jpg" alt="Back to EUR-Lex homepage" />

Source code for point 3: <li class=""><a href="./browse/institutions/auditors.html" id="Court-of-Auditors" title="European Court of Auditors">European Court of Auditors</a></li>

Strange behaviour, isn't it?

As you know, it is difficult to provide a complete configuration because the configuration is coded directly in Java.
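For reference, here is a rough sketch of what standard resolution would produce for the examples above, using only java.net.URI from the JDK. It is not the crawler's own resolution code, the class and method names (RelativeUrlDemo, resolve, removeDotSegments) are hypothetical, and the base URL is shortened to the search page without its query string (the query plays no part in resolving a relative path).

```java
import java.net.URI;

class RelativeUrlDemo {
    // Resolve a relative reference against a base URL using the JDK's URI support.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    // Remove "." and ".." path segments from an absolute URL.
    static String removeDotSegments(String url) {
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        String base = "https://eur-lex.europa.eu/search.html"; // shortened base (assumption)

        // Example 1: relative link whose query string contains a colon.
        System.out.println(resolve(base,
                "./legal-content/AUTO/?uri=CELEX:72019L0904IRL_202202086&qid=1653030108454&rid=2"));

        // Examples 2 and 3: absolute URLs still carrying a "/./" segment.
        System.out.println(removeDotSegments(
                "https://eur-lex.europa.eu/./images/n/eurlex-logo.jpg"));
        System.out.println(removeDotSegments(
                "https://eur-lex.europa.eu/./browse/institutions/auditors.html"));
    }
}
```

In all three cases the expected result is an absolute URL without the "./" segment, which is the form the include filters are written against.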

essiembre commented 2 years ago

I was able to reproduce and I'll investigate the issue.

FYI, you can share your config using code as well:

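// XML here is com.norconex.commons.lang.xml.XML
XML xml = new XML("collector");
myCollectorInstance.getCollectorConfig().saveToXML(xml);
// Serialize the config as a string, indented by 2.
String printOrSaveThisString = xml.toString(2);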

It is the equivalent of generating it on the command line with the configrender command.

sylvainroussy commented 2 years ago

Hi Pascal,

Here is my configuration:

<collector id="3940f689-b7b9-4581-9a14-81ee2bbfdfce">
  <numThreads>2</numThreads>
  <maxDocuments>-1</maxDocuments>
  <stopOnExceptions/>
  <orphansStrategy>IGNORE</orphansStrategy>
  <dataStoreEngine class="com.norconex.collector.core.store.impl.mongodb.MongoDataStoreEngine">
    <connectionString>mongodb://*.*.*.*:****/oxway-fetchs-defaultGroupNumeric</connectionString>
  </dataStoreEngine>
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
      <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/search\.html\?textScope0=ti&amp;lang=en&amp;SUBDOM_INIT=ALL_ALL&amp;DTS_DOM=ALL&amp;type=advanced&amp;DTS_SUBDOM=ALL_ALL&amp;qid=1653030108454&amp;andText0=plastic%3F&amp;sortOne=DD&amp;sortOneOrder=desc</valueMatcher>
    </filter>
    <filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
      <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/legal-content/EN/TXT/PDF/.*</valueMatcher>
    </filter>
    <filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
      <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/search\.html\?textScope0=ti&amp;lang=en&amp;SUBDOM_INIT=ALL_ALL&amp;DTS_DOM=ALL&amp;type=advanced&amp;DTS_SUBDOM=ALL_ALL&amp;qid=1653030108454&amp;andText0=plastic%3F&amp;sortOne=DD&amp;sortOneOrder=desc</valueMatcher>
    </filter>
    <filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
      <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/\./legal-content/EN/TXT/PDF/.*</valueMatcher>
    </filter>
    <filter caseSensitive="false" class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="EXCLUDE">jpg,gif,png,ico,css,js,gz,bz,jpeg</filter>
  </referenceFilters>
  <metadataFilters/>
  <documentFilters>
    <filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
      <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/legal-content/EN/TXT/PDF/.*</valueMatcher>
    </filter>
    <filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
      <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/search\.html\?textScope0=ti&amp;lang=en&amp;SUBDOM_INIT=ALL_ALL&amp;DTS_DOM=ALL&amp;type=advanced&amp;DTS_SUBDOM=ALL_ALL&amp;qid=1653030108454&amp;andText0=plastic%3F&amp;sortOne=DD&amp;sortOneOrder=desc</valueMatcher>
    </filter>
    <filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
      <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/\./legal-content/EN/TXT/PDF/.*</valueMatcher>
    </filter>
  </documentFilters>
  <importer class="com.norconex.importer.ImporterConfig">
    <tempDir>/tmp</tempDir>
    <parseErrorsSaveDir/>
    <maxMemoryInstance>100000000</maxMemoryInstance>
    <maxMemoryPool>1000000000</maxMemoryPool>
    <preParseHandlers>
      <handler class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="html">
        <dom delete="false" extract="text" matchBlanks="false" onSet="OPTIONAL" selector="body" toField="fetch_content"/>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/atom+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/mathml+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/rss+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/vnd.wap.xhtml+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/x-asp</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xhtml+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xslt+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">image/svg+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/html</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">text/html</valueMatcher>
        </restrictTo>
      </handler>
      <handler class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="html">
        <dom delete="false" extract="text" matchBlanks="false" onSet="OPTIONAL" selector="time[itemprop=datePublished]" toField="fetch_date"/>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/atom+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/mathml+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/rss+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/vnd.wap.xhtml+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/x-asp</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xhtml+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xslt+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">image/svg+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/html</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">text/html</valueMatcher>
        </restrictTo>
      </handler>
      <handler class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="html">
        <dom delete="false" extract="text" matchBlanks="false" onSet="OPTIONAL" selector="title" toField="fetch_title"/>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/atom+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/mathml+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/rss+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/vnd.wap.xhtml+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/x-asp</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xhtml+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xslt+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">image/svg+xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/html</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/xml</valueMatcher>
        </restrictTo>
        <restrictTo>
          <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
          <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">text/html</valueMatcher>
        </restrictTo>
      </handler>
    </preParseHandlers>
    <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
      <fallbackParser class="com.norconex.importer.parser.impl.FallbackParser"/>
      <parsers>
        <parser class="com.norconex.importer.parser.impl.xfdl.XFDLParser" contentType="application/vnd.xfdl"/>
      </parsers>
    </documentParserFactory>
    <postParseHandlers>
      <handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
        <constant name="fetch-id">defaultGroupNumeric_3940f689-b7b9-4581-9a14-81ee2bbfdfce</constant>
      </handler>
    </postParseHandlers>
    <responseProcessors/>
  </importer>
  <committers>
    <committer class="com.oxway.fetcher.norconex.committers.SampleCommitter"/>
  </committers>
  <metadataChecksummer class="com.norconex.collector.http.checksum.impl.LastModifiedMetadataChecksummer" keep="false" toField="collector.checksum-metadata"/>
  <metadataDeduplicate>false</metadataDeduplicate>
  <documentChecksummer class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer" combineFieldsAndContent="false" keep="false" toField="collector.checksum-doc">
    <fieldMatcher ignoreCase="false" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false"/>
  </documentChecksummer>
  <documentDeduplicate>false</documentDeduplicate>
  <spoiledReferenceStrategizer class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer" fallbackStrategy="DELETE">
    <mapping state="BAD_STATUS" strategy="GRACE_ONCE"/>
    <mapping state="NOT_FOUND" strategy="DELETE"/>
    <mapping state="ERROR" strategy="GRACE_ONCE"/>
  </spoiledReferenceStrategizer>
  <eventListeners>
    <listener class="com.oxway.fetcher.norconex.listeners.MongoDB2StatiticsListener"/>
    <listener class="com.norconex.collector.core.crawler.event.impl.StopCrawlerOnMaxEventListener" maximum="100" onMultiple="ANY">
      <eventMatcher ignoreCase="false" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">DOCUMENT_COMMITTED_UPSERT</eventMatcher>
    </listener>
  </eventListeners>
  <maxDepth>1</maxDepth>
  <keepDownloads>false</keepDownloads>
  <keepReferencedLinks>INSCOPE</keepReferencedLinks>
  <fetchHttpHead>DISABLED</fetchHttpHead>
  <fetchHttpGet>REQUIRED</fetchHttpGet>
  <startURLs async="false" includeSubdomains="false" stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
    <url>https://eur-lex.europa.eu/search.html?textScope0=ti&amp;lang=en&amp;SUBDOM_INIT=ALL_ALL&amp;DTS_DOM=ALL&amp;type=advanced&amp;DTS_SUBDOM=ALL_ALL&amp;qid=1653030108454&amp;andText0=plastic%3F&amp;sortOne=DD&amp;sortOneOrder=desc</url>
  </startURLs>
  <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer" disabled="false">
    <normalizations>removeFragment,lowerCaseSchemeHost,upperCaseEscapeSequence,decodeUnreservedCharacters,removeDefaultPort,encodeNonURICharacters,removeDotSegments,removeFragment</normalizations>
  </urlNormalizer>
  <delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver" default="3000" ignoreRobotsCrawlDelay="false" scope="crawler"/>
  <robotsTxt class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider" ignore="false"/>
  <sitemapResolver class="com.norconex.collector.http.sitemap.impl.GenericSitemapResolver" ignore="true" lenient="false">
    <tempDir/>
    <path>/sitemap.xml</path>
    <path>/sitemap_index.xml</path>
  </sitemapResolver>
  <canonicalLinkDetector/>
  <recrawlableResolver class="com.oxway.fetcher.norconex.http.recrawl.OxwayCacheRecrawlableResolver"/>
  <httpFetchers maxRetries="0" retryDelay="0">
    <fetcher class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">
      <forceContentTypeDetection>false</forceContentTypeDetection>
      <forceCharsetDetection>false</forceCharsetDetection>
      <validStatusCodes>200</validStatusCodes>
      <notFoundStatusCodes>404</notFoundStatusCodes>
      <headersPrefix/>
      <userAgent>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36</userAgent>
      <cookieSpec>standard</cookieSpec>
      <proxySettings>
        <host/>
        <scheme/>
        <realm/>
        <credentials>
          <username/>
          <password/>
          <passwordKey/>
        </credentials>
      </proxySettings>
      <connectionTimeout>30000</connectionTimeout>
      <socketTimeout>30000</socketTimeout>
      <connectionRequestTimeout>30000</connectionRequestTimeout>
      <connectionCharset/>
      <expectContinueEnabled>false</expectContinueEnabled>
      <maxRedirects>50</maxRedirects>
      <localAddress/>
      <maxConnections>200</maxConnections>
      <maxConnectionsPerRoute>20</maxConnectionsPerRoute>
      <maxConnectionIdleTime>10000</maxConnectionIdleTime>
      <maxConnectionInactiveTime>0</maxConnectionInactiveTime>
      <headers/>
      <disableIfModifiedSince>false</disableIfModifiedSince>
      <disableETag>false</disableETag>
      <redirectURLProvider class="com.norconex.collector.http.fetch.util.GenericRedirectURLProvider" fallbackCharset="UTF-8"/>
      <trustAllSSLCertificates>false</trustAllSSLCertificates>
      <disableSNI>false</disableSNI>
      <disableHSTS>false</disableHSTS>
      <httpMethods>GET,HEAD</httpMethods>
      <referenceFilters/>
    </fetcher>
  </httpFetchers>
  <robotsMeta class="com.norconex.collector.http.robot.impl.StandardRobotsMetaProvider" ignore="true"/>
  <linkExtractors>
    <extractor class="com.norconex.collector.http.link.impl.HtmlLinkExtractor" commentsEnabled="false" ignoreNofollow="false" maxURLLength="2048">
      <schemes>http,https,ftp</schemes>
      <tags>
        <tag attribute="href" name="a"/>
        <tag attribute="src" name="img"/>
        <tag attribute="http-equiv" name="meta"/>
        <tag attribute="src" name="iframe"/>
        <tag attribute="src" name="frame"/>
      </tags>
      <fieldMatcher ignoreCase="false" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false"/>
      <restrictTo>
        <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
        <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/vnd.wap.xhtml+xml</valueMatcher>
      </restrictTo>
      <restrictTo>
        <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
        <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xhtml+xml</valueMatcher>
      </restrictTo>
      <restrictTo>
        <fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
        <valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/html</valueMatcher>
      </restrictTo>
    </extractor>
  </linkExtractors>
  <preImportProcessors/>
  <postImportProcessors/>
  <postImportLinks keep="false">
    <fieldMatcher ignoreCase="false" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false"/>
  </postImportLinks>
</collector>

essiembre commented 2 years ago

The issue was a mishandling of the colon in your URL. I made a new snapshot release with a fix. Please confirm.
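The thread does not describe the actual fix, so the following is only an illustration of the general pitfall, not the crawler's code: a colon inside a query string can make a relative link look like it has a scheme if the check is too loose. The class and method names (SchemeCheckDemo, naiveHasScheme, strictHasScheme) are hypothetical; the strict check follows the scheme syntax of RFC 3986 (a letter followed by letters, digits, "+", "-" or ".", then ":").

```java
import java.util.regex.Pattern;

class SchemeCheckDemo {
    // RFC 3986 scheme syntax, anchored at the start of the reference.
    private static final Pattern SCHEME = Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*:");

    // Too loose: any colon after the first character is taken as a scheme separator.
    static boolean naiveHasScheme(String ref) {
        return ref.indexOf(':') > 0;
    }

    // Stricter: only a syntactically valid scheme at the very start counts.
    static boolean strictHasScheme(String ref) {
        return SCHEME.matcher(ref).find();
    }

    public static void main(String[] args) {
        String relative = "./legal-content/EN/TXT/HTML/?uri=CELEX:32021D1752";
        String absolute = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32021D1752";

        System.out.println("naive,  relative: " + naiveHasScheme(relative));
        System.out.println("strict, relative: " + strictHasScheme(relative));
        System.out.println("strict, absolute: " + strictHasScheme(absolute));
    }
}
```

A link wrongly classified as absolute would skip resolution against the page URL and reach the reference filters in its raw "./..." form, which is consistent with the REJECTED_FILTER event reported at the top of this issue.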

sylvainroussy commented 2 years ago

It works! Thank you very much. Closing.