Closed sylvainroussy closed 2 years ago
Make sure regex is correct and regex characters like '?' are escaped. ^https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/\?uri=.*
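To illustrate the escaping point, here is a minimal sketch using plain `java.util.regex` (not the crawler's own filter classes). The class and method names are made up for the example; the URL and both patterns are the ones quoted in this thread. It assumes the filter requires the whole URL to match.

```java
import java.util.regex.Pattern;

class EscapeQuestionMark {
    // URL quoted in this thread.
    static final String URL =
        "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52022M10723";

    // Unescaped: "/?" means "an optional slash", so the literal '?' is never matched.
    static final Pattern UNESCAPED =
        Pattern.compile("https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=.*");

    // Escaped: "\\?" matches the literal question mark, "\\." a literal dot.
    static final Pattern ESCAPED =
        Pattern.compile("https://eur-lex\\.europa\\.eu/legal-content/EN/TXT/PDF/\\?uri=.*");

    // Whole-string match (assumption: the filter does not accept partial matches).
    static boolean fullMatch(Pattern pattern, String value) {
        return pattern.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println("unescaped: " + fullMatch(UNESCAPED, URL)); // false
        System.out.println("escaped:   " + fullMatch(ESCAPED, URL));   // true
    }
}
```

With the unescaped form, the URL is rejected even though it looks like an obvious match, which is why the `?` needs a backslash.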
Bonjour Sylvain,
Have you resolved your issue? If not, can you share a config to reproduce it?
Also, make sure your start URL is not excluded by your filter rules.
Hi Pascal,
The real pattern is: https://eur-lex\.europa\.eu/legal-content/EN/TXT/PDF/\?uri=.* (we offer users a simplified form, which is transformed into a real regular expression). This pattern matches the following URL: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52022M10723 (checked on https://www.freeformatter.com/java-regex-tester.html)
What is strange is that not all the URLs seem to be rebuilt into their complete form. Some examples:
1) ./legal-content/AUTO/?uri=CELEX:72019L0904IRL_202202086&qid=1653030108454&rid=2
2) https://eur-lex.europa.eu/./images/n/eurlex-logo.jpg
3) https://eur-lex.europa.eu/./browse/institutions/auditors.html
Source code for point 1:
<a id="cellar_e00cf74f-b3f8-11ec-9d96-01aa75ed71a1" href="./legal-content/AUTO/?uri=CELEX:72019L0904IRL_202202086&qid=1653030108454&rid=2" class="title" name="https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:72019L0904IRL_202202086">European Union (Single Use Plastics) (Amendment) Regulations 2022</a>
Source code for point 2:
<img class="pdf-logo-img" src="./images/n/eurlex-logo.jpg" alt="Back to EUR-Lex homepage" />
Source code for point 3:
<li class=""><a href="./browse/institutions/auditors.html" id="Court-of-Auditors" title="European Court of Auditors">European Court of Auditors</a></li>
Strange behaviour, isn't it?
As you know, it is difficult to provide you with a complete configuration because the configuration is built directly in Java code.
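For comparison, the fully rebuilt form one would expect for the three references above can be sketched with the JDK's `java.net.URI`. This is only a reference point, not how the crawler resolves links internally. The class name is hypothetical, and the base URL is the start URL with its query string dropped for brevity.

```java
import java.net.URI;

class RelativeUrlResolution {
    // Resolves a relative reference against a base URL and removes dot segments.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).normalize().toString();
    }

    public static void main(String[] args) {
        // Assumption: start URL from this thread, without its query string.
        String base = "https://eur-lex.europa.eu/search.html";

        // The three relative references quoted in the HTML snippets above.
        String[] relatives = {
            "./legal-content/AUTO/?uri=CELEX:72019L0904IRL_202202086&qid=1653030108454&rid=2",
            "./images/n/eurlex-logo.jpg",
            "./browse/institutions/auditors.html"
        };
        for (String relative : relatives) {
            System.out.println(resolve(base, relative));
        }
    }
}
```

In all three cases the leading `./` disappears and the colon in the query string of the first reference is left untouched.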
I was able to reproduce and I'll investigate the issue.
FYI, you can share your config using code as well:
// Serialize the collector configuration to XML so it can be shared.
XML xml = new XML("collector");
myCollectorInstance.getCollectorConfig().saveToXML(xml);
String printOrSaveThisString = xml.toString(2);
It is equivalent to generating it on the command line with the configrender command.
Hi Pascal,
Here is my configuration:
<collector id="3940f689-b7b9-4581-9a14-81ee2bbfdfce">
<numThreads>2</numThreads>
<maxDocuments>-1</maxDocuments>
<stopOnExceptions/>
<orphansStrategy>IGNORE</orphansStrategy>
<dataStoreEngine class="com.norconex.collector.core.store.impl.mongodb.MongoDataStoreEngine">
<connectionString>mongodb://*.*.*.*:****/oxway-fetchs-defaultGroupNumeric</connectionString>
</dataStoreEngine>
<referenceFilters>
<filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/search\.html\?textScope0=ti&lang=en&SUBDOM_INIT=ALL_ALL&DTS_DOM=ALL&type=advanced&DTS_SUBDOM=ALL_ALL&qid=1653030108454&andText0=plastic%3F&sortOne=DD&sortOneOrder=desc</valueMatcher>
</filter>
<filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/legal-content/EN/TXT/PDF/.*</valueMatcher>
</filter>
<filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/search\.html\?textScope0=ti&lang=en&SUBDOM_INIT=ALL_ALL&DTS_DOM=ALL&type=advanced&DTS_SUBDOM=ALL_ALL&qid=1653030108454&andText0=plastic%3F&sortOne=DD&sortOneOrder=desc</valueMatcher>
</filter>
<filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/\./legal-content/EN/TXT/PDF/.*</valueMatcher>
</filter>
<filter caseSensitive="false" class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="EXCLUDE">jpg,gif,png,ico,css,js,gz,bz,jpeg</filter>
</referenceFilters>
<metadataFilters/>
<documentFilters>
<filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/legal-content/EN/TXT/PDF/.*</valueMatcher>
</filter>
<filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/search\.html\?textScope0=ti&lang=en&SUBDOM_INIT=ALL_ALL&DTS_DOM=ALL&type=advanced&DTS_SUBDOM=ALL_ALL&qid=1653030108454&andText0=plastic%3F&sortOne=DD&sortOneOrder=desc</valueMatcher>
</filter>
<filter class="com.norconex.collector.core.filter.impl.ReferenceFilter" onMatch="INCLUDE">
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">https://eur-lex\.europa\.eu/\./legal-content/EN/TXT/PDF/.*</valueMatcher>
</filter>
</documentFilters>
<importer class="com.norconex.importer.ImporterConfig">
<tempDir>/tmp</tempDir>
<parseErrorsSaveDir/>
<maxMemoryInstance>100000000</maxMemoryInstance>
<maxMemoryPool>1000000000</maxMemoryPool>
<preParseHandlers>
<handler class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="html">
<dom delete="false" extract="text" matchBlanks="false" onSet="OPTIONAL" selector="body" toField="fetch_content"/>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/atom+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/mathml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/rss+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/vnd.wap.xhtml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/x-asp</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xhtml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xslt+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">image/svg+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/html</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">text/html</valueMatcher>
</restrictTo>
</handler>
<handler class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="html">
<dom delete="false" extract="text" matchBlanks="false" onSet="OPTIONAL" selector="time[itemprop=datePublished]" toField="fetch_date"/>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/atom+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/mathml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/rss+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/vnd.wap.xhtml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/x-asp</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xhtml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xslt+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">image/svg+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/html</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">text/html</valueMatcher>
</restrictTo>
</handler>
<handler class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="html">
<dom delete="false" extract="text" matchBlanks="false" onSet="OPTIONAL" selector="title" toField="fetch_title"/>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/atom+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/mathml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/rss+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/vnd.wap.xhtml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/x-asp</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xhtml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xslt+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">image/svg+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/html</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="REGEX" partial="false" replaceAll="false">text/html</valueMatcher>
</restrictTo>
</handler>
</preParseHandlers>
<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
<fallbackParser class="com.norconex.importer.parser.impl.FallbackParser"/>
<parsers>
<parser class="com.norconex.importer.parser.impl.xfdl.XFDLParser" contentType="application/vnd.xfdl"/>
</parsers>
</documentParserFactory>
<postParseHandlers>
<handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
<constant name="fetch-id">defaultGroupNumeric_3940f689-b7b9-4581-9a14-81ee2bbfdfce</constant>
</handler>
</postParseHandlers>
<responseProcessors/>
</importer>
<committers>
<committer class="com.oxway.fetcher.norconex.committers.SampleCommitter"/>
</committers>
<metadataChecksummer class="com.norconex.collector.http.checksum.impl.LastModifiedMetadataChecksummer" keep="false" toField="collector.checksum-metadata"/>
<metadataDeduplicate>false</metadataDeduplicate>
<documentChecksummer class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer" combineFieldsAndContent="false" keep="false" toField="collector.checksum-doc">
<fieldMatcher ignoreCase="false" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false"/>
</documentChecksummer>
<documentDeduplicate>false</documentDeduplicate>
<spoiledReferenceStrategizer class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer" fallbackStrategy="DELETE">
<mapping state="BAD_STATUS" strategy="GRACE_ONCE"/>
<mapping state="NOT_FOUND" strategy="DELETE"/>
<mapping state="ERROR" strategy="GRACE_ONCE"/>
</spoiledReferenceStrategizer>
<eventListeners>
<listener class="com.oxway.fetcher.norconex.listeners.MongoDB2StatiticsListener"/>
<listener class="com.norconex.collector.core.crawler.event.impl.StopCrawlerOnMaxEventListener" maximum="100" onMultiple="ANY">
<eventMatcher ignoreCase="false" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">DOCUMENT_COMMITTED_UPSERT</eventMatcher>
</listener>
</eventListeners>
<maxDepth>1</maxDepth>
<keepDownloads>false</keepDownloads>
<keepReferencedLinks>INSCOPE</keepReferencedLinks>
<fetchHttpHead>DISABLED</fetchHttpHead>
<fetchHttpGet>REQUIRED</fetchHttpGet>
<startURLs async="false" includeSubdomains="false" stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
<url>https://eur-lex.europa.eu/search.html?textScope0=ti&lang=en&SUBDOM_INIT=ALL_ALL&DTS_DOM=ALL&type=advanced&DTS_SUBDOM=ALL_ALL&qid=1653030108454&andText0=plastic%3F&sortOne=DD&sortOneOrder=desc</url>
</startURLs>
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer" disabled="false">
<normalizations>removeFragment,lowerCaseSchemeHost,upperCaseEscapeSequence,decodeUnreservedCharacters,removeDefaultPort,encodeNonURICharacters,removeDotSegments,removeFragment</normalizations>
</urlNormalizer>
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver" default="3000" ignoreRobotsCrawlDelay="false" scope="crawler"/>
<robotsTxt class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider" ignore="false"/>
<sitemapResolver class="com.norconex.collector.http.sitemap.impl.GenericSitemapResolver" ignore="true" lenient="false">
<tempDir/>
<path>/sitemap.xml</path>
<path>/sitemap_index.xml</path>
</sitemapResolver>
<canonicalLinkDetector/>
<recrawlableResolver class="com.oxway.fetcher.norconex.http.recrawl.OxwayCacheRecrawlableResolver"/>
<httpFetchers maxRetries="0" retryDelay="0">
<fetcher class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">
<forceContentTypeDetection>false</forceContentTypeDetection>
<forceCharsetDetection>false</forceCharsetDetection>
<validStatusCodes>200</validStatusCodes>
<notFoundStatusCodes>404</notFoundStatusCodes>
<headersPrefix/>
<userAgent>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36</userAgent>
<cookieSpec>standard</cookieSpec>
<proxySettings>
<host/>
<scheme/>
<realm/>
<credentials>
<username/>
<password/>
<passwordKey/>
</credentials>
</proxySettings>
<connectionTimeout>30000</connectionTimeout>
<socketTimeout>30000</socketTimeout>
<connectionRequestTimeout>30000</connectionRequestTimeout>
<connectionCharset/>
<expectContinueEnabled>false</expectContinueEnabled>
<maxRedirects>50</maxRedirects>
<localAddress/>
<maxConnections>200</maxConnections>
<maxConnectionsPerRoute>20</maxConnectionsPerRoute>
<maxConnectionIdleTime>10000</maxConnectionIdleTime>
<maxConnectionInactiveTime>0</maxConnectionInactiveTime>
<headers/>
<disableIfModifiedSince>false</disableIfModifiedSince>
<disableETag>false</disableETag>
<redirectURLProvider class="com.norconex.collector.http.fetch.util.GenericRedirectURLProvider" fallbackCharset="UTF-8"/>
<trustAllSSLCertificates>false</trustAllSSLCertificates>
<disableSNI>false</disableSNI>
<disableHSTS>false</disableHSTS>
<httpMethods>GET,HEAD</httpMethods>
<referenceFilters/>
</fetcher>
</httpFetchers>
<robotsMeta class="com.norconex.collector.http.robot.impl.StandardRobotsMetaProvider" ignore="true"/>
<linkExtractors>
<extractor class="com.norconex.collector.http.link.impl.HtmlLinkExtractor" commentsEnabled="false" ignoreNofollow="false" maxURLLength="2048">
<schemes>http,https,ftp</schemes>
<tags>
<tag attribute="href" name="a"/>
<tag attribute="src" name="img"/>
<tag attribute="http-equiv" name="meta"/>
<tag attribute="src" name="iframe"/>
<tag attribute="src" name="frame"/>
</tags>
<fieldMatcher ignoreCase="false" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false"/>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/vnd.wap.xhtml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">application/xhtml+xml</valueMatcher>
</restrictTo>
<restrictTo>
<fieldMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">document.contentType</fieldMatcher>
<valueMatcher ignoreCase="true" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false">text/html</valueMatcher>
</restrictTo>
</extractor>
</linkExtractors>
<preImportProcessors/>
<postImportProcessors/>
<postImportLinks keep="false">
<fieldMatcher ignoreCase="false" ignoreDiacritic="false" method="BASIC" partial="false" replaceAll="false"/>
</postImportLinks>
</collector>
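As a rough way to reason about which references the INCLUDE filters in this configuration let through, here is a sketch with plain `java.util.regex`. It is not the ReferenceFilter implementation. It assumes that `partial="false"` means the whole reference must match, that `ignoreCase="true"` corresponds to case-insensitive matching, and that a reference is accepted when at least one INCLUDE pattern matches. The class name and the second test URL (the `/./` variant) are made up for the example.

```java
import java.util.List;
import java.util.regex.Pattern;

class IncludeFilterCheck {
    // The two PDF-related INCLUDE patterns from the referenceFilters above.
    static final List<Pattern> INCLUDES = List.of(
        Pattern.compile("https://eur-lex\\.europa\\.eu/legal-content/EN/TXT/PDF/.*",
            Pattern.CASE_INSENSITIVE),
        Pattern.compile("https://eur-lex\\.europa\\.eu/\\./legal-content/EN/TXT/PDF/.*",
            Pattern.CASE_INSENSITIVE));

    // Accepted when at least one INCLUDE pattern matches the whole reference.
    static boolean accepted(String reference) {
        return INCLUDES.stream().anyMatch(p -> p.matcher(reference).matches());
    }

    public static void main(String[] args) {
        String[] references = {
            // Absolute PDF URL quoted in this thread.
            "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52022M10723",
            // Hypothetical variant with a leftover "/./" segment.
            "https://eur-lex.europa.eu/./legal-content/EN/TXT/PDF/?uri=CELEX:52022M10723",
            // Unresolved relative reference, as it appears in the rejection log.
            "./legal-content/EN/TXT/HTML/?uri=CELEX:32021D1752"
        };
        for (String reference : references) {
            System.out.println(accepted(reference) + "  " + reference);
        }
    }
}
```

Under these assumptions a reference that is still relative when it reaches the filters cannot match any of the absolute patterns, which is consistent with the REJECTED_FILTER message reported in this issue.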
The issue was a mishandling of the colon in your URL. I made a new snapshot release with a fix. Please confirm.
It works! Thank you very much. Closing.
Hello!
From this start URL : https://eur-lex.europa.eu/search.html?textScope0=ti&lang=en&SUBDOM_INIT=ALL_ALL&DTS_DOM=ALL&type=advanced&DTS_SUBDOM=ALL_ALL&qid=1653030108454&andText0=plastic%3F&sortOne=DD&sortOneOrder=desc
We are not able to download relative URLs (for PDF docs), for example: CrawlerEvent.REJECTED_FILTER - ./legal-content/EN/TXT/HTML/?uri=CELEX:32021D1752 - No "include" reference filters matched.
With this filter pattern: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=.*
Of course, the removeDotSegments normalization is used.
Is something wrong here? Thanks.