caesetia closed this issue 5 years ago
It should rather be:
<documentFilters>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
https?://.*/Colleague/.*
</filter>
</documentFilters>
Did you find bad documentation somewhere that suggested your approach?
To keep only PDFs, you can use this instead (or in addition):
<documentFilters>
<filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="include">
pdf
</filter>
</documentFilters>
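For placement, here is a rough sketch of where that block would sit, assuming it goes directly under the crawler element (the error message refers to HttpCrawlerConfig and lists documentFilters among the expected elements). The crawler id is a made-up placeholder and the rest of the crawler settings are left out:

```xml
<!-- Sketch only: "example-crawler" is a hypothetical id, not from this thread. -->
<crawler id="example-crawler">

  <!-- ... your existing crawler settings stay as they are ... -->

  <!-- Use documentFilters here; httpDocumentFilters is what the validator rejected. -->
  <documentFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
      https?://.*/Colleague/.*
    </filter>
  </documentFilters>

</crawler>
```

The only change from the failing setup is the element name, so the filter class and pattern can stay exactly as shown above.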
I think I got that from a response to a Stack Overflow question, but I might have misunderstood it. Thanks for your help!
Hi, I'm trying to crawl a page and commit only URLs that have the word "colleague" in them. I tried adding the following in the minimum-config file:
but when I run the crawler, I get an error that says "(XML) HttpCrawlerConfig: cvc-complex-type.2.4.a: Invalid content was found starting with element 'httpDocumentFilters'. One of '{numThreads, maxDocuments, stopOnExceptions, orphansStrategy, referenceFilters, metadataFilters, documentFilters, crawlerListeners, crawlDataStoreFactory, documentChecksummer, committer, spoiledReferenceStrategizer, keepDownloads, keepOutOfScopeLinks, userAgent, urlNormalizer, httpClientFactory, robotsTxt, redirectURLProvider, recrawlableResolver, metadataFetcher, canonicalLinkDetector, metadataChecksummer, documentFetcher, robotsMeta, linkExtractors, preImportProcessors, postImportProcessors}' is expected." and URLs without the word "Colleague" are being committed too.
Alternatively, is there a way to commit only PDFs?
Thanks for any help you can provide.