commit only urls that contain specific words

caesetia commented 5 years ago

Hi, I'm trying to crawl a page and commit only urls that have the word "colleague" in them. I tried adding the following in the minimum-config file:

   <httpDocumentFilters>
    <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter"
            onMatch="include" >
      http://.*/Colleague/.*
    </filter>
  </httpDocumentFilters>

but when I run the crawler, I get an error that says "(XML) HttpCrawlerConfig: cvc-complex-type.2.4.a: Invalid content was found starting with element 'httpDocumentFilters'. One of '{numThreads, maxDocuments, stopOnExceptions, orphansStrategy, referenceFilters, metadataFilters, documentFilters, crawlerListeners, crawlDataStoreFactory, documentChecksummer, committer, spoiledReferenceStrategizer, keepDownloads, keepOutOfScopeLinks, userAgent, urlNormalizer, httpClientFactory, robotsTxt, redirectURLProvider, recrawlableResolver, metadataFetcher, canonicalLinkDetector, metadataChecksummer, documentFetcher, robotsMeta, linkExtractors, preImportProcessors, postImportProcessors}' is expected." and urls without the word Colleague are being committed too.

Alternatively, if there's a way to commit only pdfs...?

Thanks for any help you can provide.

essiembre commented 5 years ago

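The element is named documentFilters, not httpDocumentFilters (see the list of expected elements in your error message), and the filter class here is RegexReferenceFilter. It should rather be: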

<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
      https?://.*/Colleague/.*
  </filter> 
</documentFilters>
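
If the word can appear in either case (the question says "colleague" while the pattern matches "Colleague"), a case-insensitive variant may be closer to what is wanted. This is only a sketch: it assumes the caseSensitive attribute of RegexReferenceFilter is available in the version you run, so please check the class documentation first. Setting it explicitly avoids relying on whatever the default is.

```xml
<documentFilters>
  <!-- Assumption: caseSensitive="false" makes the regular expression ignore case. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include" caseSensitive="false">
      https?://.*/colleague/.*
  </filter>
</documentFilters>
```

As the error message suggests, documentFilters is one of the elements expected directly in the crawler configuration.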

Did you find bad documentation somewhere that suggested your approach?

To commit only PDFs, you can use this instead (or in addition):

<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="include">
      pdf
  </filter>
</documentFilters>
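
If both conditions must hold at once (only PDFs whose URL contains /Colleague/), one option is a single regular expression instead of two separate include filters, since it is safer not to assume that two include filters are combined with AND. A minimal sketch; the pattern is only an illustration and would need adjusting to the actual site:

```xml
<documentFilters>
  <!-- Illustrative pattern: URL contains /Colleague/ and ends with .pdf -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
      https?://.*/Colleague/.*\.pdf
  </filter>
</documentFilters>
```

Note that this pattern would not match PDF URLs that carry a query string after the extension.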

caesetia commented 5 years ago

I think I got that from a response to a Stack Overflow question, but I might also have misunderstood it. Thanks for your help!

Norconex / crawlers

commit only urls that contain specific words #633