Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

commit only urls that contain specific words #633

Closed (caesetia closed this issue 5 years ago)

caesetia commented 5 years ago

Hi, I'm trying to crawl a page and commit only URLs that have the word "colleague" in them. I tried adding the following to the minimum-config file:

<httpDocumentFilters>
  <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter"
          onMatch="include">
    http://.*/Colleague/.*
  </filter>
</httpDocumentFilters>

But when I run the crawler, I get this error:

"(XML) HttpCrawlerConfig: cvc-complex-type.2.4.a: Invalid content was found starting with element 'httpDocumentFilters'. One of '{numThreads, maxDocuments, stopOnExceptions, orphansStrategy, referenceFilters, metadataFilters, documentFilters, crawlerListeners, crawlDataStoreFactory, documentChecksummer, committer, spoiledReferenceStrategizer, keepDownloads, keepOutOfScopeLinks, userAgent, urlNormalizer, httpClientFactory, robotsTxt, redirectURLProvider, recrawlableResolver, metadataFetcher, canonicalLinkDetector, metadataChecksummer, documentFetcher, robotsMeta, linkExtractors, preImportProcessors, postImportProcessors}' is expected."

URLs without the word "Colleague" are being committed too.

Alternatively, is there a way to commit only PDFs?

Thanks for any help you can provide.

essiembre commented 5 years ago

It should be this instead:

<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    https?://.*/Colleague/.*
  </filter>
</documentFilters>
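
To see which URLs an expression like the one above would let through, it can be tried outside the crawler first. The following is only a rough sketch with plain `java.util.regex`, not the collector's own code: the class name `ColleagueUrlCheck`, the method `isIncluded`, and the `example.com` URLs are made up for illustration, and it assumes the filter needs the whole URL to match. How the filter itself handles letter case may differ from this sketch, so check its documentation.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: tests URLs against the filter's expression.
class ColleagueUrlCheck {
    // Same expression as in the RegexReferenceFilter configuration.
    static final Pattern INCLUDE = Pattern.compile("https?://.*/Colleague/.*");

    // Assumption: the whole URL must match, which is what Matcher.matches() checks.
    static boolean isIncluded(String url) {
        return INCLUDE.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Made-up sample URLs for illustration only.
        List<String> urls = List.of(
            "https://example.com/Colleague/handbook.pdf",
            "http://example.com/docs/Colleague/index.html",
            "https://example.com/colleague/handbook.pdf",
            "https://example.com/about.html");
        for (String url : urls) {
            System.out.println(isIncluded(url) + "  " + url);
        }
    }
}
```

In this sketch the first two URLs match. The third does not, because plain Java regex is case-sensitive and the path uses a lowercase "colleague". The fourth has no `/Colleague/` segment at all.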

Did you find bad documentation somewhere that suggested your approach?

To commit only PDFs, you can use this instead (or in addition):

<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="include">
    pdf
  </filter>
</documentFilters>

caesetia commented 5 years ago

I think I got that from a response to a Stack Overflow question, but I might have misunderstood it. Thanks for your help!