Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Configuration to extract only a certain type of files #485

Closed: javpdiaz closed this issue 6 years ago

javpdiaz commented 6 years ago

I need to extract only certain types of files from a repository, for example .pdf, .ppt, and so on. I am using this configuration, but it does not work.

<httpcollector id="Configuracion HTTP Colector Electrom">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")
  #set($urlFilter = "com.norconex.collector.http.filter.impl.RegexURLFilter")

  <!-- output folders -->
  <progressDir>./electrom-output/progress</progressDir>
  <logsDir>./electrom-output/logs</logsDir>

  <crawlers>
    <crawler id="Configuracion crawler de Electrom">

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://localhost/tocrawl/</url>
      </startURLs>

      <!-- results output directory -->
      <workDir>./electrom-output</workDir>

      <!-- crawl depth -->
      <maxDepth>2</maxDepth>

      <!-- ignore the sitemap so the whole site is not crawled -->
      <sitemapResolverFactory ignore="true" />

      <!-- delay between crawl requests so the server does not
      reject the connection -->
      <delay default="2000" />

      <referenceFilters>
        <!--<filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,svg</filter>-->
        <filter class="$filterExtension">pdf,ppt</filter>
        <!--<filter class="$filterRegexRef">http://localhost/tocrawl/.*</filter>-->
      </referenceFilters>

      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>

      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/electrom</solrURL>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

I know it's a problem with the filters, but I do not know how to fix it.

If I leave it this way, the crawler does work, but it sends many items to Solr that I do not need.

<referenceFilters>
            <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,svg</filter>
            <!--<filter class="$filterExtension">pdf,ppt</filter>-->
            <!--<filter class="$filterRegexRef">http://localhost/tocrawl/.*</filter>-->
</referenceFilters>
essiembre commented 6 years ago

Reference filters take place before a document is downloaded. In your case, if you want to keep PDF and PPT files, the crawler needs to go through a bunch of HTML pages before it gets to those. By keeping only pdf and ppt references, you are not giving it a chance to get there.

There are a few ways around this, but one is to use document filters instead, which take place after documents have been downloaded and their links extracted:

<documentFilters>
    <filter class="$filterExtension">pdf,ppt</filter>
</documentFilters>
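
For context, here is a minimal sketch of how the two filter types could be combined in the crawler above, reusing the $filterExtension macro already defined in your configuration: reference filters exclude asset extensions before download (so HTML pages are still crawled and their links followed), while document filters keep only the PDF and PPT documents afterwards.

<referenceFilters>
  <!-- skip assets that will never lead to the documents you want -->
  <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,svg</filter>
</referenceFilters>

<documentFilters>
  <!-- keep only the downloaded documents you actually want committed -->
  <filter class="$filterExtension">pdf,ppt</filter>
</documentFilters>

This way HTML pages are still fetched and parsed for links, but only the PDF and PPT documents should make it through to the Solr committer.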

You can see the execution flow here.

Let me know if that resolves your issue.

javpdiaz commented 6 years ago

It works perfectly, many thanks.

essiembre commented 6 years ago

Great. Thanks for confirming.