Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Configuration to extract only a certain type of files #485

Closed: javpdiaz closed this issue 6 years ago

javpdiaz commented 6 years ago

I need to extract only certain types of files from a repository, for example .pdf, .ppt, and so on. I am using this configuration, but it does not work.

<httpcollector id="Configuracion HTTP Colector Electrom">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")
  #set($urlFilter = "com.norconex.collector.http.filter.impl.RegexURLFilter")

  <!-- output folders -->
  <progressDir>./electrom-output/progress</progressDir>
  <logsDir>./electrom-output/logs</logsDir>

  <crawlers>
    <crawler id="Configuracion crawler de Electrom">

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://localhost/tocrawl/</url>
      </startURLs>

      <!-- results output directory -->
      <workDir>./electrom-output</workDir>

      <!-- crawl depth -->
      <maxDepth>2</maxDepth>

      <!-- ignore the sitemap so the whole site is not crawled -->
      <sitemapResolverFactory ignore="true" />

      <!-- delay between crawl requests so the server does not
      reject the connection -->
      <delay default="2000" />

      <referenceFilters>
        <!--<filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,svg</filter>-->
        <filter class="$filterExtension">pdf,ppt</filter>
        <!--<filter class="$filterRegexRef">http://localhost/tocrawl/.*</filter>-->
      </referenceFilters>

      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>

      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/electrom</solrURL>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

I know it's a problem with the filters, but I do not know how to fix it.

If I leave it this way, the crawler does work, but it sends many items to Solr that I do not need.

<referenceFilters>
            <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,svg</filter>
            <!--<filter class="$filterExtension">pdf,ppt</filter>-->
            <!--<filter class="$filterRegexRef">http://localhost/tocrawl/.*</filter>-->
</referenceFilters>
essiembre commented 6 years ago

Reference filters take place before a document is downloaded. In your case, if you want to keep PDF and PPT files, the crawler needs to go through a bunch of HTML pages before it gets to those. By keeping only pdf and ppt references, you are not giving it a chance to get there.

There are a few ways around this, but one is to use document filters instead, which take place after documents have been downloaded and their links extracted:

<documentFilters>
    <filter class="$filterExtension">pdf,ppt</filter>
</documentFilters>
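
For context, here is a minimal sketch of how the two filter types could be combined in the crawler above, reusing the $filterExtension macro already defined in your configuration: reference filters exclude asset extensions before download (so HTML pages are still crawled and their links followed), while document filters keep only the PDF and PPT documents afterwards.

<referenceFilters>
  <!-- skip assets that will never lead to the documents you want -->
  <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,svg</filter>
</referenceFilters>

<documentFilters>
  <!-- keep only the downloaded documents you actually want committed -->
  <filter class="$filterExtension">pdf,ppt</filter>
</documentFilters>

This way HTML pages are still fetched and parsed for links, but only the PDF and PPT documents should make it through to the Solr committer.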

You can see the execution flow here.

Let me know if that resolves your issue.

javpdiaz commented 6 years ago

It works perfectly, many thanks.

essiembre commented 6 years ago

Great. Thanks for confirming.