Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How do I filter out SVG and other image files? #350

Closed dkh7m closed 7 years ago

dkh7m commented 7 years ago

I'm very new to Norconex and am trying to configure it to crawl a site and add it to an existing Solr index. I've got a lot of issues, but I'll start with this one. When I run the crawler, it is including SVG files, which I don't want it to do. I only want to crawl web pages and documents.

I have this line in my <crawler> section, but it is not stopping the collector from adding SVG files:

<filter onMatch="exclude" class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"> js,jpg,gif,png,svg,ico,css </filter>

I'm assuming I either have this line in the wrong place, or maybe I need to use a different filter. Any help would be most appreciated.

essiembre commented 7 years ago

Hello Kyle, we'll do our best to help you get started. Where do you have this config line? Can you share your config to help reproduce? Do you get validation errors in your logs? If not, I suspect what you configured above is OK.

Do you have the URL of an example SVG file? Does it have the .svg extension? Else, you may have to filter them some other way (relying on URL pattern, content-type, etc.).
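For instance, a URL-pattern filter can reject anything served from an images path, whatever the extension. A rough sketch using the RegexReferenceFilter from the same package (the pattern here is only an illustration):

<referenceFilters>
    <!-- Reject any reference whose URL contains an /images/ path segment. -->
    <filter onMatch="exclude"
        class="com.norconex.collector.core.filter.impl.RegexReferenceFilter">
        .*/images/.*
    </filter>
</referenceFilters>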

dkh7m commented 7 years ago

Hi Pascal,

Thanks for the quick response. My config file is below. I've been cobbling together bits and pieces from other posts here along with the minimum example to try and get it working. Initially I was able to get content into my Solr index with the minimum example, but once I started trying to extend things to fit our production Solr schema, that's where it went off the rails.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./crawler-output/uvahealthPlone/progress</progressDir>
  <logsDir>./crawler-output/uvahealthPlone/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://hit.healthsystem.virginia.edu</url>
      </startURLs>

      <filter onMatch="exclude" class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter">
        js,jpg,gif,png,svg,ico,css
      </filter>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./crawler-output/uvahealthPlone</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>10</maxDepth>

      <numThreads>2</numThreads>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="1000" />

      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,Title,keywords,dockeywords,description,Description,document.reference,FullURL</fields>
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
              <rename fromField="title" toField="Title" overwrite="true" />
              <rename fromField="keywords" toField="dockeywords" overwrite="true" />
              <rename fromField="description" toField="Description" overwrite="true" />
              <rename fromField="document.reference" toField="FullURL" overwrite="true" />
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logContent="false" logLevel="INFO" />
        </postParseHandlers>
      </importer>
      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./crawler-output/uvahealthPlone/crawledFiles</directory>
      </committer>
      <committer class="com.norconex.committer.solr.SolrCommitter">
          <solrURL>http://localhost:8983/solr/#/uvahealthPlone/</solrURL>
          <solrUpdateURLParams>
             <param name="commit">true</param>
          </solrUpdateURLParams>
          <commitDisabled>false</commitDisabled>
          <sourceReferenceField keep="false">committer.reference</sourceReferenceField>
          <targetReferenceField>UID</targetReferenceField>
          <sourceContentField keep="false">content</sourceContentField>
          <targetContentField>searchableText</targetContentField>
          <queueSize>1</queueSize>
          <commitBatchSize>1</commitBatchSize>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>

Here's an example of one of the SVG files that's being indexed. It's an image that is part of our site's theme.

http://hit.healthsystem.virginia.edu/default/includes/themes/Healthsystem%20Information%20Technology/images/home/icon-home-scrollup.svg

I'm not getting any validation errors and am seeing results in the output folder for my file system committer.

To give a little more background on the overall project: we have three websites (built in the Plone CMS) that we have indexed via a Solr plugin specific to that CMS. We have another site, built in the Mura CMS (ColdFusion based), that we need to add to the existing Solr index. The problem I'm facing is that the fields indexed from Plone and from Mura are the same, but they follow different naming conventions. What I need to do is not only index the Mura site, but also, during collecting, importing, or committing, rename the indexed fields to match those in the current Solr schema.

I don't have any Solr or Java background, so everything I've done so far has been through lots of Googling and trial-and-error. So I greatly appreciate any help you can offer. Thanks!

essiembre commented 7 years ago

Your XML configuration has a few problems. You should have validation errors to that effect on the first line of your logs, unless you are using an older version. What version of the HTTP Collector are you using?

Errors detected:

Your <filter> tag needs to be within <referenceFilters> tags, like this:

<referenceFilters>
    <filter class="..." />
    ...
</referenceFilters>
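In your case, the same filter simply gets wrapped:

<referenceFilters>
    <filter onMatch="exclude" class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter">
        js,jpg,gif,png,svg,ico,css
    </filter>
</referenceFilters>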

You cannot define multiple committers like you are doing. Luckily, there is a MultiCommitter whose job is to dispatch to multiple committers, like this:

<committer class="com.norconex.committer.core.impl.MultiCommitter">
    <committer class="(committer class)">
        (Committer-specific configuration here)
    </committer>
    <committer class="(committer class)">
        (Committer-specific configuration here)
    </committer>
    ...
</committer>
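With your two committers, it would look roughly like this (each nested committer keeps its own settings):

<committer class="com.norconex.committer.core.impl.MultiCommitter">
    <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./crawler-output/uvahealthPlone/crawledFiles</directory>
    </committer>
    <committer class="com.norconex.committer.solr.SolrCommitter">
        (your Solr settings here)
    </committer>
</committer>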

I recommend you have a look at this page for proper XML syntax.

Also, if you want to make sure your Collector does not run if there are configuration errors, add the -k (or --checkcfg) argument to the command line when you start it.
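For example, assuming the standard launch script that ships with the HTTP Collector (adjust the script name and path to your install):

collector-http.sh -a start -c /path/to/your-config.xml -k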

dkh7m commented 7 years ago

Thanks for the tips. As I said, I've been cherry-picking code from here and there, so it's no surprise there are a few bugs. I'll clean up my XML and give it another go.

dkh7m commented 7 years ago

I reconfigured the XML based on your recommendations this morning and the collector is working. Thanks! It is no longer indexing SVG files. I do have some other issues, but since the issue this ticket was created for is resolved I will mark it closed.

dkh7m commented 7 years ago

Well, rats. I thought this was resolved, but while running the crawler I saw that SVG files are still being added to my Solr index even though the filter appears to be working. I have multiple lines in my log file like this:

HIT Crawler: 2017-05-24 10:23:51 INFO -           REJECTED_FILTER: http://hit.healthsystem.virginia.edu/default/includes/themes/Healthsystem Information Technology/images/logo-uvahs-266999.svg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=js,jpg,gif,png,svg,ico,css,caseSensitive=false])

Is there anything I need to add to the committer to prevent Solr from indexing files that should be excluded by the crawler? Adding my updated XML for reference:

<httpcollector id="HIT Collector">

    <progressDir>./crawler-output/uvahealthPlone/progress</progressDir>
    <logsDir>./crawler-output/uvahealthPlone/logs</logsDir>

    <crawlerDefaults>
        <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
            <url>http://hit.healthsystem.virginia.edu</url>
        </startURLs>
        <delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver" default="5 seconds" ignoreRobotsCrawlDelay="true" scope="site" >
            <schedule dayOfWeek="from Saturday to Sunday">1 second</schedule>
        </delay>
        <numThreads>2</numThreads>
        <maxDepth>10</maxDepth>
        <maxDocuments>-1</maxDocuments>
        <workDir>./crawler-output/uvahealthPlone</workDir>
        <keepDownloads>false</keepDownloads>
        <orphansStrategy>DELETE</orphansStrategy>
        <crawlerListeners>
            <listener  
              class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
              <statusCodes>404</statusCodes>
              <outputDir>./crawler-output/uvahealthPlone</outputDir>
              <fileNamePrefix>brokenLinks</fileNamePrefix>
            </listener>
        </crawlerListeners>
        <referenceFilters>
          <filter onMatch="exclude" class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter">
            js,jpg,gif,png,svg,ico,css
          </filter>
        </referenceFilters>
        <robotsTxt ignore="true" />
        <!-- Sitemap since 2.3.0: -->
        <sitemapResolverFactory ignore="true" />
        <metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />
        <documentFetcher  
          class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
          detectContentType="true" detectCharset="true">
            <validStatusCodes>200</validStatusCodes>
            <notFoundStatusCodes>404</notFoundStatusCodes>
        </documentFetcher>

        <importer>
            <preParseHandlers>
                <!-- These tags can be mixed, in the desired order of execution. -->  
                <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
                    <rename fromField="title" toField="Title" overwrite="true" />
                    <rename fromField="keywords" toField="dockeywords" overwrite="true" />
                    <rename fromField="description" toField="Description" overwrite="true" />
                    <rename fromField="document.reference" toField="FullURL" overwrite="true" />
                </tagger>
                <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                    <fields>title,Title,keywords,dockeywords,description,Description,document.reference,FullURL</fields>
                </tagger>  
            </preParseHandlers>

            <postParseHandlers>
                <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
                    <rename fromField="title" toField="Title" overwrite="true" />
                    <rename fromField="keywords" toField="dockeywords" overwrite="true" />
                    <rename fromField="description" toField="Description" overwrite="true" />
                    <rename fromField="Content-Location" toField="FullURL" overwrite="true" />
                </tagger>
                <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                    <fields>title,Title,keywords,dockeywords,description,Description,document.reference,FullURL</fields>
                </tagger>      
            </postParseHandlers>     
        </importer>

        <committer class="com.norconex.committer.core.impl.MultiCommitter">
              <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
                <directory>./crawler-output/uvahealthPlone/crawledFiles</directory>
              </committer>
              <committer class="com.norconex.committer.solr.SolrCommitter">
                  <solrURL>http://localhost:8983/solr/uvahealthPlone/</solrURL>
                  <solrUpdateURLParams>
                    <param name="update"></param>
                    <param name="commit">true</param>
                  </solrUpdateURLParams>
                  <queueSize>10</queueSize>
                  <commitBatchSize>10</commitBatchSize>
              </committer>
        </committer>
    </crawlerDefaults>

    <crawlers>
        <crawler id="HIT Crawler">
        </crawler>
    </crawlers>

</httpcollector>

essiembre commented 7 years ago

REJECTED_FILTER means your filter is working and the rejected document is not committed to Solr.

Assuming you wiped out your Solr index, what may have happened is that items crawled before you made the change were still sitting in your committer queue. Here is what you can try:

With the Solr Committer, files are first queued on the filesystem before they are sent to Solr. Add this tag to it, specifying a unique path:

<queueDir>/a/path/where/files/are/queued</queueDir>
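Placed in your config, it would look like this (the "solrQueue" folder name is just an example):

<committer class="com.norconex.committer.solr.SolrCommitter">
    <solrURL>http://localhost:8983/solr/uvahealthPlone/</solrURL>
    <!-- Unique path where documents are queued before being sent to Solr. -->
    <queueDir>./crawler-output/uvahealthPlone/solrQueue</queueDir>
    ...
</committer>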

Then, every time you make important config changes, or you are in doubt whether they are taking effect, make sure you start fresh by deleting your workDir: mainly the "progress" and "crawlstore" folders. Also delete the contents of the queueDir you added above. Optionally, wipe out your Solr index content as well before you try again.
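For example, using the paths from your config (the last line assumes the solrQueue example above):

rm -rf ./crawler-output/uvahealthPlone/progress
rm -rf ./crawler-output/uvahealthPlone/crawlstore
rm -rf ./crawler-output/uvahealthPlone/solrQueue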

After these changes you should no longer see the SVG files being committed to Solr.