Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

Avoid duplicate pages #163

Closed: comschmid closed this issue 9 years ago

comschmid commented 9 years ago

First let me thank you for this wonderful piece of software!

I am using 2.3.0-SNAPSHOT and would like to avoid duplicate pages like http://example.com and http://example.com/.

So I tried to configure <urlNormalizer> with the following normalizations: lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort, removeFragment, removeDotSegments, addTrailingSlash, removeDuplicateSlashes, removeSessionIds, upperCaseEscapeSequence.

I hoped that the combination of addTrailingSlash and removeDuplicateSlashes would help to solve this issue, but unfortunately this produced ClientProtocolExceptions like 'URI does not specify a valid host name: http:/www.etools.ch//robots.txt' or the following log entry:

ERROR - StandardSitemapResolver - Cannot fetch sitemap: http:/www.etools.ch///sitemap_index.xml

(three wrong slashes at two places)

Removing the addTrailingSlash normalization solves this issue, but this configuration option should behave predictably.

It would also be nice if http://example.com/ and https://example.com/ were reduced to only the HTTPS version, assuming the metadata and content MD5 checksums are equal.

BTW: is it also possible to retrieve external references, even with the setting <startURLs stayOnDomain="true">? I tried to get these external references with the crawler event type "URLS_EXTRACTED", but I only see domain-specific references in the corresponding HashSet. I don't want to crawl the external references, just record them.

essiembre commented 9 years ago

Thanks for the good words!

About the URL normalization ClientProtocolException, it should work as you expect. I tried with both http://example.com and http://www.etools.ch/sitemap_index.xml and they both worked. Can you share the exact URL causing the issue? That would help reproduce. A copy of your config would be nice.

About http vs https, you can use the "secureScheme" normalization rule and it will convert all http into https.
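
For illustration, a minimal sketch of adding that rule to the normalizations you already use (keep or drop the other rules as needed):

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>secureScheme, lowerCaseSchemeHost, removeDefaultPort, removeFragment, removeSessionIds</normalizations>
</urlNormalizer>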

About the URLs extracted, you are right that only those matching your stayOnDomain="true" setting will be recorded. The stayOn[...] attributes mostly exist for convenience. If you need more flexibility, I recommend using reference filters such as the RegexReferenceFilter to stay on your site instead. That way all the external URLs are stored the way you want, since reference filters are triggered after the URL extraction step, which is what stores the extracted URLs.
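
For example, something along these lines (the regex is only an illustration; adjust it to your own site):

<startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
  <url>http://www.etools.ch</url>
</startURLs>
<!-- ... other crawler settings ... -->
<referenceFilters>
  <!-- Keep crawling within the site; external URLs are still extracted
       (and thus visible to URLS_EXTRACTED) before this filter rejects them. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">https?://(www\.)?etools\.ch(/.*)?</filter>
</referenceFilters>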

comschmid commented 9 years ago

Thank you for your support!

A part of the configuration is done in code (mainly the directories and the seed URLs), but I added the relevant start URL to the crawler for illustration purposes; please see below:

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2014 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->

<!-- This self-documented configuration file is meant to be used as a reference
     or starting point for a new configuration. 
     It contains all core features offered in this release.  Sometimes
     multiple implementations are available for a given feature. Refer 
     to site documentation for more options and complete description of 
     each feature.
     -->
<httpcollector id="test collector">

  <!-- Variables: Optionally define variables in this configuration file
       using the "set" directive, or by using a file of the same name
       but with the extension ".variables" or ".properties".  Refer 
       to site documentation to find out what each extension does.
       Finally, one can pass an optional properties file when starting the
       crawler.  The following is good practice to reference frequently 
       used classes in a shorter way.
       -->

  #set($core      = "com.norconex.collector.core")
  #set($http      = "com.norconex.collector.http")
  #set($committer = "com.norconex.committer")
  #set($importer  = "com.norconex.importer")

  #set($httpClientFactory = "${http}.client.impl.GenericHttpClientFactory")
  #set($filterExtension   = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef    = "${core}.filter.impl.RegexReferenceFilter")
  #set($filterRegexMeta   = "${core}.filter.impl.RegexMetadataFilter")
  #set($urlFilter         = "${http}.filter.impl.RegexURLFilter")
  #set($robotsTxt         = "${http}.robot.impl.StandardRobotsTxtProvider")
  #set($robotsMeta        = "${http}.robot.impl.StandardRobotsMetaProvider")
  #set($metaFetcher       = "${http}.fetch.impl.GenericMetadataFetcher")
  #set($docFetcher        = "${http}.fetch.impl.GenericDocumentFetcher")
  #set($linkExtractor     = "${http}.url.impl.GenericLinkExtractor")
  #set($canonLinkDetector = "${http}.url.impl.GenericCanonicalLinkDetector")
  #set($urlNormalizer     = "${http}.url.impl.GenericURLNormalizer")
  #set($sitemapFactory    = "${http}.sitemap.impl.StandardSitemapResolverFactory")
  #set($metaChecksummer   = "${http}.checksum.impl.LastModifiedMetadataChecksummer")
  #set($docChecksummer    = "${core}.checksum.impl.MD5DocumentChecksummer")
  #set($dataStoreFactory  = "${core}.data.store.impl.mapdb.MapDBCrawlDataStoreFactory")
  #set($spoiledStrategy   = "${core}.spoil.impl.GenericSpoiledReferenceStrategizer")

  <!-- Location where internal progress files are stored. -->
  <!-- <progressDir>defined/in/code</progressDir>-->

  <!-- Location where logs are stored. -->
  <!-- <logsDir>defined/in/code</logsDir>-->

  <!-- All crawler configuration options can be specified as default 
       (including start URLs).  Settings defined here will be inherited by 
       all individual crawlers defined further down, unless overwritten.
       If you replace a top level crawler tag from the crawler defaults,
        all the default tag configuration settings will be replaced; no 
        attempt will be made to merge or append.
       -->
  <crawlerDefaults>

    <!-- Mandatory starting URL(s) where crawling begins.  If you put more 
         than one URL, they will be processed together.  You can also
         point to one or more URLs files (i.e., seed lists).  -->    
    <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
      <!-- <url>defined/in/code</url> -->
      <!-- <urlsFile>defined/in/code</urlsFile> -->
    </startURLs>

    <!-- Identify yourself to sites you crawl.  It sets the "User-Agent" HTTP 
         request header value.  This is how browsers identify themselves for
         instance.  Sometimes required to be certain values for robots.txt 
         files.
      -->
    <userAgent>Mozilla/5.0 (compatible; TestCrawler/0.1; +http://www.comcepta.com)</userAgent>

    <!-- Optional URL normalization feature. The class must implement
         com.norconex.collector.http.url.IURLNormalizer, 
         like the following class does.
      -->
    <urlNormalizer class="$urlNormalizer">
            <normalizations>lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort, removeFragment, removeDotSegments, addTrailingSlash, removeDuplicateSlashes, removeSessionIds, upperCaseEscapeSequence</normalizations> 
      <replacements>
        <replace>
           <match>&amp;view=print</match>
           <replacement>&amp;view=html</replacement>
        </replace>
      </replacements>
    </urlNormalizer>

    <!-- Optional delay resolver defining how polite or aggressive you want
         your crawling to be.  The class must implement 
         com.norconex.collector.http.delay.IDelayResolver.
         The following is the default implementation:
         scope="[crawler|site|thread]"
      -->

    <delay default="1000" ignoreRobotsCrawlDelay="false" scope="site" class="${http}.delay.impl.GenericDelayResolver">
          <!-- <schedule dayOfWeek="from Monday to Friday" time="from 8:00 to 6:30">5000</schedule> -->
    </delay>

    <!-- How many threads you want a crawler to use.  Regardless of how many
         threads you have running, the frequency of each URL being invoked
         will remain dictated by the &lt;delay/&gt; option above.  Using more
         than one thread is a good idea to ensure the delay is respected
         in case you run into single downloads taking more time than the
         delay specified. Default is 2 threads.
      -->
    <numThreads>2</numThreads>

    <!-- How many levels deep the crawler can go, i.e., within how many clicks 
         away from the main page (start URL) a page can be to be considered.
         Beyond the depth specified, pages are rejected.
         The starting URLs all have a zero-depth.  Default is -1 (unlimited)
         -->
    <maxDepth>5</maxDepth>

    <!-- Stop crawling after how many successfully processed documents.  
         A successful document is one that is either new or modified, was 
         not rejected, not deleted, and did not generate any error.  As an
         example, this is a document that will end up in your search engine. 
         Default is -1 (unlimited)
          -->
    <maxDocuments>-1</maxDocuments>

    <!-- Crawler "work" directory.  This is where files downloaded or created as
         part of crawling activities (besides logs and progress) get stored.
         It should be unique to each crawler.
         -->
    <!-- <workDir>defined/in/code</workDir> -->

    <!-- Keep downloaded files. Default is false.
         -->
    <keepDownloads>false</keepDownloads>

    <!-- What to do with orphan documents.  Orphans are valid 
         documents, which on subsequent crawls can no longer be reached when 
         running the crawler (e.g. there are no links pointing to that page 
         anymore).  Available options are: 
         IGNORE (default), DELETE, and PROCESS.
         -->
    <orphansStrategy>IGNORE</orphansStrategy>

    <!-- One or more optional listeners to be notified on various crawling
         events (e.g. document rejected, document imported, etc). 
         Class must implement 
         com.norconex.collector.core.event.ICrawlerEventListener
         -->
    <crawlerListeners>
      <!-- <listener class="defined.in.code"/> -->
    </crawlerListeners>

    <!-- Factory class creating a database for storing crawl status and
         other information.  Classes must implement 
         com.norconex.collector.core.data.store.ICrawlURLDatabaseFactory.  
         Default implementation is the following.
         -->
    <crawlDataStoreFactory class="$dataStoreFactory" />

    <!-- Initialize the HTTP client used to make connections.  Classes
         must implement com.norconex.collector.http.client.IHttpClientFactory.
         Default implementation offers many options. The following shows
         a sample use of the default with credentials.
         -->
    <httpClientFactory class="$httpClientFactory">
        <ignoreExternalRedirects>true</ignoreExternalRedirects>
      <!-- These apply to any authentication mechanism -->
      <!--
      <authUsername>myusername</authUsername>
      <authPassword>mypassword</authPassword>
      -->
      <!-- These apply to FORM authentication -->
      <!--
      <authUsernameField>field_username</authUsernameField>
      <authPasswordField>field_password</authPasswordField>
      <authURL>https://www.example.com/login.php</authURL>
      -->
      <!-- These apply to both BASIC and DIGEST authentication -->
      <!--
      <authHostname>www.example.com</authHostname>
      <authPort>80</authPort>
      <authRealm>PRIVATE</authRealm>
      -->
    </httpClientFactory>

    <!-- Optionally filter URLs BEFORE any download. Classes must implement 
         com.norconex.collector.core.filter.IReferenceFilter, 
         like the following examples.
         -->
    <referenceFilters>
        <!-- Exclude images, CSS and JS files -->
      <filter class="$filterExtension" onMatch="exclude" caseSensitive="false">jpg,jpeg,gif,png,ico,css,js</filter>
      <!-- Exclude dynamic pages containing a query (?) -->
      <filter class="$filterRegexRef" onMatch="exclude" caseSensitive="true">.+\?.*</filter>
    </referenceFilters>

    <!-- Filter BEFORE download with RobotsTxt rules. Classes must
         implement *.robot.IRobotsTxtProvider.  Default implementation
         is the following.
         -->
    <robotsTxt ignore="false" class="$robotsTxt" />

    <!-- Loads sitemap.xml URLs and adds them to the URLs to process -->
    <sitemap ignore="false" lenient="true" class="$sitemapFactory" />

    <!-- Fetches a URL's HTTP headers.  Classes must implement
         com.norconex.collector.http.fetch.IHttpMetadataFetcher.  
         The following is a simple implementation.
         -->
    <metadataFetcher class="$metaFetcher">
      <validStatusCodes>200</validStatusCodes>
    </metadataFetcher>

    <!-- Optionally filter AFTER download of HTTP headers.  Classes must 
         implement com.norconex.collector.core.filter.IMetadataFilter.  
         -->
    <metadataFilters>
      <!-- Do not index content-type of CSS or JavaScript -->
      <filter class="$filterRegexMeta" onMatch="exclude" caseSensitive="false" field="Content-Type">.*css.*|.*javascript.*</filter>
    </metadataFilters>        

    <!-- Generates a checksum value from document headers to find out if 
         a document has changed. Class must implement
         com.norconex.collector.core.checksum.IMetadataChecksummer.  
         Default implementation is the following. 
         -->
    <metadataChecksummer disabled="false" keep="false" targetField="collector.checksum-metadata" class="$metaChecksummer" />

        <!-- Detect canonical links. Classes must implement
         com.norconex.collector.http.url.ICanonicalLinkDetector.
         Default implementation is the following.
         -->
    <canonicalLinkDetector ignore="false" class="${canonLinkDetector}">
      <contentTypes>
          text/plain, text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
      </contentTypes>
    </canonicalLinkDetector>

    <!-- Fetches document.  Class must implement 
         com.norconex.collector.http.fetch.IHttpDocumentFetcher.  
         Default implementation is the following.
         -->
    <documentFetcher class="$docFetcher">
        <validStatusCodes>200</validStatusCodes>
      <notFoundStatusCodes>404</notFoundStatusCodes>
    </documentFetcher>

    <!-- Establishes whether to follow a page's URLs or to index a given page
         based on in-page meta tag robot information. Classes must implement 
         com.norconex.collector.http.robot.IRobotsMetaProvider.  
         Default implementation is the following.
         -->
    <robotsMeta ignore="false" class="$robotsMeta" />

    <!-- Extract links from a document.  Classes must implement
         com.norconex.collector.http.url.ILinkExtractor. 
         Default implementation is the following.
         -->
    <linkExtractors>
        <extractor class="${linkExtractor}" ignoreExternalLinks="false" maxURLLength="2048" ignoreNofollow="false" keepReferrerData="true" keepFragment="false">
            <contentTypes>text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp</contentTypes>
            <tags>
                <tag name="a" attribute="href" />
                <tag name="frame" attribute="src" />
                <tag name="iframe" attribute="src" />
                <tag name="img" attribute="src" />
                <tag name="meta" attribute="http-equiv" />
                <tag name="base" attribute="href" />
                <tag name="form" attribute="action" />
                <tag name="link" attribute="href" />
                <tag name="input" attribute="src" />
            </tags>
        </extractor>
    </linkExtractors>

    <!--  Optionally filters a document. Classes must implement 
          com.norconex.collector.core.filter.IDocumentFilter-->
    <documentFilters>
        <!-- <filter class="YourClass" />  -->
    </documentFilters>

    <!-- Optionally process a document BEFORE importing it. Classes must
         implement com.norconex.collector.http.doc.IHttpDocumentProcessor.
         -->
    <preImportProcessors>
        <!-- <processor class="YourClass"></processor> -->
    </preImportProcessors>

    <!-- Import a document.  This step calls the Importer module.  The
         importer is a different module with its own set of XML configuration
         options.  Please refer to Importer for complete documentation.
         Below gives you an overview of the main importer tags.
         -->
     <importer>
        <!-- 
        <tempDir>defined/in/code</tempDir>
        <maxFileCacheSize></maxFileCacheSize>
        <maxFilePoolCacheSize></maxFilePoolCacheSize>
        <parseErrorsSaveDir>defined/in/code</parseErrorsSaveDir>
         -->

        <preParseHandlers>
              <tagger class="${importer}.handler.tagger.impl.DocumentLengthTagger" field="document.size.preparse" />

            <!-- These tags can be mixed, in the desired order of execution. -->
                <!-- 
            <tagger class="..." />
            <transformer class="..." />
            <filter class="..." />
            <splitter class="..." />   -->     
        </preParseHandlers>

        <!-- <documentParserFactory class="..." /> -->
      <postParseHandlers>
        <!-- These tags can be mixed, in the desired order of execution. -->

        <!-- follow HTML meta-equiv redirects without indexing original page -->
        <filter class="${importer}.handler.filter.impl.RegexMetadataFilter" onMatch="exclude" property="refresh">.*</filter>

        <!-- Collapse spaces and line feeds -->
        <transformer class="${importer}.handler.transformer.impl.ReduceConsecutivesTransformer" caseSensitive="true">
          <reduce>\s</reduce>
          <reduce>\n</reduce>
          <reduce>\r</reduce>
          <reduce>\t</reduce>
          <reduce>\n\r</reduce>
          <reduce>\r\n</reduce>
          <reduce>\s\n</reduce>
          <reduce>\s\r</reduce>
          <reduce>\s\r\n</reduce>
          <reduce>\s\n\r</reduce>
        </transformer>

        <!-- Remove CSS -->
        <transformer class="${importer}.handler.transformer.impl.ReplaceTransformer" caseSensitive="false">
          <replace>
            <fromValue>class=".*?"</fromValue>
             <toValue></toValue>
          </replace>
        </transformer>
        <transformer class="${importer}.handler.transformer.impl.StripBetweenTransformer" inclusive="true" >
          <stripBetween>
            <start>&lt;style.*?&gt;</start>
            <end>&lt;/style&gt;</end>
          </stripBetween>
          <stripBetween>
            <start>&lt;script.*?&gt;</start>
            <end>&lt;/script&gt;</end>
          </stripBetween>
        </transformer>

        <tagger class="${importer}.handler.tagger.impl.DocumentLengthTagger" field="document.size.postparse" />

        <!-- Reject small documents (<100 Bytes)-->
                <filter class="${importer}.handler.filter.impl.NumericMetadataFilter" onMatch="exclude" field="document.size.postparse" >
                <condition operator="lt" number="100" />
                </filter>

        <!-- Rename fields -->
        <tagger class="${importer}.handler.tagger.impl.RenameTagger">
           <rename fromField="Keywords" toField="keywords" overwrite="true" />
        </tagger>
        <tagger class="${importer}.handler.tagger.impl.DeleteTagger">
            <fields>Keywords</fields>
        </tagger>

        <!-- Unless you configured Solr to accept ANY fields, it will fail
             when you try to add documents. Keep only the metadata fields provided, delete all other ones. -->
        <tagger class="${importer}.handler.tagger.impl.LanguageTagger" shortText="false" keepProbabilities="false" fallbackLanguage="" />
        <tagger class="${importer}.handler.tagger.impl.KeepOnlyTagger">
          <fields>content, title, keywords, description, tags, collector.referrer-reference, collector.depth, document.reference, document.language, document.size.preparse, document.size.postparse</fields>
        </tagger>
      </postParseHandlers>

      <!--
      <responseProcessors>
        <responseProcessor class="..." />
        </responseProcessors>
        -->
    </importer>

    <!-- Create a checksum out of a document to figure out if a document
         has changed, AFTER it has been imported. Class must implement 
         com.norconex.collector.core.checksum.IDocumentChecksummer.
         Default implementation is the following.
         -->
    <documentChecksummer class="$docChecksummer" disabled="false" keep="false" targetField="collector.checksum-doc">
        <sourceFields>content, title, keywords, description</sourceFields>
    </documentChecksummer>

    <!-- Optionally process a document AFTER importing it. Classes must
         implement com.norconex.collector.http.doc.IHttpDocumentProcessor.
         -->
    <postImportProcessors>
        <!-- <processor class="YourClass"></processor>  -->
    </postImportProcessors>

         <!-- Decide what to do with references that have turned bad.
         Class must implement 
         com.norconex.collector.core.spoil.ISpoiledReferenceStrategizer.
         Default implementation is the following.
         -->
    <spoiledReferenceStrategizer class="$spoiledStrategy" fallbackStrategy="DELETE">
        <mapping state="NOT_FOUND"  strategy="DELETE" />
        <mapping state="BAD_STATUS" strategy="GRACE_ONCE" />
        <mapping state="ERROR"      strategy="GRACE_ONCE" />
    </spoiledReferenceStrategizer>

    <!-- Commits a document to a data source of your choice.
         This step calls the Committer module.  The
         committer is a different module with its own set of XML configuration
         options.  Please refer to committer for complete documentation.
         Below is an example using the FileSystemCommitter.
         -->
    <!--
    <committer class="${committer}.core.impl.FileSystemCommitter">
      <directory>${workDir}/crawledFiles</directory>
    </committer>
    -->

    <committer class="com.norconex.committer.solr.SolrCommitter">
      <solrURL>http://localhost:8983/solr/test1</solrURL>
      <solrUpdateURLParams>
        <!-- multiple param tags allowed -->
        <!--<param name="(parameter name)">(parameter value)</param> -->
      </solrUpdateURLParams>
      <commitDisabled>false</commitDisabled>
      <sourceReferenceField keep="false">
         <!-- Optional name of field that contains the document reference, when 
         the default document reference is not used.  The reference value
         will be mapped to Solr "id" field, or the "targetReferenceField" 
         specified.
         Once re-mapped, this metadata source field is 
         deleted, unless "keep" is set to true. -->
         document.reference
      </sourceReferenceField>
      <targetReferenceField>
         <!-- Name of the Solr target field where to store a document's unique 
         identifier (idSourceField).  If not specified, default is "id". -->
         id 
      </targetReferenceField>
      <!--
      <sourceContentField keep="[false|true]">
         (If you wish to use a metadata field to act as the document 
         "content", you can specify that field here.  Default 
         does not take a metadata field but rather the document content.
         Once re-mapped, the metadata source field is deleted,
         unless "keep" is set to true.)
      </sourceContentField>
      -->
      <targetContentField>
         <!-- Solr target field name for a document content/body.
          Default is: content -->
          content
      </targetContentField>
      <!-- <queueDir>defined/in/code</queueDir> -->
      <queueSize>10</queueSize>
      <commitBatchSize>5</commitBatchSize>
      <maxRetries>1</maxRetries>
      <maxRetryWait>5000</maxRetryWait>
  </committer>

  </crawlerDefaults>

  <!-- Individual crawlers can be defined here.  All crawler default
       configuration settings will apply to all crawlers created unless 
       explicitly overwritten in crawler configuration.
       For configuration options where multiple items can be present 
       (e.g. filters), the whole list defined in crawler defaults will be
       overwritten.
       Since the options are the same as the defaults above, the documentation 
       is not repeated here.
       The only difference from "crawlerDefaults" is the addition of the "id"
       attribute on the crawler tag.  The "id" attribute uniquely identifies
       each of your crawlers.  
       -->
  <crawlers>
        <crawler id="test crawler">
        <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
            <url>http://www.etools.ch</url>
        </startURLs>
    </crawler>
  </crawlers>

</httpcollector>

Since I posted the config.xml, another question concerning taggers comes to mind: as you can see, I use a Solr committer and limit the fields to the ones defined in the KeepOnlyTagger. Unfortunately, some test documents contain a field named 'Keywords' instead of 'keywords', and because Solr schema fields are case-sensitive, Solr rejects such documents. The (ugly) workaround is to define a RenameTagger and a DeleteTagger, but have I missed a better way to simply lowercase all fields configured in the KeepOnlyTagger?

> About http vs https, you can use the "secureScheme" normalization rule and it will convert all http into https.

Does this also work with sites supporting only http? My idea was more like: if we fetched the same page with both the http and https scheme, only the https version should be indexed.

Thanks for the tip with the RegexReferenceFilter instead of enabling the stayOnDomain attribute. In general it works, but what I now see in the log are sporadic errors or warnings about external references, e.g. erroneous attempts to download robots.txt or cookie warnings. It seems that with only the RegexReferenceFilter in place, external references don't get imported (they get rejected), but the robots.txt of the external reference is still fetched. Can I avoid that?

essiembre commented 9 years ago

Forward-slashes being messed up: I was able to reproduce with your config. I will investigate.

Field names of mixed character cases: The CharacterCaseTagger takes care of changing the character case of field values to your liking, but not field names. It should be easy to allow the same thing for field names. I created this feature request: #166.

Favor one scheme when both exist for the same page: There is no way to do this right now, since it would require a second pass through all crawled URLs once crawling is complete and all URLs have been found. But by then, unless you create a committer that sends its documents only at the very end of a crawl, your documents will already have been sent to Solr (we could send deletion requests, but it would be an ugly hack). Your best bet is to review the site you crawl, check whether all http pages can also be accessed via https, and if so simply force the normalization to https. Otherwise, I suspect implementing this feature could take some time. One idea would be that for each http page, the crawler could optionally check whether an https version exists, but that would mean an extra call every time and is not ideal. You can create a new issue for this one and make it a feature request if you really see the benefit, but I am not guaranteeing anything on that one yet.

comschmid commented 9 years ago

Thanks for your investigation!

The CharacterCaseTagger looks like the right place to do that.

Concerning the preference of https vs. http: There is really no need to create a feature request for this specific feature, I just asked in case I had overlooked something. I think I will first check whether the seed URL is available over https; if so, chances are good that the rest of the pages are also available over https.

One last question: I realized that the Solr committer creates a lot of commit-queue subdirectories ordered by date and time. After a successful run, all directories are empty but stay there. Is there a reason for that, or can I simply delete these directories afterwards? For a huge run, it would probably be better to delete each directory after a successful commit. Maybe this feature could be added to the AbstractBatchCommitter or AbstractMappedCommitter?

essiembre commented 9 years ago

There is a new snapshot with the fix to the URLNormalizer as well as the character case tagger (see ticket #166). Please test and report.

For the Solr Committer, it is supposed to delete empty directories that are old enough (i.e., that no longer risk being written into again). I suppose the last ones written before the crawler ends are never deleted. You can certainly delete them when crawling is over. I suggest you open a separate issue for this, either in the Solr Committer project or the Committer Core one.

comschmid commented 9 years ago

Now the URLNormalizer no longer produces invalid URLs, which was sometimes the problem when specifying addTrailingSlash.

Concerning the duplicate pages, I can still see both versions, e.g. http://www.etools.ch vs. http://www.etools.ch/, but only when using http://www.etools.ch as the start URL. I can easily change the start URL myself, which fixes this corner case.

After investigating the deletion process of old commit-queue directories, I have to admit that I didn't refresh or wait long enough: the empty directories do get deleted after about 10 seconds.

essiembre commented 9 years ago

Maybe the name of the rule is not self-explanatory enough. It will not add a forward slash to EVERY URL, but only to those whose last segment "looks like" a directory. It is mentioned in the doc here.

So think of it as addTrailingSlashToURLsThatLooksLikeDirectories. :-)
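
For instance (just an illustration of that heuristic, not exact output): a URL such as http://www.example.com/docs would typically be normalized to http://www.example.com/docs/, while http://www.example.com/page.html, whose last segment looks like a file, would be left unchanged.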

Since it is otherwise working for you, I will close this issue. Re-open it or create a new one if you run into another problem related to this.