Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Ignoring Links #531

Closed: RBBuff closed this issue 5 years ago

RBBuff commented 5 years ago

This project has been very helpful, but I've hit a roadblock that I can't seem to get around. I've been able to configure the crawler to authenticate against a site and then begin crawling. However, once it hits the "Log out" link on the site, I can no longer crawl any of the pages. I also don't have a sitemap to work with.

How would I go about ignoring that specific link?

Thanks

essiembre commented 5 years ago

You can add reference filters to your crawler configuration. The RegexReferenceFilter should do it.

Like this (change the regular expression to match your logout URL):

<referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
      .*/logout/.*
    </filter>
</referenceFilters>

RBBuff commented 5 years ago

Thanks. I've tried this, but it still appears to be hitting the page. Here's the message I see in the output:

INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://domain/site/Log%20Out (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (http://domain/exit)])

I've tried the following filters without success:

<referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
      .*/Log Out/.*
    </filter>
</referenceFilters>
<referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
      .*/Log%20Out/.*
    </filter>
</referenceFilters>

Any suggestions?

essiembre commented 5 years ago

It should have worked. Make sure you clean the working directory first, since the URL may be reprocessed as an "orphan" from a previous run. If you already did this, please attach your config.
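
If orphans keep coming back, you can also tell the crawler explicitly what to do with them. A minimal sketch, assuming your collector version supports the orphansStrategy element (it accepts PROCESS, IGNORE, or DELETE):

<!-- Goes inside the <crawler> element. IGNORE skips previously crawled
     references that are no longer reachable instead of reprocessing them. -->
<orphansStrategy>IGNORE</orphansStrategy>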

RBBuff commented 5 years ago

Thanks, Pascal. I gave it another go this morning with no luck. Below is the config. Thanks again for your time and assistance.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!-- 
   Copyright 2010-2017 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler.  
     -->
<httpcollector id="Minimum Config HTTP Collector">  
  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">
        <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
            <authMethod>form</authMethod>
            <authUsername>someusername</authUsername>
            <authPassword>somepassword</authPassword>

            <authUsernameField>LoginID</authUsernameField>
            <authPasswordField>password</authPasswordField>

            <authURL>https://localhost/login</authURL>
        </httpClientFactory>

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://localhost</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/minimum</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>10</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <robotsTxt ignore="true" />

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
          .*/Log%20Out/.*
        </filter>
      </referenceFilters>

      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer> 

        <committer class="com.norconex.committer.solr.SolrCommitter">
            <solrURL>http://localhost:8983/solr/pa_test</solrURL>
        </committer>
    </crawler>
  </crawlers>

</httpcollector>

RBBuff commented 5 years ago

Here's a little more information that I hope helps nail this down. You can see below that the "Log Out" URL is accessed, and accessing it performs the logout logic. The "Log Out" page then redirects the user to a "You've been logged out" landing page at "http://domain_two/exit". The redirect to the "Exit" page is blocked because "stayOnDomain" is set to true. Ultimately, I need to block the crawler from even attempting to touch the "Log Out" page.

INFO  [AbstractCrawler] Norconex Minimum Test Page: 21% completed (7 processed/32 total)
INFO  [CrawlerEventManager]       REJECTED_REDIRECTED: https://domain.com/Log%20Out (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (http://domain_two/exit)])
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://domain_two/exit (URLCrawlScopeStrategy[stayOnProtocol=true,stayOnDomain=true,stayOnPort=true])

essiembre commented 5 years ago

Your regular expression expects a forward slash (/) after "Log Out", but your URL does not have one: .../Log%20Out vs .../Log%20Out/.
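
For example, dropping the trailing slash (and keeping the URL-encoded space) should match the URL from your log:

<referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
      .*/Log%20Out.*
    </filter>
</referenceFilters>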

RBBuff commented 5 years ago

Thanks, Pascal. You're exactly right! The regex tester I was using gave a false positive. Thanks so much for the help.

Krishna210414 commented 5 years ago

I have a similar question, if someone can help: I want to ignore the ?p pattern in the URL.

I tried RegexReferenceFilter, but it is not working consistently. Could anyone help with this?

I am on version 2.7.1 of Norconex.

RBBuff commented 5 years ago

Does this work for you?

<referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
      .*[?p].*
    </filter>
</referenceFilters>

Krishna210414 commented 5 years ago

Nope, this ignores anything containing a ?, so www.example.com/? is also excluded. I specifically want ?p to be ignored.

RBBuff commented 5 years ago

Try this:

.*[?][p]
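
The earlier pattern used the character class [?p], which matches either ? or p on its own, whereas [?][p] requires the literal sequence ?p. Since the filter is, as far as I know, matched against the full URL, you may also need a trailing .*, along these lines:

<referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
      .*[?]p.*
    </filter>
</referenceFilters>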

Krishna210414 commented 5 years ago

Resolved this by using .\?p_. Thanks for the reply.

sathishavpc commented 5 years ago

Hi Pascal, I am facing an issue with a huge number of hits on png and pdf file downloads when crawling my sites. Even though I have filtered jpg, gif, png, ico, css, js, and pdf files, my crawler keeps hitting all of the filtered file types.

Configuration:

        jpg,gif,png,ico,css,js,pdf
        <filter class="$filterRegexRef">https://www.sample.com/.*</filter>
    </referenceFilters>

FT: 2019-03-26 12:04:26 INFO - REJECTED_TOO_DEEP: https://www.sample.com/download/removedrealpath.pdf

Could you please help me filter out the unwanted file formats?

Note: mainly, I do not want to hit my server for jpg, gif, png, ico, css, js, or pdf files.

essiembre commented 5 years ago

This ticket is closed; please open a new ticket for new issues. See #582