Closed RBBuff closed 6 years ago
This project has been very helpful, but I've got a roadblock that I can't seem to get around. I've been able to configure the crawler to authenticate against a site and then begin to crawl. However, once it hits the "Log out" link that's on the site, I can no longer crawl any of the pages. I also don't have a sitemap to work with.
How would I go about ignoring that specific link?
Thanks
You can add reference filters to your crawler configuration. The RegexReferenceFilter should do it.
Like this (change the regular expression to match your logout URL):
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*/logout/.*
  </filter>
</referenceFilters>
Thanks. I've tried this, but it appears to be still hitting the page. Here's the message I see in the output:
INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://domain/site/Log%20Out (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (http://domain/exit)])
I've tried the following filters without success:
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*/Log Out/.*
  </filter>
</referenceFilters>
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*/Log%20Out/.*
  </filter>
</referenceFilters>
Any suggestions?
It should have worked. Make sure you first clean your working directory, since the URL may be getting reprocessed as an "orphan" from a previous run. If you have already done this, please attach your config.
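If cleaning the working directory between runs is inconvenient, another option (a sketch, assuming Norconex HTTP Collector 2.x, where the crawler accepts an orphansStrategy setting) is to tell the crawler not to reprocess orphans at all:
<crawler id="...">
  <!-- IGNORE leaves documents from a previous run untouched instead of
       reprocessing them when they are no longer reachable in this run. -->
  <orphansStrategy>IGNORE</orphansStrategy>
</crawler>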
Thanks, Pascal. I gave it another go this morning with no luck. Below is the config. Thanks again for your time and assistance.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!--
   Copyright 2010-2017 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler. -->
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
        <authMethod>form</authMethod>
        <authUsername>someusername</authUsername>
        <authPassword>somepassword</authPassword>
        <authUsernameField>LoginID</authUsernameField>
        <authPasswordField>password</authPasswordField>
        <authURL>https://localhost/login</authURL>
      </httpClientFactory>

      <!-- Requires at least one start URL (or urlsFile).
           Optionally limit crawling to same protocol/domain/port as
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://localhost</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/minimum</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>10</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <robotsTxt ignore="true" />

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
          .*/Log%20Out/.*
        </filter>
      </referenceFilters>

      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>

      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/pa_test</solrURL>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>
Here's a little more information that I hope helps nail this down. Below you can see that the "Log Out" URL is accessed, and accessing it performs the logout logic. The "Log Out" page then redirects the user to a "You've been logged out" landing page at http://domain_two/exit. The redirect to the "Exit" page is being blocked because "stayOnDomain" is set to true. Ultimately, I've got to block the crawler from even attempting to touch the "Log Out" page.
INFO [AbstractCrawler] Norconex Minimum Test Page: 21% completed (7 processed/32 total)
INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://domain.com/Log%20Out (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (http://domain_two/exit)])
INFO [CrawlerEventManager] REJECTED_FILTER: http://domain_two/exit (URLCrawlScopeStrategy[stayOnProtocol=true,stayOnDomain=true,stayOnPort=true])
Your regular expression expects a forward slash (/) after "Log Out", but your URL does not have one: .../Log%20Out vs .../Log%20Out/.
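For example, a pattern without the trailing slash should match (a sketch; adjust it if your links encode the space differently):
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*/Log%20Out.*
  </filter>
</referenceFilters>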
Thanks, Pascal. You're exactly right! The regex tester I was using gave a false positive. Thanks so much for the help.
I have a similar question, if someone can help: I want to ignore the ?p pattern in the URL. I tried RegexReferenceFilter, but it's not working consistently. Could anyone help with this?
I am on version 2.7.1 of Norconex.
Does this work for you?
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*[?p].*
  </filter>
</referenceFilters>
Nope, this is ignoring anything with a ?; for example, www.example.com/? is also ignored. I wanted specifically ?p to be ignored.
Try this:
.*[?][p]
Resolved this by using .\?p_. Thanks for the reply.
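For context: [?p] is a character class matching a single character that is either ? or p, which is why .*[?p].* excluded nearly everything. Spelled out in a filter, the resolved approach (a sketch, assuming the goal is to exclude any URL containing ?p) would look like:
<referenceFilters>
  <!-- \? escapes the question mark so only URLs containing a literal "?p" match. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*\?p.*
  </filter>
</referenceFilters>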
Hi Pascal, I am facing an issue with a huge number of hits on PNG and PDF file downloads when crawling my sites. Even though I have filtered out jpg, gif, png, ico, css, js, and pdf files, my crawler keeps hitting all of the filtered file types.
Configuration:
<referenceFilters>
  <filter class="$filterRegexRef">https://www.sample.com/.*</filter>
</referenceFilters>
FT: 2019-03-26 12:04:26 INFO - REJECTED_TOO_DEEP: https://www.sample.com/download/removedrealpath.pdf
Could you please help me filter out the unwanted file formats?
Note: Mainly, I do not want to hit my server for jpg, gif, png, ico, css, js, or pdf files.
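One option worth trying (a sketch, assuming your 2.x version ships the ExtensionReferenceFilter from collector-core; the extension list is illustrative) is to exclude by extension alongside your existing include filter, so matching URLs are never queued or downloaded:
<referenceFilters>
  <filter class="$filterRegexRef">https://www.sample.com/.*</filter>
  <!-- Reject references ending with any of these extensions. -->
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">
    jpg,gif,png,ico,css,js,pdf
  </filter>
</referenceFilters>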
This ticket is closed, please open new tickets for new issues. See #582