Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

(Java) Prevent crawling other domains found in the website #762 #763

Closed · nAoxide closed this issue 2 years ago

nAoxide commented 2 years ago

Hi! I looked into your Java examples and found them straightforward, but because there are so few of them, I'm having difficulty getting to exactly what I need to do in my code. I'm trying to do two things: 1- force the crawler to avoid crawling into any domain other than the one I have specified in the crawler config, and 2- access all the URLs the crawler collects during the process. What I'm trying to implement with your library is simply collecting links within the domain I specify in the config. Could you please help me see my way through it?

essiembre commented 2 years ago

Hello @nAoxide,

Which version are you using? I will assume v2.x, even though it should be the same/similar to v3.

1. Restrict to domain:

To force the crawler to remain on the same domain(s) as your start URLs, you can rely on the stayOn... attributes:

<startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="true" stayOnProtocol="true">
    ...
</startURLs>

Using Java, you can set those this way:

HttpCrawlerConfig crawlerConfig = ...  // <-- However you get your crawler config.
URLCrawlScopeStrategy scopeStrategy = crawlerConfig.getURLCrawlScopeStrategy();
scopeStrategy.setStayOnDomain(true);
scopeStrategy.setStayOn...(true);  // <-- Whatever other settings you want to enable

More information in HttpCrawlerConfig.

The above is usually the simplest approach. If you need more flexibility, you can also use reference filters. With Java:

ReferenceFilter filter = new ReferenceFilter(TextMatcher.regex(".*/whatever/.*"));
filter.setOnMatch(OnMatch.INCLUDE);
crawlerConfig.setReferenceFilters(Arrays.asList(filter));

2. Get URL listing:

Storing what gets crawled for whatever purpose is usually the job of Committers. Do you care to get the content and other fields or just a list of URLs? Assuming the latter, there are a few ways to do this.

If you just want a text file with all URLs, you could use a file-system committer, like the JSONFileCommitter or the XMLFileCommitter. V3 also has a CSVCommitter.
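For example, in Java, attaching one of those committers could look roughly like this (a minimal sketch for v2; the package names and the setDirectory(...) setter are assumptions, so verify them against your version):

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
import com.norconex.committer.core.impl.JSONFileCommitter; // assumed v2 package

public final class JsonCommitterSetup {
    static void addJsonCommitter(HttpCrawlerConfig crawlerConfig) {
        JSONFileCommitter committer = new JSONFileCommitter();
        // Hypothetical output directory where the JSON files get written.
        committer.setDirectory("/some/output/path/");
        crawlerConfig.setCommitter(committer);
    }
}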

To only capture the URL and nothing else, you can eliminate the rest in your <importer> configuration with these two:

<transformer class="com.norconex.importer.handler.transformer.impl.SubstringTransformer" end="0"/>
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
  <fields>document.reference</fields>
</tagger>

Alternatively, you can skip the Committer and Importer configuration altogether and instead log all URLs and their crawl status using the URLStatusCrawlerEventListener:

<listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
  <outputDir>/report/path/</outputDir>
  <fileNamePrefix>url-status</fileNamePrefix>
</listener>
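If you are configuring the crawler in Java rather than XML, the equivalent should look roughly like this (a sketch; the setOutputDir and setFileNamePrefix setters are assumed from the XML options above):

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
import com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener;

public final class UrlStatusListenerSetup {
    static void addUrlStatusListener(HttpCrawlerConfig crawlerConfig) {
        URLStatusCrawlerEventListener listener = new URLStatusCrawlerEventListener();
        listener.setOutputDir("/report/path/");      // same as <outputDir>
        listener.setFileNamePrefix("url-status");    // same as <fileNamePrefix>
        crawlerConfig.setCrawlerListeners(listener); // register on the crawler config
    }
}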

Let me know if this provides you with enough guidance.

nAoxide commented 2 years ago

Thank you for your response! About the second point, "Get URL listing": all I need is to get the scraped URLs in-app, without storing them anywhere on disk, and I do not need to store their contents either. I will try to do it using URLStatusCrawlerEventListener, but I would still appreciate any help I could get from you. I also have a question: when I want to crawl multiple sites, would adding each one of them to a separate CrawlerConfig be any different from adding all of them to one CrawlerConfig?

essiembre commented 2 years ago

You can have as many URLs as you want in the same crawler config. The stayOn... configuration option will apply to each one individually.
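For instance, a single config with several start URLs could be set up like this (a sketch with hypothetical site URLs; I am assuming the v2 setStartURLs(String...) setter):

import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public final class MultiSiteSetup {
    static void setStartUrls(HttpCrawlerConfig crawlerConfig) {
        // Each start URL is scoped to its own domain when stayOnDomain is enabled.
        crawlerConfig.setStartURLs(
                "https://site-one.example.com/",
                "https://site-two.example.com/");
    }
}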

If you only care to get URLs programmatically, there are also a few ways. One where you have ultimate control would be to implement a crawler event listener just like the URLStatusCrawlerEventListener does. I recommend you look at it for inspiration. Every time a URL comes in you can either cache it in memory, or process it on the fly. You can do what you want with it at that point really.
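As a rough sketch, such a listener could cache every processed URL in memory like this (the package locations, the DOCUMENT_FETCHED constant, and the getCrawlData().getReference() accessor are assumptions based on the v2 API, so adjust to the events you actually care about):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import com.norconex.collector.core.crawler.ICrawler;
import com.norconex.collector.core.crawler.event.CrawlerEvent;
import com.norconex.collector.core.crawler.event.ICrawlerEventListener;

public class UrlCollectingListener implements ICrawlerEventListener {

    // Crawlers are typically multi-threaded, so use a thread-safe list.
    private final List<String> urls =
            Collections.synchronizedList(new ArrayList<>());

    @Override
    public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
        // Keep only the events you care about (constant name assumed).
        if (CrawlerEvent.DOCUMENT_FETCHED.equals(event.getEventType())) {
            urls.add(event.getCrawlData().getReference());
        }
    }

    public List<String> getUrls() {
        return urls;
    }
}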

If you do not care about the content (other than for extracting links from HTML files), I recommend you do not bother adding a Committer, and that you also add a document filter (IDocumentFilter) to save yourself a significant amount of content processing.

nAoxide commented 2 years ago

I have already implemented a new URLStatusCrawlerEventListener as follows:

ICrawlerEventListener listener = new URLStatusCrawlerEventListener() {
    @Override
    public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
        //myLogic
    }
};
crawlerConfig.setCrawlerListeners(listener);

And I was able to collect the URLs I need. I hope I'm doing it the right way!

That would be all I need to know in order to finish my project. Thank you very much for such a helpful and time-saving repository! Your work saved me a lot of time.

essiembre commented 2 years ago

I am glad you like it. You will like v3.0 even more when it is released.

There are no default Committers. I.e., if you do not specify any then it is the same as "disabling" them.

For logging statements, version 2 uses Log4j. You can find a log4j.properties file in your installation directory and control what gets logged through it. You can set the log levels to ERROR everywhere to only get notified of problems, or set them to OFF to disable logging entirely.
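If you would rather control that from your own code than edit log4j.properties, the Log4j 1.x API lets you adjust levels programmatically, for example:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public final class QuietLogging {
    static void quietDown() {
        // Only let errors through; use Level.OFF to silence logging entirely.
        Logger.getRootLogger().setLevel(Level.ERROR);
    }
}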

nAoxide commented 2 years ago

I have edited my last comment. Please review it again if you may, Mr. Essiembre.

essiembre commented 2 years ago

There is no single class doing all the logging; it is done everywhere. Eventually, it all goes through Log4j, so you can look at extending that library if you want to intercept or manipulate logging.

If you just want to be notified of certain events, ICrawlerEventListener is also the way to go. With it, you can capture a bunch of events, some of which are crawler lifecycle events. The crawler event that should tell you when a crawler is done is CRAWLER_FINISHED.

If you have multiple crawlers defined in the same collector and want to know when the entire Collector is done, you can do the equivalent on the collector using ICollectorLifeCycleListener. That listener interface has an onCollectorFinish method.
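As an illustration, both hooks could look something like this (a sketch; the package locations and the onCollectorStart counterpart are assumptions based on the v2 collector-core API):

import com.norconex.collector.core.ICollector;
import com.norconex.collector.core.ICollectorLifeCycleListener;
import com.norconex.collector.core.crawler.ICrawler;
import com.norconex.collector.core.crawler.event.CrawlerEvent;
import com.norconex.collector.core.crawler.event.ICrawlerEventListener;

public class CompletionListeners {

    // Fires when a single crawler is done.
    public static class CrawlerDoneListener implements ICrawlerEventListener {
        @Override
        public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
            if (CrawlerEvent.CRAWLER_FINISHED.equals(event.getEventType())) {
                System.out.println("A crawler finished.");
            }
        }
    }

    // Fires when the entire collector (all crawlers) is done.
    public static class CollectorDoneListener implements ICollectorLifeCycleListener {
        @Override
        public void onCollectorStart(ICollector collector) {
            // Nothing to do here.
        }
        @Override
        public void onCollectorFinish(ICollector collector) {
            System.out.println("All crawlers finished.");
        }
    }
}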

nAoxide commented 2 years ago

Thank you very much! Now I'm all done. I really appreciate your time and your interaction with my issue.