TurnerSoftware / InfinityCrawler

A simple but powerful web crawler library for .NET
MIT License

Allow controlling which links are visited #63

Open YairHalberstadt opened 3 years ago

YairHalberstadt commented 3 years ago

For example, I was thinking of using this library to crawl all the pages of a single site.

This library looks great by the way - much higher quality than any of the other C# crawler libraries I've investigated. Good job!

Turnerj commented 3 years ago

Thanks @YairHalberstadt for the kind words!

Yep, the library can cover your example - give it a URL (the root URL of the site) and it will crawl all the pages on that site. It will only crawl pages on other hosts (e.g. subdomains) if you specifically allow it.

Continuing the example from the readme:

using InfinityCrawler;

var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
    UserAgent = "MyVeryOwnWebCrawler/1.0",
    RequestProcessorOptions = new RequestProcessorOptions
    {
        MaxNumberOfSimultaneousRequests = 5
    },
    HostAliases = new [] { "example.net", "subdomain.example.org" }
});

In that example, the domains "example.net" and "subdomain.example.org" will additionally be crawled if (and only if) links are found to them from "example.org".

YairHalberstadt commented 3 years ago

That's great!

Is there any way to deal with more complex logic? For example, visit all subdomains of this site, but not other sites?

Turnerj commented 3 years ago

Currently there isn't a catch-all for aliases; however, that may be a reasonable future addition - probably a wildcard on the host alias (e.g. "*.example.org"). I've opened #64 to cover adding that feature in a future release.
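
If that lands, usage might look something like the snippet below. The wildcard alias is hypothetical - it only sketches what #64 proposes and is not part of the current release.

// Hypothetical: wildcard host aliases as proposed in #64 - not available today.
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings
{
    UserAgent = "MyVeryOwnWebCrawler/1.0",
    // "*.example.org" would match any subdomain of example.org
    HostAliases = new[] { "*.example.org" }
});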

YairHalberstadt commented 3 years ago

A more general solution might be to accept a Func<Uri, bool> (or whatever) to control which pages are visited.

Turnerj commented 3 years ago

That might be an option; however, full flexibility like that can make simpler cases, such as crawling subdomains, more complex. Being able to write, for example, *.example.org is a lot easier than writing the equivalent logic manually in C#. Going further, I could probably have an allow/block list for paths that also uses wildcards, rather than someone needing to code that themselves too.

Cases where you want to control crawling down to very specific pages - the kind of thing a custom handler would enable - are likely to be quite rare.
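
For comparison, the hand-written logic that a Func<Uri, bool> hook would need for the subdomain case might look roughly like this. The hook itself is hypothetical (CrawlSettings has no such option today); only the predicate body is plain .NET.

using System;

// Hypothetical predicate for a ShouldCrawl-style hook: accept example.org
// and any of its subdomains, reject everything else.
Func<Uri, bool> shouldCrawl = uri =>
    uri.Host.Equals("example.org", StringComparison.OrdinalIgnoreCase)
    || uri.Host.EndsWith(".example.org", StringComparison.OrdinalIgnoreCase);

// Roughly equivalent to the proposed wildcard alias "*.example.org", written by hand.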

Tony20221 commented 1 year ago

I would like to see include/exclude URLs using regular expressions. That would allow handling almost everything.

Turnerj commented 1 year ago

I would like to see include/exclude URLs using regular expressions. That would allow handling almost everything.

Not that I am committing one way or another, but would you want multiple regular expressions for each? Do you want the scheme/host/port separate from the path?

Just want to understand the full scope to achieve a good developer experience. I don't really want lots of repetitive rules, etc.

Tony20221 commented 1 year ago

It would be a list for each. I don't care about port or scheme since public sites are mostly using HTTPS on the standard port these days. Maybe others would find those useful. But since they are part of the URL, and the regex would work off the full URL anyway, it seems to me no extra work is needed.
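
As an illustration only, an include/exclude filter working off full URLs could be sketched as below. The UrlFilter type and how it would plug into CrawlSettings are hypothetical - nothing here is part of the current API.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Example usage: crawl example.org and its subdomains, but skip anything under /private/.
var filter = new UrlFilter();
filter.Include.Add(new Regex(@"^https://([a-z0-9-]+\.)*example\.org/", RegexOptions.IgnoreCase));
filter.Exclude.Add(new Regex(@"/private/"));

// Hypothetical sketch: a URL is crawled if it matches at least one include
// pattern (or no includes are configured) and matches no exclude pattern.
public class UrlFilter
{
    public List<Regex> Include { get; } = new List<Regex>();
    public List<Regex> Exclude { get; } = new List<Regex>();

    public bool ShouldCrawl(Uri uri)
    {
        var url = uri.AbsoluteUri;
        var included = Include.Count == 0 || Include.Any(r => r.IsMatch(url));
        var excluded = Exclude.Any(r => r.IsMatch(url));
        return included && !excluded;
    }
}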