crwlrsoft / crawler

Library for Rapid (Web) Crawler and Scraper Development
https://www.crwlr.software/packages/crawler
MIT License
325 stars 11 forks

Use proxy #99

Closed vladsvd closed 1 year ago

vladsvd commented 1 year ago

Thanks for the great library. I couldn't find a way to use a proxy; is it possible?

otsch commented 1 year ago

Hey @vladsvd 👋 currently it's not possible, but it's on my roadmap for the next months. Side note: since the library uses the PSR-18 ClientInterface, and neither that client nor the PSR-7 request has any functionality dedicated to proxying, it will only be possible with the default Guzzle client or the headless browser.

ruerdev commented 1 year ago

@otsch Hi! Just curious: did you already have a look at this? I will probably need this very soon; otherwise I will start implementing it myself. Thanks :)

otsch commented 1 year ago

Hey @ruerdev 👋 I'm planning to implement it as part of Hacktoberfest 😅 according to the hacktoberfest.com website, it starts in 6 days.

If that's too late for you, you can pass your own guzzle instance with a proxy to your crawler's loader like this:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use GuzzleHttp\Client;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        $httpClient = new Client(['proxy' => 'http://your-proxy/']);

        return new HttpLoader($userAgent, $httpClient);
    }

    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent('Mozilla/5.0 (compatible)');
    }
}

If you want to use multiple proxies and rotate them, I think it'd be best to wait for the implementation. Or would you maybe want to try to implement it in the library? Also: do you want to use it with a normal HTTP client, or with the headless chrome browser?

ruerdev commented 1 year ago

Hi @otsch

Thanks for your quick reply and example! I am actually already using that method, but I do need a bit more control over the proxy that is used. I want to use different proxy locations and rotating proxies for the URLs I crawl on a website. I am using the HTTP client for now, but will also use the headless browser in some cases in the future.

It's fine for me to wait a little bit. I was just checking whether it's something you are willing to spend some time on :)

otsch commented 1 year ago

@ruerdev so, the feature is on its way: https://github.com/crwlrsoft/crawler/pull/120

With that you can then do:

$crawler = HttpCrawler::make()->withUserAgent('MyUserAgent');

$crawler->getLoader()->useProxy('http://127.0.0.1:8001');

// or

$crawler->getLoader()->useRotatingProxies([
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
    'http://127.0.0.1:8003',
]);

useRotatingProxies() iterates through the defined proxies. It works with both the Guzzle HTTP client and the headless Chrome browser. Does that fit your needs?
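To illustrate the "iterates through the defined proxies" behavior, here is a minimal, hypothetical sketch of round-robin proxy selection. This is not the library's actual implementation (the class name and API here are made up for illustration), just the general idea: each request takes the next proxy in the list and wraps around at the end.

```php
<?php

// Hypothetical sketch of round-robin proxy rotation; NOT the actual
// implementation inside crwlr/crawler's HttpLoader.
class RotatingProxyList
{
    private int $index = 0;

    /** @param string[] $proxies */
    public function __construct(private array $proxies) {}

    // Return the next proxy in the list, wrapping around at the end.
    public function next(): string
    {
        $proxy = $this->proxies[$this->index];

        $this->index = ($this->index + 1) % count($this->proxies);

        return $proxy;
    }
}

$proxies = new RotatingProxyList([
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
]);

echo $proxies->next() . PHP_EOL; // http://127.0.0.1:8001
echo $proxies->next() . PHP_EOL; // http://127.0.0.1:8002
echo $proxies->next() . PHP_EOL; // wraps around: http://127.0.0.1:8001
```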

ruerdev commented 1 year ago

@otsch That should work great, thanks! The only suggestion I have to enhance it further is to include the option to retry a request using a different proxy in case of a failure.

otsch commented 1 year ago

🤔 OK, thanks. I'd say this shouldn't be a feature specific to the proxy use case, but a general one. There is already a feature that retries when receiving certain error responses (429 Too Many Requests and 503 Service Unavailable): https://www.crwlr.software/packages/crawler/v1.1/the-crawler/politeness#wait-and-retry

It is handled by the RetryErrorResponseHandler of the HttpLoader: https://github.com/crwlrsoft/crawler/blob/main/src/Loader/Http/Politeness/RetryErrorResponseHandler.php

Maybe we could make this a bit more flexible, so that automatic retries are also possible for all other error response codes (with shorter wait times than for the 429 and 503 responses). Retrying with a different proxy would then happen automatically. But I'll create a new issue for this topic and close this one for now.
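In the meantime, "retry with a different proxy" can be sketched outside the library with plain Guzzle. This is a hypothetical helper, not part of crwlr/crawler: on a failed or error response, it switches to the next proxy in the list and tries again, up to a maximum number of attempts.

```php
<?php

// Hypothetical failover helper built directly on Guzzle; NOT part of
// the crwlr/crawler library.
use GuzzleHttp\Client;
use GuzzleHttp\Exception\GuzzleException;

/**
 * Fetch $url, switching to the next proxy in $proxies after each
 * failed attempt (connection error or 4xx/5xx response).
 *
 * @param string[] $proxies
 */
function fetchWithProxyFailover(string $url, array $proxies, int $maxAttempts = 3): string
{
    $lastException = null;

    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        // Pick the next proxy, wrapping around if there are fewer
        // proxies than attempts.
        $proxy = $proxies[$attempt % count($proxies)];

        $client = new Client([
            'proxy' => $proxy,
            'http_errors' => false, // treat 4xx/5xx as data, not exceptions
            'timeout' => 10,
        ]);

        try {
            $response = $client->get($url);

            if ($response->getStatusCode() < 400) {
                return (string) $response->getBody();
            }
        } catch (GuzzleException $exception) {
            $lastException = $exception; // connection-level failure, try next proxy
        }
    }

    throw new RuntimeException('All proxy attempts failed for ' . $url, 0, $lastException);
}
```

A built-in version could hook the same idea into the existing retry mechanism, so a retry triggered by an error response simply advances the rotating proxy list before re-sending.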