OpenBuildings / spiderling

Browse html pages with php, selenium and phantomjs
https://github.com/OpenBuildings/spiderling
BSD 3-Clause "New" or "Revised" License

Ability to pass in user agent header, connection timeout etc.? #2

Open giorgio79 opened 10 years ago

giorgio79 commented 10 years ago

Hello,

I would like to pass in a user agent, connection timeout, etc. with the various drivers. Perhaps also check whether robots.txt allows spidering. Such options can be handled well in curl; I am not sure about the other drivers.

Re RequestFactory: https://github.com/OpenBuildings/spiderling/blob/3f2da1a3bc6b8a7b48639ce159e3668ae65e10b8/src/Openbuildings/Spiderling/Driver/Simple/RequestFactory/HTTP.php

ivank commented 10 years ago

Hi, nice to see some interest in this library :) It was mainly developed to facilitate testing, not crawling, so I didn't really have those concerns. All the drivers already support setting user_agent, so that's one thing crossed off your list. You can easily add a method to pass arbitrary curl options in the class you referenced, and make a pull request out of it. Connection timeout and robots.txt checking could also be added to the other drivers, but that's work that I don't really have time to do ATM, sorry. I would very much appreciate pull requests, though.
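
For illustration, passing arbitrary curl options could look roughly like the sketch below. This uses PHP's curl extension directly and is not Spiderling's actual code: the function names `build_curl_options` and `fetch`, the default values, and the `ExampleBot/1.0` user agent are all assumptions made for the example.

```php
<?php
// Hypothetical helper, not part of Spiderling: merges caller-supplied
// curl options over a few assumed defaults.
function build_curl_options(string $url, array $extra = []): array
{
    $defaults = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_USERAGENT      => 'ExampleBot/1.0', // assumed default user agent
        CURLOPT_CONNECTTIMEOUT => 5,                // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 30,               // seconds allowed for the whole request
    ];

    // array_replace() keeps the integer keys (the CURLOPT_* constants) intact;
    // array_merge() would renumber them.
    return array_replace($defaults, $extra);
}

// Hypothetical wrapper that performs the request with the merged options.
function fetch(string $url, array $extra = []): string
{
    $handle = curl_init();
    curl_setopt_array($handle, build_curl_options($url, $extra));

    $body = curl_exec($handle);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($handle));
    }

    return $body;
}
```

A caller could then override any default, for example `fetch($url, [CURLOPT_USERAGENT => 'MyBot/2.0', CURLOPT_CONNECTTIMEOUT => 2])`.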

giorgio79 commented 10 years ago

Thanks!

> nice to see some interest in this library :)

Yes, Spiderling is awesome.

> Connection timeout and robots.txt checking could also be added to other drivers, but that's work that I don't really have time to do ATM, sorry. I will be very appreciative of pull requests though.

Meanwhile, I notice there are plenty of robots.txt classes on GitHub... I might just throw something together and run with it.
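
As a rough idea of what such a throwaway robots.txt check might involve, here is a minimal sketch. It is deliberately simplified (plain prefix matching, no `*` or `$` wildcard support, no fetching or caching), and the function names `parse_robots_rules` and `is_path_allowed` are hypothetical, not taken from Spiderling or from any existing robots.txt class.

```php
<?php
// Simplified robots.txt handling: prefix matching only, no wildcard support.
// Function names are illustrative and not part of Spiderling.

// Returns the list of [type, path] rules that apply to $userAgent.
function parse_robots_rules(string $robotsTxt, string $userAgent): array
{
    $groups = [];         // lowercased agent token => list of [type, path]
    $currentAgents = [];
    $lastWasAgent = false;

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines share one group of rules.
            if (!$lastWasAgent) {
                $currentAgents = [];
            }
            $currentAgents[] = strtolower($value);
            $lastWasAgent = true;
        } elseif ($field === 'allow' || $field === 'disallow') {
            foreach ($currentAgents as $agent) {
                $groups[$agent][] = [$field, $value];
            }
            $lastWasAgent = false;
        }
        // Other fields (e.g. Sitemap) are ignored.
    }

    // Prefer a group whose token appears in the user agent; fall back to "*".
    $ua = strtolower($userAgent);
    foreach ($groups as $agent => $rules) {
        $agent = (string) $agent;
        if ($agent !== '*' && $agent !== '' && strpos($ua, $agent) !== false) {
            return $rules;
        }
    }

    return $groups['*'] ?? [];
}

// Decides whether $path may be fetched under the given rules.
function is_path_allowed(array $rules, string $path): bool
{
    $allowed = true;
    $bestLength = -1;

    foreach ($rules as [$type, $rulePath]) {
        if ($rulePath === '') {
            continue; // an empty Disallow means "no restriction"
        }
        $length = strlen($rulePath);
        if (strncmp($path, $rulePath, $length) !== 0) {
            continue; // rule does not match this path
        }
        // The longest matching rule wins; Allow wins a tie.
        if ($length > $bestLength || ($length === $bestLength && $type === 'allow')) {
            $bestLength = $length;
            $allowed = ($type === 'allow');
        }
    }

    return $allowed;
}
```

The robots.txt body would still need to be downloaded separately, and a real crawler would likely want one of the existing, more complete parsers instead of this sketch.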