Open giorgio79 opened 10 years ago
Hi, nice to see some interest in this library :) It was mainly developed to facilitate testing, not crawling, so I didn't really have those concerns. All the drivers already support setting user_agent, so that's one thing crossed off your list. You can easily add a method to pass arbitrary curl options in the class you referenced, and make a pull request out of it. Connection timeout and robots.txt checking could also be added to the other drivers, but that's work that I don't really have time to do ATM, sorry. I will be very appreciative of pull requests though.
Thanks!
> nice to see some interest in this library :)
Yes, Spiderling is awesome.
> Connection timeout and robots.txt checking could also be added to other drivers, but that's work that I don't really have time to do ATM, sorry. I will be very appreciative of pull requests though.
Meanwhile, I notice there are plenty of robots.txt classes on GitHub... I might just throw something together and run with it.
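For reference, the kind of robots.txt check discussed here can be sketched in a few lines. This is only an illustration in Python using the standard library's `urllib.robotparser`, not anything Spiderling provides; the crawler name `ExampleCrawler/1.0`, the sample rules, and the helper names `build_robots_checker` and `may_fetch` are all made up for the example.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


# Hypothetical rules and crawler name, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

checker = build_robots_checker(SAMPLE_ROBOTS_TXT)
allowed_home = may_fetch(
    checker, "ExampleCrawler/1.0", "https://example.com/index.html"
)
allowed_private = may_fetch(
    checker, "ExampleCrawler/1.0", "https://example.com/private/data.html"
)
```

A crawler would run a check like this before each request and skip any URL for which it returns false. A PHP port would need one of the existing robots.txt parser classes in place of `RobotFileParser`.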
Hello,
I would like to pass in the user agent, connect timeout, etc. with the various drivers. Perhaps also check whether robots.txt allows spidering. Such options can be handled well in curl; I am not sure about the other drivers.
Re RequestFactory: https://github.com/OpenBuildings/spiderling/blob/3f2da1a3bc6b8a7b48639ce159e3668ae65e10b8/src/Openbuildings/Spiderling/Driver/Simple/RequestFactory/HTTP.php
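To make the request concrete, here is a rough sketch of the pattern I have in mind: caller-supplied options merged over defaults before the request is built. It is written in Python with the standard library purely as an illustration and is not the RequestFactory API; `DEFAULT_OPTIONS`, `build_request`, and the option names `user_agent` and `connect_timeout` are hypothetical.

```python
from urllib.request import Request

# Hypothetical defaults; a real driver would choose its own.
DEFAULT_OPTIONS = {
    "user_agent": "ExampleCrawler/1.0",
    "connect_timeout": 10,  # seconds
}


def build_request(url, options=None):
    """Merge caller options over the defaults and build a request.

    Returns the prepared Request and the timeout that should be passed
    to urllib.request.urlopen(request, timeout=...).
    """
    merged = {**DEFAULT_OPTIONS, **(options or {})}
    request = Request(url, headers={"User-Agent": merged["user_agent"]})
    return request, merged["connect_timeout"]


request, timeout = build_request(
    "https://example.com/",
    {"user_agent": "MySpider/0.1", "connect_timeout": 3},
)
default_request, default_timeout = build_request("https://example.com/")
```

In a curl-based factory the same merge would presumably end in a call to `curl_setopt_array`, with `CURLOPT_USERAGENT` and `CURLOPT_CONNECTTIMEOUT` among the merged options.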