apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0
887 stars 262 forks source link

RemoteDriverProtocol Throwing exception in catch and this will lead to misbehaviour of the crawler #1032

Closed msghasan closed 1 year ago

msghasan commented 1 year ago

WHAT

If i have a set of 5 remote phantomjs instances running and if one of them is not working properly this line of code will throw error and my crawler will misbehave and show exceptions in log.

HOW

If we can just handle the exception and not throw it just print an error log so that the crawler do not mismehave.

https://github.com/DigitalPebble/storm-crawler/blob/36a58b5c1b43362a755b73317150a23ba0ad0ddc/core/src/main/java/com/digitalpebble/stormcrawler/protocol/selenium/RemoteDriverProtocol.java#L79

rzo1 commented 1 year ago

If the instances are not ready, it is appropriate to throw an exception IMHO. However, I see two possibilities:

wdyt @jnioche ?

jnioche commented 1 year ago

logging the exception is fine, as long as it is at error level

msghasan commented 1 year ago

@jnioche Can I refactor this class method in two parts the configure method is very big.. can we extract line no 40 t0 59 in a method called getDesiredCapabilities which will return DesiredCapabilities object in the configure().

and line 66 to 80 inside a protected method called setupRemoteDrivers() along with the changes mentioned by @rzo1

jnioche commented 1 year ago

@msghasan I tend to put things in methods only if they are used in more than one place and I don't find the configure here is particularly long, but if you think this adds clarity to the code then why not. As for the changes mentioned by @rzo1 let's just go with (A), Thanks

jnioche commented 1 year ago

Closing because of lack of activity. Please feel free to reopen @msghasan