apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

HttpProtocol (both okhttp and apache) race condition while having different proxies in different threads #1247

Open chhsiao90 opened 3 months ago

chhsiao90 commented 3 months ago

What kind of issue is this?

Reproduce steps

To reproduce it, run the HttpProtocol main function against many URLs with MultiProxyManager configured.

The crawler.conf:

config:
  http.agent.name: test
  http.proxy.manager: org.apache.stormcrawler.proxy.MultiProxyManager
  http.proxy.file: proxies
  http.robots.file.skip: true

The proxies file:

http://first:password@proxy1:8888
http://second:password@proxy2:8888
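
For readers unfamiliar with the proxy line format, here is a minimal sketch of how one such line breaks down into scheme, credentials, host and port. It uses only `java.net.URI`; the class `ProxyLineSketch` and its `ProxyEntry` holder are hypothetical names for illustration and are not StormCrawler's own parsing code.

```java
import java.net.URI;

/** Hypothetical helper for illustration; not StormCrawler's own proxy parser. */
final class ProxyLineSketch {

    /** Plain value holder for the parts of one proxy line. */
    static final class ProxyEntry {
        final String scheme;
        final String host;
        final int port;
        final String user;      // null when the line carries no credentials
        final String password;  // null when the line carries no credentials

        ProxyEntry(String scheme, String host, int port, String user, String password) {
            this.scheme = scheme;
            this.host = host;
            this.port = port;
            this.user = user;
            this.password = password;
        }
    }

    /** Splits a line such as scheme://user:password@host:port into its parts. */
    static ProxyEntry parse(String line) {
        URI uri = URI.create(line.trim());
        String user = null;
        String password = null;
        String userInfo = uri.getUserInfo();
        if (userInfo != null) {
            int colon = userInfo.indexOf(':');
            user = colon < 0 ? userInfo : userInfo.substring(0, colon);
            password = colon < 0 ? null : userInfo.substring(colon + 1);
        }
        return new ProxyEntry(uri.getScheme(), uri.getHost(), uri.getPort(), user, password);
    }

    public static void main(String[] args) {
        ProxyEntry p = parse("http://first:password@proxy1:8888");
        System.out.println(p.user + " -> " + p.host + ":" + p.port);
    }
}
```

The point for this issue is that each proxy carries its own credentials, so a request sent through proxy2 must authenticate as `second`, never as `first`.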

Root cause

The HttpProtocol implementations (both okhttp and apache) are not thread-safe.
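
The issue does not show the offending code, so the following is only a generic sketch of the kind of race that produces "wrong proxy auth" symptoms. It assumes a client that stores the proxy chosen for a request in an instance field shared by all threads. All names (`ProxyRaceSketch`, `SharedStateClient`, `PerRequestClient`) are hypothetical and none of this is StormCrawler code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustration only: hypothetical classes, not StormCrawler's HttpProtocol. */
final class ProxyRaceSketch {

    static final class Proxy {
        final String host;
        final String user;
        Proxy(String host, String user) { this.host = host; this.user = user; }
    }

    interface Client {
        /** Returns "host auth=user" describing what the simulated request used. */
        String fetch(String url, Proxy proxy);
    }

    /** Unsafe: per-request state lives in a field that every thread overwrites. */
    static final class SharedStateClient implements Client {
        private Proxy currentProxy;

        public String fetch(String url, Proxy proxy) {
            currentProxy = proxy;   // step 1: remember the proxy for this request
            Thread.yield();         // another thread may overwrite the field here
            // step 2: authenticate with whatever the field holds now
            return proxy.host + " auth=" + currentProxy.user;
        }
    }

    /** Safer: per-request state stays in locals, so threads cannot interfere. */
    static final class PerRequestClient implements Client {
        public String fetch(String url, Proxy proxy) {
            final Proxy chosen = proxy;
            Thread.yield();
            return chosen.host + " auth=" + chosen.user;
        }
    }

    /** Counts requests whose credentials did not match the proxy they went through. */
    static int countMismatches(Client client, int threads, int requestsPerThread) {
        List<Proxy> proxies = List.of(new Proxy("proxy1", "first"), new Proxy("proxy2", "second"));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final Proxy mine = proxies.get(t % proxies.size());
                results.add(pool.submit(() -> {
                    int bad = 0;
                    for (int i = 0; i < requestsPerThread; i++) {
                        String seen = client.fetch("http://example.com/" + i, mine);
                        if (!seen.equals(mine.host + " auth=" + mine.user)) {
                            bad++;
                        }
                    }
                    return bad;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The shared-state count varies from run to run; the per-request count is always 0.
        System.out.println("shared state: " + countMismatches(new SharedStateClient(), 8, 100_000));
        System.out.println("per request:  " + countMismatches(new PerRequestClient(), 8, 100_000));
    }
}
```

Whether the real fix keeps state per request, per thread, or builds a client per proxy is a design decision for the PR; the sketch only shows why shared mutable proxy state is unsafe.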

Example 1 (wrong proxy auth)

Example 2 (wrong proxy used)

jnioche commented 3 months ago

Thanks @chhsiao90, are you able to suggest a fix for it?

chhsiao90 commented 3 months ago

Sure, I can open a PR for it.