fredwu / crawler

A high performance web crawler / scraper in Elixir.

High performance: Store gets flooded when too many pages are crawled #28

Closed: happysalada closed this 12 months ago

happysalada commented 4 years ago

If I try to launch 100 pages to be crawled (each with depth 5), after a bit the Store process gets flooded and starts dropping messages:

2020-02-10 16:07:42.374 [debug] "Failed to fetch https://mystays.rwiths.net/r-withs/tfi0020a.do?GCode=mystays&ciDateY=2020&ciDateM=02&ciDateD=10&coDateY=2020&coDateM=02&coDateD=11&s1=0&s2=0&y1=0&y2=0&y3=0&y4=0&room=1&otona=4&hotelNo=38599&dataPattern=PL&cdHeyaShu=t2&planId=4114101&f_lang=ja, reason: checkout_timeout"
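For reference, this is roughly how I'm launching the crawls (a simplified sketch; the seed file name is made up, `max_depths` is the option from the crawler README):

```elixir
# Simplified sketch (file name is hypothetical): read 100 seed URLs
# and kick off a crawl for each, with depth 5.
"seeds.txt"
|> File.read!()
|> String.split("\n", trim: true)
|> Enum.each(fn url ->
  Crawler.crawl(url, max_depths: 5)
end)
```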

The upside of having a Registry is that the store is global, so the crawler can be run from multiple machines. The downside is that this single process becomes a bottleneck for high performance. Would you be open to using mnesia (a fast, distributed, in-memory db)? If you don't need the distributed part, I would use an ETS table for the store, which should be able to handle more load.
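For illustration, a minimal sketch of what an ETS-backed store could look like (the module name and functions here are hypothetical, not the library's actual Crawler.Store API):

```elixir
defmodule Crawler.Store.ETS do
  # Hypothetical ETS-backed store sketch. A public table with
  # concurrent reads/writes avoids funnelling every lookup through
  # a single process.

  @table :crawler_store

  def init do
    :ets.new(@table, [:set, :public, :named_table,
                      read_concurrency: true, write_concurrency: true])
  end

  # Returns true only for the first caller to add the URL, so
  # duplicate pages are skipped without a GenServer round trip.
  def add(url), do: :ets.insert_new(@table, {url, nil})

  def find(url) do
    case :ets.lookup(@table, url) do
      [{^url, page}] -> page
      [] -> nil
    end
  end
end
```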

The other part of the solution is to break up the crawling of all those URLs rather than sending them all at the same time.
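Something along these lines, for example (a sketch; the chunk size and pacing interval are arbitrary numbers I made up):

```elixir
# Hypothetical pacing sketch: crawl the seeds in small chunks instead
# of firing all 100 at once, so the connection pool isn't exhausted.
# `seed_urls` is the list of 100 URLs from my earlier example.
seed_urls
|> Enum.chunk_every(10)
|> Enum.each(fn batch ->
  Enum.each(batch, &Crawler.crawl(&1, max_depths: 5))
  # Crude pacing between batches; a real fix would track pool usage.
  Process.sleep(5_000)
end)
```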

Let me know if you are open to this; I'm happy to put up a tentative PR.

fredwu commented 4 years ago

Hi, if you could issue a PR that would be awesome! 👍

happysalada commented 4 years ago

The actual problem comes from HTTPoison, the underlying library for making the requests. The checkout_timeout failure means that the connection pool for making the requests is being flooded: edgurgel/httpoison#359

happysalada commented 4 years ago

Checking how the library works: by using a GenServer.cast in the worker (https://github.com/fredwu/crawler/blob/master/lib/crawler/worker.ex#L20) all the requests are asynchronous, but since the hackney pool size is limited, the workers won't find an available connection and requests will fail.

The surprising thing here is that the HTTP errors are logged as debug messages. Shouldn't they appear as errors, or at least warnings? (Just wondering.) The other surprising behavior is that the user has to figure out, by looking at the logs, the proper rate limiting to employ so that requests don't fail with a connection pool error. Perhaps the HTTPoison calls could be made configurable here, so you can pass options to use a particular pool? (Not sure what the ideal approach would be, or if you agree with my reasoning.)
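For example, something like this (a sketch; the pool name and limits are arbitrary, but :hackney_pool.start_pool/2 and HTTPoison's hackney: option are documented APIs):

```elixir
# Start a dedicated hackney pool with a larger connection limit and
# checkout timeout (name and numbers here are arbitrary).
:ok = :hackney_pool.start_pool(:crawler_pool,
        timeout: 15_000, max_connections: 200)

# Route requests through that pool via HTTPoison's `hackney:` option.
HTTPoison.get("https://example.com", [], hackney: [pool: :crawler_pool])
```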

fredwu commented 4 years ago

Hi @happysalada, thanks for doing more investigation! To be honest, I haven't had a chance to use my library for a while, so I don't remember much off the top of my head. I welcome PR fixes! :)

happysalada commented 4 years ago

I'm doing research on what the best options are to pass to hackney. I'll let you know if I find something worth improving. Thanks for your reply!

fredwu commented 12 months ago

So it's been a few years.... cough

I've just pushed up v1.2.0 to address a memory leak.

Also, there have been some updates in HTTPoison and hackney: https://github.com/edgurgel/httpoison/issues/414

I couldn't reproduce this issue so I'm assuming it's resolved. Please feel free to reopen if there's more to discuss. :)