Open virtuman opened 9 years ago
Just to add to the previous comment. Here's a sample output of queue getting backlogged:
output from rabbit queue lookup:
# rabbitmqctl list_queues -p pyspider
Listing queues ...
fetcher2processor 330
newtask_queue 13355
processor2result 0
scheduler2fetcher 0
status_queue 1222
and here's a sample of the rabbitmq stats screen; it looks like it's practically not doing anything:
About the proxies manager, you could use squid. It would rotate through the proxies, auto-retry, and isolate the broken IPs. You just need to point pyspider's proxy at squid; it will handle the rest.
Seems the scheduler is the bottleneck now; could you paste a screenshot of newtask_queue and status_queue (in the queues panel)?
[root@ logs]# rabbitmqctl status | grep -A 4 file_descriptors
 {file_descriptors,
     [{total_limit,3996},
      {total_used,492},
      {sockets_limit,3594},
      {sockets_used,490}]},
Currently the queue is overloaded again:
Looks like the scheduler is not processing the new tasks.
Try increasing the loop_limit parameter of the scheduler to something like 50000.
Is that param processed anywhere? https://github.com/binux/pyspider/search?utf8=✓&q=--loop-limit
I've changed it in the code of scheduler.py; let's see.
It would change the LOOP_LIMIT of scheduler. https://github.com/binux/pyspider/blob/df73747b2dadc0bfe6595fffcb9b5b9ce24f8d1d/pyspider/run.py#L192
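For reference, the idea behind LOOP_LIMIT is that the scheduler only drains up to that many tasks per pass through its main loop, so a low limit throttles how fast a backlog like newtask_queue=13355 can shrink. A generic sketch of that pattern in plain Python (this is an illustration, not pyspider's actual scheduler code):

```python
import queue

# Illustration of a bounded scheduler pass (not pyspider's actual code):
# each call drains at most loop_limit tasks from the new-task queue.
def run_once(newtask_queue, handle_task, loop_limit=1000):
    processed = 0
    while processed < loop_limit:
        try:
            task = newtask_queue.get_nowait()
        except queue.Empty:
            break  # queue drained before hitting the limit
        handle_task(task)
        processed += 1
    return processed

# With loop_limit=3, only 3 of the 5 queued tasks are handled this pass;
# the remaining 2 wait for the next loop iteration.
q = queue.Queue()
for i in range(5):
    q.put(i)
done = []
print(run_once(q, done.append, loop_limit=3))  # -> 3
```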
To date I have crawled about 400 million unique pages with pyspider, averaging 30000 every 5 minutes.
I deploy the python processes separately, but to be honest the bottleneck will always be the scheduler and never the other processes or rabbitmq. When I restart my crawl, it takes about an hour for this query to finish: SELECT taskid, project, schedule FROM wwwv1 WHERE status = 1. But I am using 15K SAS disks in RAID 6 with max compression, and it's about half a billion records...
But I have a very hard time getting more performance out of the scheduler; it's using 60% CPU and 60% memory, but I can't push throughput any higher.
So it's all about the specs and configuration of your database + scheduler. Right now, the way pyspider loads all active tasks into memory will bog down any system when you do large crawls like mine.
But 20 sites with 10K pages should be peanuts; 500 sites with 10K pages on a 24hr cycle is also very doable (small taskdb table).
laapsaap, thanks for sharing the stats, super impressive. Do you cache scraped content locally before parsing/processing? We broke this down into 2 stages: the first is an actual "dumb" scrape that doesn't do extensive parsing, other than finding "next" links, and stores the scraped page to disk. Then we have a secondary project that reads the file off disk and does parsing only. This lets us re-process everything in one shot without having to re-scrape the site if we need to modify the parser, etc., and we get a better idea of the actual data volumes plus the ability to separate heavy parsers from quick crawling.

With your experience of crawling so many pages, do you see any added benefit in doing things this way, or do you think it's an over-architected approach that offers no performance gains? Our top goal is to "suck in" and process pages as quickly as they become available on the sites being scraped. Do you use DOM or regex to parse pages? If DOM, what engine do you use: whatever comes loaded in binux's pyspider, or something like BeautifulSoup or jSoup?
ROY - doesn't the github public repo still have v0.3.3? How/where can we get v0.4? I'd really like to see the stats on average times; it would most definitely be a huge help in attempting to identify the bottlenecks.
Thank you.
Roy, I don't have much experience with configuring squid to do the rotation in the manner that I described; could you point me in the right direction for the config options I should be looking at? I was thinking it should be something along the lines of: each crawl project has a user ID associated with it, and squid fires off requests using different IPs for each crawler, so that each crawler has the full range of IPs available.
I use pyspider to crawl all the pages, do a very simple regex to filter out the information I need, and store it in a mysql database. Then I have another script that processes the data from resultdb and saves it into our own tables.
I am pretty sure there are no performance benefits to saving pages to disk vs. a database when you are doing a lot of pages (inode limits). After you process your crawled pages you still need to save the result somewhere, right?
I mainly use regex, but in some of my scripts also Beautiful Soup; it depends on the site.
What proxies are you using? I proxy all my outgoing requests by using a wrapper on fetcher.
I was already beginning to look at the squid implementation for proxy rotation, but it seems like it may be a bit of an overkill for us. We use a paid service that gives us a list of a few hundred password-protected proxies; they seem pretty well "equipped" for heavy utilization. The proxies are stored along with passwords in a text file; we read it into memory on startup and flip through it on each request for each crawling project. Making a wrapper for the fetcher would have been an easier approach, but lack of experience with python steered us in many "wrong directions", and we still need to revisit that part to consolidate proxy management out of each project and into an extended wrapper - any code samples or hints on how to achieve this would be absolutely AWESOME!
We built a project a couple of years ago and saw about an 87% performance increase through "smarter" (consolidated) use of jSoup and regex together. It was java based, but the volumes were pretty significant as well: at least a few million records a day, with a 2-hour window to parse, process, and import into a production environment where data life expectancy wasn't very high and hardware resources were pretty much unlimited (it was one of the fortune 500 companies, so understanding resource capacity is still pretty significant and important to us). We were pretty happy with the results, but pyspider now seems to do pretty much all of it in one package, so choosing its simplicity over something well established was a no-brainer; however, for many reasons, including not being well acquainted with python over the last few years, our learning curve is just a bit steep this time around :)
The scheduler could be the bottleneck in the current implementation. The scheduler is the only singleton component in the architecture, and every operation (load/query duplicate tasks, insert/update tasks, etc.) is executed directly against the database.
The design of the scheduler is based on my previous project's needs: a few pages re-crawled every 5 minutes, and new pages discovered with an age of 10 days. Therefore not many tasks are active at once (when a task is in active status, it is loaded into the scheduler's memory). We can do something to improve this:
I would add the "De-duplicate" feature to the scheduler in 0.3.x and "Multiple threads" in 0.4.x. But "Separate scheduler" is too complex and hard to deploy (it depends on your needs).
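One cheap way to picture the "De-duplicate" idea: keep a seen-set of taskids in the scheduler so repeat submissions are dropped in memory before they ever hit the database. This is my illustration of the concept, not the actual 0.3.x implementation:

```python
# Illustration only, not pyspider's actual implementation: drop repeated
# taskids in memory before they reach the taskdb.
class TaskDeduper:
    def __init__(self):
        self._seen = set()

    def accept(self, taskid):
        """Return True the first time a taskid is seen, False afterwards."""
        if taskid in self._seen:
            return False
        self._seen.add(taskid)
        return True

dedup = TaskDeduper()
tasks = ["a", "b", "a", "c", "b"]
fresh = [t for t in tasks if dedup.accept(t)]
print(fresh)  # -> ['a', 'b', 'c']
```

At half a billion records a plain set won't fit in memory; a Bloom filter would trade a small false-positive rate (occasionally dropping a genuinely new task) for bounded memory.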
ROY - doesn't github public repo still have v0.3.3 or how / where can we get v0.4 ? I'd really like to see the stats on average times, it would most definitely be a huge help in attempting to identify the bottlenecks.
The master branch of the github repo is the development version of pyspider; I may not have released a new version with the latest features yet. You could try it locally via ./run.py instead of installing.
could you point me in the right direction for config options that I should be looking at
I use this line to generate the squid config entries from my proxy list:
"cache_peer %s parent %s 0 no-query weighted-round-robin weight=%s connect-fail-limit=2 allow-miss max-conn=5" % (host, port, weight)
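Putting that template in context, a small sketch of expanding it for a whole proxy list; the hosts, ports, and weights below are placeholders:

```python
# Sketch: expand the cache_peer template for a whole proxy list.
# Hosts, ports, and weights here are placeholder values.
TEMPLATE = ("cache_peer %s parent %s 0 no-query weighted-round-robin "
            "weight=%s connect-fail-limit=2 allow-miss max-conn=5")

def squid_peers(proxies):
    """proxies: iterable of (host, port, weight) tuples."""
    return "\n".join(TEMPLATE % (host, port, weight)
                     for host, port, weight in proxies)

conf = squid_peers([("10.0.0.1", 3128, 10), ("10.0.0.2", 3128, 5)])
print(conf)
```

The generated lines go into squid.conf; you would also need squid configured to forward everything through the parents rather than connecting directly (e.g. with never_direct) - check the squid docs for the exact directives for your version.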
I'm using a proxy list crawled from the web, so many of the proxies go away after some minutes/hours; the retry and bad-address isolation is important for me.
But I don't know whether squid lets you select which address is used/not used for a given request. If not, it's better to handle it yourself.
We built a project couple of years ago and saw about 87% performance increase through the "smarter" (consolidated) use of jSoup and regex together
If resources are unlimited, then since the processor is distributed, you can always add more processors to solve parsing performance problems.
We currently scrape approximately 20 sites.
Each of these sites has an average of 1,000 to 10,000 pages that we scrape, parse, and store as parsed results to elasticsearch, with a TTL/age of 10 to 30 minutes.
We already have problems and pyspider fails at this scale, but we would really like to be able to scrape 100-500 sites like that.
We would like to be able to scrape all these pages for changes as frequently as possible and discover+scrape+parse+store new pages as soon as they get published on those external websites.
HARDWARE SERVER SPECS
We tried setting this up with all processes except for actual elasticsearch cluster on a single server with the following specs:
RabbitMQ is running on the same server for queue management
STARTUP COMMAND
We use the following command to fire up pyspider and start all processes (our production environment; please correct us if we're not doing it correctly):
All pyspider projects are executed @every 3 minutes, with an age of 30 minutes on documents.
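For reference, that schedule corresponds to decorators inside each project script; a minimal handler of roughly this shape (the URL and parsing are placeholders - a config fragment, not a complete project):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=3)          # re-run on_start every 3 minutes
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=30 * 60)       # pages younger than 30 min are not re-crawled
    def index_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}
```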
CURRENT CUSTOMIZATION
We didn't find any packaged proxy management tools that would let us rotate through proxies, so we added a couple of methods that load the full list of proxies and rotate them on each request for each scraping project, so that every request comes from a unique IP address that won't be reused by the same project until it has run through the whole list of available proxies.
PROBLEM
With all of this set up as described above, it appears that we often overload the queue: many tasks never end up getting picked up, and by then the second round of crawlers has re-scraped the pages and re-attempts parsing.
QUESTIONS
Thank you very much for all your help and suggestions