binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

What is the correct infrastructure for this use-case? #181

Open virtuman opened 9 years ago

virtuman commented 9 years ago

We currently scrape approximately 20 sites.

Each of these sites has an average of 1,000 to 10,000 pages that we scrape, parse, and store as parsed results in Elasticsearch with a TTL/age of 10 to 30 minutes.

We are already running into problems and pyspider fails, but we would really like to be able to scale to 100-500 sites like this.

We would like to scrape all of these pages for changes as frequently as possible, and discover, scrape, parse, and store new pages as soon as they are published on those external websites.

HARDWARE SERVER SPECS

We tried setting this up with all processes, except for the actual Elasticsearch cluster, on a single server with the following specs:

RabbitMQ is running on the same server for queue management.

[screenshot: server specs]

STARTUP COMMAND

We use the following command to fire up pyspider and start all processes (this is our production environment; please correct us if we're not doing it correctly):

/usr/local/bin/python2.7 /usr/local/bin/pyspider -c ./config.json --queue-maxsize 20000 all --processor-num 100 --fetcher-num 37 --result-worker-num 10 --run-in subprocess

All pyspider projects are run with @every 3 minutes and an age on documents of 30 minutes.
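For context, each project script follows pyspider's standard template, roughly like this minimal sketch (the URL is just a placeholder):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=3)            # the scheduler re-runs the entry point every 3 minutes
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=30 * 60)         # a fetched page is considered fresh for 30 minutes
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.index_page)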

CURRENT CUSTOMIZATION

We didn't find any packaged proxy-management tools that would let us rotate through proxies, so we added a couple of methods that load the full list of proxies and rotate through them on each request for each scraping project, so that every request comes from a unique IP address and an IP is not reused for the same project until the whole list has been cycled through.
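The rotation itself is roughly the helper below (a simplified sketch: the file name and helper names are made up, and here the proxy is handed to each request via the proxy argument of self.crawl):

import itertools

def load_proxies(path='proxies.txt'):
    # one "user:password@host:port" entry per line
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# one cycle per scraping project, so an IP is not reused within a project
# until the whole list has been walked through
PROXY_CYCLE = itertools.cycle(load_proxies())

def next_proxy():
    return next(PROXY_CYCLE)

# inside a handler callback:
#   self.crawl(url, proxy=next_proxy(), callback=self.index_page)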

PROBLEM

With everything set up as described above, it appears that we often overload the queue and many tasks never get picked up; by then the second round of crawlers has already re-scraped the pages and re-attempts the parsing.

QUESTIONS

  1. The startup command that we use - is it correct to use it this way?
  2. Is RabbitMQ resource-hungry? The web UI console is showing practically no load, and my assumption is that pyspider spends more time on processing/parsing than on crawling. Should we move RabbitMQ to a separate server/cluster? Would that result in noticeable improvements?
  3. Can the scheduler be moved out to a separate server? Is it resource-intensive on its own, or is it a low-resource-utilization part of pyspider?
  4. Can or should we move any other parts / modules of pySpider to separate servers to see significant improvements?
  5. Any other suggestions / recommendations that we should consider in order to optimize and improve scrape/parse speed so that it at least finishes tasks sometimes?
  6. Does pySpider have the ability to automatically re-scrape any expired page so that we don't need to re-add it to the queue manually? (Currently all our projects start from the home page and run through all available links, but if a link is removed from the website we don't have an easy way to track what happened to that page, i.e. whether the product associated with it was discontinued or is simply no longer available, since we no longer get updates for a page that can no longer be reached via the pages we scrape in the init scripts.)

Thank you very much for all your help and suggestions

virtuman commented 9 years ago

Just to add to the previous comment, here's a sample of the queue getting backlogged:

output from rabbit queue lookup:

# rabbitmqctl list_queues -p pyspider
Listing queues ...
fetcher2processor   330
newtask_queue   13355
processor2result    0
scheduler2fetcher   0
status_queue    1222

and here's a sample of the RabbitMQ stats screen; it looks like it's practically not doing anything:

[screenshot: RabbitMQ stats panel]

binux commented 9 years ago
  1. It's recommended to run the components as separate processes; please refer to Deployment.
  2. With the latest git version, you can check the current average processing time of a project. It may help you figure out the bottlenecks and how many instances of each component you need.
  3. I think your server is powerful enough for your task. But we can look through this issue if there is a bottleneck.
  4. Same as 3.
  5. Refer to 2, 3 and 4.
  6. No, there is no such mechanism, but I can add it soon.

About the proxy manager: you could use squid. It will rotate through the proxies, auto-retry, and isolate broken IPs. You just need to point pyspider's proxy setting at squid, and it will handle the rest.

binux commented 9 years ago

It seems the scheduler is the bottleneck now; could you paste a screenshot of newtask_queue and status_queue (in the queues panel)?

horrower commented 9 years ago

[screenshots: newtask_queue and status_queue panels]

horrower commented 9 years ago

[root@ logs]# rabbitmqctl status | grep -A 4 file_descriptors
 {file_descriptors,
     [{total_limit,3996},
      {total_used,492},
      {sockets_limit,3594},
      {sockets_used,490}]},

horrower commented 9 years ago

Currently the queue is overloaded again:

[screenshots: queues panel]

binux commented 9 years ago

Looks like the scheduler is not processing the new tasks. Try increasing the loop_limit parameter of the scheduler to something like 50000. ref.

horrower commented 9 years ago

Is that parameter actually processed anywhere? https://github.com/binux/pyspider/search?utf8=✓&q=--loop-limit

I've changed it in the code of scheduler.py; let's see.

binux commented 9 years ago

It changes the LOOP_LIMIT of the scheduler. https://github.com/binux/pyspider/blob/df73747b2dadc0bfe6595fffcb9b5b9ce24f8d1d/pyspider/run.py#L192
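In other words, it boils down to raising the scheduler's LOOP_LIMIT; a rough sketch of the equivalent change, assuming LOOP_LIMIT is exposed as a plain class attribute (which is what run.py sets):

from pyspider.scheduler import Scheduler

class BigLoopScheduler(Scheduler):
    # default is 1000: the maximum number of tasks handled in one scheduler
    # loop; raise it when newtask_queue keeps growing
    LOOP_LIMIT = 50000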

laapsaap commented 9 years ago

To date I have crawled about 400 million unique pages with pyspider, averaging 30000 per 5 minutes.

I deploy the Python processes separately, but to be honest the bottleneck will always be the scheduler, never the other processes or RabbitMQ. When I restart my crawl, it takes about an hour for this query to finish: "SELECT taskid,project,schedule FROM wwwv1 WHERE status = 1". But I am using 15K SAS disks in RAID 6 with max compression, and it's about half a billion records...

I have a very hard time getting more performance out of the scheduler; it's using 60% CPU and 60% memory, but I can't get throughput up.

So it's all about the specs and configuration of your database + scheduler. Right now, the way pyspider loads all active tasks into memory will bog down any system when you do large crawls like mine.

But 20 sites with 10K pages should be peanuts; 500 sites with 10K pages on a 24-hour cycle is also very doable (small taskdb table).

virtuman commented 9 years ago

laapsaap, thanks for sharing the stats, super impressive. Do you cache scraped content locally before parsing/processing? We broke this down into two stages. The first is a "dumb" scrape that doesn't do extensive parsing: it only parses the page for "next" links and stores the scraped page to disk. Then a secondary project reads the file off disk and does the parsing only (roughly like the sketch below). This allows us to re-process everything in one shot without having to re-scrape the site if we need to modify the parser, etc., and it gives us a better idea of the actual data volumes plus the ability to separate heavy parsers from quick crawling.

With your experience of crawling so many pages, do you see any added benefit in doing things this way, or do you think it's an over-architected approach that offers no performance gains? Our top goal is to suck in and process pages as quickly as they become available on the sites being scraped. Do you use DOM or regex to parse pages? If DOM, what engine do you use: whatever comes bundled with pyspider, or something like BeautifulSoup or jSoup?
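A trimmed-down sketch of that first stage (the path, URL and selector are illustrative, not our real code):

import hashlib
import os

from pyspider.libs.base_handler import *

DUMP_DIR = '/data/raw_pages'

class DumbScrape(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.save_page)

    @config(age=30 * 60)
    def save_page(self, response):
        # stage 1: dump the raw page to disk and only follow "next" links;
        # a separate project re-reads these files and does the heavy parsing
        fname = hashlib.md5(response.url).hexdigest() + '.html'
        with open(os.path.join(DUMP_DIR, fname), 'w') as f:
            f.write(response.content)
        for each in response.doc('a.next').items():
            self.crawl(each.attr.href, callback=self.save_page)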

Roy, doesn't the public GitHub repo still have v0.3.3? How/where can we get v0.4? I'd really like to see the stats on average times; it would definitely be a huge help in identifying the bottlenecks.

Thank you.

virtuman commented 9 years ago

Roy, I don't have much experience with configuring squid to do the rotation in the manner that I described. Could you point me in the right direction for the config options I should be looking at? I was thinking it would be something along the lines of each crawl project having a user ID associated with it, and squid firing off requests from a different IP for each crawler, so that each crawler has the full range of IPs available?

laapsaap commented 9 years ago

I use pyspider to crawl all the pages, run a very simple regex to filter out the information I need, and store it in a MySQL database. Then I have another script that processes the data from resultdb and saves it into our own tables.

I am pretty sure there are no performance benefits to saving pages to disk vs. a database when you are doing a lot of pages (inode limits). After you process your crawled pages, you still need to save the result somewhere, right?

I mainly use regex, but in some of my scripts also BeautifulSoup; it depends on the site.

What proxies are you using? I proxy all my outgoing requests by using a wrapper around the fetcher.
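Roughly like the sketch below (simplified, and the exact Fetcher entry point may differ between pyspider versions, so treat it as an illustration rather than my actual code):

import itertools

from pyspider.fetcher.tornado_fetcher import Fetcher

PROXIES = itertools.cycle([
    'user:password@10.0.0.1:3128',   # example entries, not real proxies
    'user:password@10.0.0.2:3128',
])

class ProxyFetcher(Fetcher):
    def fetch(self, task, callback=None):
        # inject a proxy into every outgoing task before it is fetched
        task.setdefault('fetch', {})['proxy'] = next(PROXIES)
        return super(ProxyFetcher, self).fetch(task, callback)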

virtuman commented 9 years ago

I was already beginning to look at the squid implementation for proxy rotation, but it seems like it may be a bit of overkill for us. We use a paid service that gives us a list of a few hundred password-protected proxies, and they seem pretty well equipped for heavy utilization. The proxies are stored along with their passwords in a text file; we read it into memory on startup and flip through it for each crawling project. Making a wrapper for the fetcher would have been an easier approach, but lack of experience with Python steered us in many wrong directions, and we still need to revisit that part to consolidate proxy management from each project into an extended wrapper. Any code samples or hints on how to achieve this would be absolutely AWESOME!

We built a project a couple of years ago and saw about an 87% performance increase through smarter (consolidated) use of jSoup and regex together. It was Java based, but the volumes were pretty significant as well: at least a few million records a day, with a 2-hour window to parse, process and import into a production environment where data life expectancy wasn't very high and hardware resources were pretty much unlimited (it was one of the Fortune 500 companies), so understanding resource capacity is still pretty significant and important to us. We were pretty happy with the results, but pyspider now seems to do pretty much everything in one package, so when choosing simplicity over something well established, pyspider was an easy choice. However, for many reasons, including not being well acquainted with Python over the last few years, our learning curve is a bit steep this time around :)

binux commented 9 years ago

The scheduler could be the bottleneck in the current implementation. The scheduler is the only singleton component in the architecture, and every operation (loading/querying duplicate tasks, inserting/updating tasks, etc.) is executed directly against the database.

The design of the scheduler is based on the needs of my previous project: a few pages re-crawled every 5 minutes, and new pages discovered with an age of 10 days. Therefore not many tasks were active at any time (when a task is in active status, it is loaded into the scheduler's memory). We can do a few things to improve this:

I will add the "De-duplicate" feature for the scheduler in 0.3.x and "Multiple threads" in 0.4.x. But a "Separate scheduler" is too complex and hard to deploy (it depends on your needs).


Roy, doesn't the public GitHub repo still have v0.3.3? How/where can we get v0.4? I'd really like to see the stats on average times; it would definitely be a huge help in identifying the bottlenecks.

The master branch of the GitHub repo is the development version of pyspider; I may not have released a new version with the latest features yet. You can try it locally via ./run.py instead of installing it.

Could you point me in the right direction for the config options I should be looking at?

I use this line of code to generate the proxy list entries for squid:

"cache_peer %s parent %s 0 no-query weighted-round-robin weight=%s connect-fail-limit=2 allow-miss max-conn=5" % (host, port, weight)

ref
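For completeness, generating the whole peer section from a proxy list could look something like this (a sketch; the list format and output file name are just examples):

# example proxies as (host, port, weight); replace with your own list
PROXIES = [('10.0.0.1', 3128, 10), ('10.0.0.2', 3128, 5)]

lines = ['never_direct allow all']   # force squid to go through the parent peers
for host, port, weight in PROXIES:
    lines.append(
        'cache_peer %s parent %s 0 no-query weighted-round-robin '
        'weight=%s connect-fail-limit=2 allow-miss max-conn=5'
        % (host, port, weight))

with open('squid.peers.conf', 'w') as f:
    f.write('\n'.join(lines) + '\n')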

I'm using a proxy list crawled from the web, so many of the proxies go away after some minutes or hours; automatic retry and isolating bad addresses are important for me.

But I don't know whether it's possible for you to select which address is used or not used for a particular request with squid. If not, it's better to handle it yourself.

We built a project a couple of years ago and saw about an 87% performance increase through smarter (consolidated) use of jSoup and regex together

If resources are unlimited, then since the processor is distributed, you can always add more processor instances to solve parsing performance problems.