divkakwani / webcorpus

Generate large textual corpora for almost any language by crawling the web

Distributed Setup #14

Open divkakwani opened 4 years ago

divkakwani commented 4 years ago

Regarding the distributed setup, here is what I propose. We will need scrapyd, RabbitMQ, and a distributed file system (HDFS or SeaweedFS).

(1) Adding nodes: on whatever node we want to add, we will have to run scrapyd manually. Once scrapyd is up and running, we can control it through scrapyd's HTTP API.

(2) The DFS will hold the jobdirs and the crawled data. The nodes will regularly update the jobdirs.

(3) RabbitMQ will be our event messenger; the running crawlers will push their events to it.

(4) Then we can run the dashboard on any machine. It will show crawl statistics and the list of live nodes, both obtained from the events, and it will start/stop crawls through scrapyd's HTTP API.

More specifically, the start-a-crawl operation will look like this: `<choose node> <list of news sources>`. The crawler will query the DFS to retrieve the latest jobdir and then initiate the crawl; a rough sketch follows.
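As a sketch under assumed names (project `webcorpus`, spider `news-spider`, a node at `crawler-node-1:6800`, and the DFS mounted at `/mnt/dfs`; none of these are fixed yet), the dashboard-side flow could look like:

```python
import requests

SCRAPYD = "http://crawler-node-1:6800"    # assumed address of the chosen node
DFS_JOBDIR = "/mnt/dfs/jobdirs/news"      # assumed jobdir location on the DFS

# Check that scrapyd is alive on the chosen node.
status = requests.get(f"{SCRAPYD}/daemonstatus.json").json()
assert status["status"] == "ok"

# Schedule the crawl, pointing Scrapy's JOBDIR at the shared file system
# so that the crawl state can later be resumed from any node.
resp = requests.post(f"{SCRAPYD}/schedule.json", data={
    "project": "webcorpus",               # assumed scrapyd project name
    "spider": "news-spider",              # assumed spider name
    "setting": f"JOBDIR={DFS_JOBDIR}",
}).json()
print("job id:", resp["jobid"])
```

Stopping a crawl would go through scrapyd's `cancel.json` endpoint in the same way.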

Let's brainstorm on this during the current week and then go ahead with the implementation starting next week.
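To seed the discussion on point (3), here is a minimal sketch of how a running crawler could push its events to RabbitMQ with pika (the broker host, queue name, and event payload shape are all assumptions):

```python
import json

import pika

# Connect to the RabbitMQ broker (host is a placeholder).
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq-host"))
channel = connection.channel()

# Durable queue so events survive a broker restart; the name is a placeholder.
channel.queue_declare(queue="crawl-events", durable=True)

# Publish one crawl-stats event; the payload shape is hypothetical.
event = {"node": "crawler-node-1", "spider": "news-spider",
         "responses": 293, "elapsed_seconds": 10.9}
channel.basic_publish(
    exchange="",
    routing_key="crawl-events",
    body=json.dumps(event),
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
connection.close()
```

The dashboard would consume from the same queue to update its statistics and its list of live nodes.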

divkakwani commented 4 years ago

In the discussion I had with Gokul, we concluded that we first need to assess the capabilities of GCP.

The command `scrapy bench` can be used to benchmark the crawler. I ran it on two different machines, and here are the results.

Personal Machine:

```
{'downloader/request_bytes': 112822,
 'downloader/request_count': 293,
 'downloader/request_method_count/GET': 293,
 'downloader/response_bytes': 723461,
 'downloader/response_count': 293,
 'downloader/response_status_count/200': 293,
 'elapsed_time_seconds': 10.937102,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 12, 26, 13, 23, 50, 682711),
 'log_count/INFO': 20,
 'memusage/max': 54001664,
 'memusage/startup': 54001664,
 'request_depth_max': 12,
 'response_received_count': 293,
 'scheduler/dequeued': 293,
 'scheduler/dequeued/memory': 293,
 'scheduler/enqueued': 5861,
 'scheduler/enqueued/memory': 5861,
 'start_time': datetime.datetime(2019, 12, 26, 13, 23, 39, 745609)}
```

My Lab Machine:

```
{'downloader/request_bytes': 274549,
 'downloader/request_count': 599,
 'downloader/request_method_count/GET': 599,
 'downloader/response_bytes': 1915853,
 'downloader/response_count': 599,
 'downloader/response_status_count/200': 599,
 'elapsed_time_seconds': 10.557146,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 12, 26, 13, 28, 32, 395575),
 'log_count/INFO': 20,
 'memusage/max': 53260288,
 'memusage/startup': 53260288,
 'request_depth_max': 22,
 'response_received_count': 599,
 'scheduler/dequeued': 599,
 'scheduler/dequeued/memory': 599,
 'scheduler/enqueued': 11981,
 'scheduler/enqueued/memory': 11981,
 'start_time': datetime.datetime(2019, 12, 26, 13, 28, 21, 838429)}
```

That works out to roughly 27 requests/sec on my personal machine versus about 57 requests/sec on the lab machine. @GokulNC I see that you already have a GCP instance running. Can you please post the result of the same command for that instance?

I also looked up GCP's network pricing: egress costs $0.12/GB and ingress is free. Since crawling is almost entirely ingress, network pricing won't be much of an issue; even copying a 100 GB corpus out of GCP would cost only about $12 in egress. However, there is still the cost of running the VMs themselves.

GokulNC commented 4 years ago

Here's my output for `scrapy bench` on my GCP VM:

```
{'downloader/request_bytes': 207643,
 'downloader/request_count': 493,
 'downloader/request_method_count/GET': 493,
 'downloader/response_bytes': 1394394,
 'downloader/response_count': 493,
 'downloader/response_status_count/200': 493,
 'elapsed_time_seconds': 10.666416,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 12, 26, 14, 30, 2, 366636),
 'log_count/INFO': 20,
 'memusage/max': 51679232,
 'memusage/startup': 51679232,
 'request_depth_max': 18,
 'response_received_count': 493,
 'scheduler/dequeued': 493,
 'scheduler/dequeued/memory': 493,
 'scheduler/enqueued': 9861,
 'scheduler/enqueued/memory': 9861,
 'start_time': datetime.datetime(2019, 12, 26, 14, 29, 51, 700220)}
```

And sure, the GCP costs are fine by me; we'll discuss it during our next call.

GokulNC commented 4 years ago

BTW, the VM I used above has 4 CPU cores. In GCP it's possible to use a large number of CPU cores per VM, so extensively testing all those different configurations might be helpful; one possible harness is sketched below.
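For instance, a small harness could sweep `scrapy bench` over a few concurrency settings on each VM shape (the values below are arbitrary starting points, not recommendations):

```python
import subprocess

# Run Scrapy's built-in benchmark under several concurrency settings.
for concurrency in (16, 32, 64, 128):
    print(f"--- CONCURRENT_REQUESTS={concurrency} ---")
    subprocess.run(
        ["scrapy", "bench", "-s", f"CONCURRENT_REQUESTS={concurrency}"],
        check=True,
    )
```

Running the same sweep on VMs with different core counts would give us throughput-per-core numbers to compare.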