UTMediaCAT / mediacat-domain-crawler

Internet domain crawler

This README pertains to the crawling aspect of the application. The crawl scripts are located in the /newCrawler/ folder.

The domain crawler takes crawl domains such as https://www.nytimes.com/, or a crawl_scope.csv file that contains those domains.

Filter: the crawler will only crawl URLs that match one of the domains from crawl_scope.csv or the input domains (e.g. https://www.nytimes.com/some/page).
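As a rough illustration, in-scope filtering amounts to comparing a URL's hostname against the hostnames of the scope domains. The snippet below is a minimal sketch of that idea, not the actual matching logic in the crawler scripts.

```javascript
// Minimal sketch of the scope filter described above; the real matching
// logic in batchCrawl.js / crawl.js may differ.
const { URL } = require('url');

const scopeDomains = ['https://www.nytimes.com/'];          // from -l flags or crawl_scope.csv
const scopeHosts = scopeDomains.map((d) => new URL(d).hostname);

function inScope(link) {
  try {
    return scopeHosts.includes(new URL(link).hostname);
  } catch (err) {
    return false;                                           // skip malformed URLs
  }
}

console.log(inScope('https://www.nytimes.com/some/page'));  // true
console.log(inScope('https://www.example.org/other'));      // false
```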

At the end of the crawl, it can send an email notification indicating whether the crawl stopped. Email must be set up before the crawl starts (or ignored if not desired; it is commented out by default).

Enter the email credentials in the crawl.js script under the transporter constant.

PLEASE do not EVER commit your password. As a future improvement, we should probably move credentials into a separate constants file that is git-ignored.
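For reference, the sketch below shows what a nodemailer transporter with credentials pulled from environment variables could look like; the mail service and the environment variable names are assumptions, and the actual transporter constant in crawl.js may differ.

```javascript
// Hedged sketch: assumes the transporter is a nodemailer transport and that
// credentials come from (hypothetical) environment variables instead of
// being hard-coded; adjust to match the real constant in crawl.js.
const nodemailer = require('nodemailer');

const transporter = nodemailer.createTransport({
  service: 'gmail',                     // assumption: whichever mail service the project uses
  auth: {
    user: process.env.CRAWL_EMAIL_USER, // hypothetical variable names
    pass: process.env.CRAWL_EMAIL_PASS,
  },
});
```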

Prerequisites

cd newCrawler to get to the masterCrawler.py script

npm install to install node dependencies

If on the Arbutus server: npm i puppeteer@13 to install Puppeteer version 13.7.0

Run the master crawler

This script is run in a similar fashion to the other crawlers, but takes an extra flag specifying the time period for which the crawler should run; after that period, the script restarts the crawler. This avoids the browser timeouts and stack memory issues encountered when the crawler runs for too long (> 30 hours). The -t flag takes the time in minutes.

Note: you need to run the master crawler together with one of the other crawlers; the example below runs masterCrawler.py with batchCrawl.js.

python3 masterCrawler.py batchCrawl.js -l https://www.nytimes.com/ -t 300

For the Graham instance, running python3 masterCrawler.py batchCrawl.js -n 1000 -m 20000 -l https://example.com/ -t 240 should optimize crawler speed.

Run the batch crawler

node batchCrawl.js -f ../../../mediacat-hidden/domain.csv

Run the NYTimes archive crawler

node nyCrawl.js -n 5000 -f full_scope.csv

The NYTimes archive crawler will crawl the search archive at https://www.nytimes.com/search?dropmab=true&query=&sort=newest, repeatedly clicking the "Show More" button and then scrolling down until no "Show More" button remains (a sketch of this loop follows the cautious example below).

Cautious version (has a stealth option to crawl more slowly: it sleeps for a short time after each request and stops if there are too many failed requests):

node nyCrawlcautious.js -n 200 -stealth 5000 -f full_scope.csv
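The sketch below illustrates the click-and-scroll loop described above, including the sleep and failure cut-off used by the cautious version. The button selector, sleep time, and failure threshold are assumptions, so the real nyCrawl.js / nyCrawlcautious.js implementations may differ.

```javascript
// Hedged sketch of the "Show More" click-and-scroll loop; selector, sleep
// time, and failure threshold are assumptions, not taken from nyCrawl.js.
const puppeteer = require('puppeteer');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.nytimes.com/search?dropmab=true&query=&sort=newest');

  let failures = 0;
  while (failures < 5) {                                     // stop after too many failed attempts
    const showMore = await page.$('button[data-testid="search-show-more-button"]');
    if (!showMore) break;                                    // no "Show More" button left
    try {
      await showMore.click();
      await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
      failures = 0;
    } catch (err) {
      failures += 1;                                         // count failed clicks
    }
    await sleep(2000);                                       // cautious version: sleep between requests
  }

  await browser.close();
})();
```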

Run the Puppeteer crawler or Cheerio crawler (not yet functional)

node --max-old-space-size=7168 crawl.js -f ../../../mediacat-hidden/domain.csv -n inf

node --max-old-space-size=7168 crawlCheerio.js -f ../../../mediacat-hidden/domain.csv -n inf

Output

The output URL JSON files will be stored under /Results/https_example_com/.
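The folder name appears to be the start URL with non-alphanumeric characters replaced by underscores (https://example.com/ becomes https_example_com). The sketch below illustrates that mapping, though the exact sanitization in the crawler may differ.

```javascript
// Hedged sketch of how a start URL maps to its Results folder name; the
// crawler's actual sanitization rules may differ.
function resultFolderName(startUrl) {
  return startUrl.replace(/[^a-zA-Z0-9]+/g, '_').replace(/^_+|_+$/g, '');
}

console.log(resultFolderName('https://example.com/'));      // https_example_com
console.log(resultFolderName('https://www.nytimes.com/'));  // https_www_nytimes_com
```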

Apify tips

When using Apify, it is important to know that if the crawler needs to be rerun without the previous queue, the apify_storage directory must be deleted before running. Otherwise, the crawl will continue from where it left off in the queue. The Apify queue is stored under /newCrawler/apify_storage/.

Crawl in Stealth

You might get errors like 403 Forbidden or 429 Too Many Requests during the crawl, especially for smaller domains.

This most likely means you are crawling too fast, so the domain is blocking you. There are a few strategies you can use to avoid this problem, such as crawling more slowly and pausing between requests (see the cautious crawler above); a sketch of one approach follows.
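One such strategy is to pause between requests and back off when the server answers 403 or 429. The helper below is illustrative only and is not part of the existing crawler scripts; it assumes a Puppeteer page object.

```javascript
// Illustrative back-off helper (not part of the crawler scripts): pause
// between requests and retry with a longer, randomized wait on 403/429.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeGoto(page, url, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    const response = await page.goto(url);
    const status = response ? response.status() : 0;
    if (status !== 403 && status !== 429) return response;
    await sleep((i + 1) * 10000 + Math.random() * 5000);     // wait longer on each retry
  }
  throw new Error(`Giving up on ${url} after repeated 403/429 responses`);
}
```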

Testing

The testing script has been written to time how long it takes for the crawlers to crawl through a certain number of links. The user will have to comment or uncomment the tests to run in main().

testTime1 (or nytimes) is the first of a chain of tests for the Puppeteer crawler.

Similarly for testTime1Cheerio and nytimesCheerio.

Monitoring the results

Instructions for monitoring the results of the crawl are in the README in the monitor directory.

Restarting the Crawl

A stopped or failed crawl can be restarted from where it left off by running the crawler again in the same directory. The Apify crawler will use the requests stored in newCrawler/apify_storage/request_queues to restart the crawl.

If you want to restart a crawl from scratch, delete the newCrawler/apify_storage/ directory.
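For example, from the repository root (adjust the path if you run the crawler from inside newCrawler/):

rm -rf newCrawler/apify_storage/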

forever.js

A script that helps restart the crawl if needed.

Combining results

rsync -au ./path_to_individual_source_folder/ ./path_to_destination_combined_folder/

With the -u (--update) flag, rsync skips files that are newer in the destination, so when two files have the same name the version with the latest modification time is kept.