Open HenryLeongStat opened 6 years ago
pic_crawler.py is used to get the URLs of the pictures on a website.
Example:
manchongleong@Mande-MacBook-Pro:~/Documents/GitHub/CEHI-Web-crawler-for-residential-history$ python pic_clawer.py -url https://cehi.rice.edu/about/staff
/sites/g/files/bxs176/themes/site/images/rice-logo.jpg
https://cehi.rice.edu/sites/g/files/bxs176/themes/site/logo.png
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/mlm_old_0.jpg?itok=ElFCt4xx
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/RAnthopolos_0.jpg?itok=hiURPDSW
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/kbergen_photo.jpg?itok=jEXkkwOM
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/MercedesBravo.jpg?itok=Vj9gnlz1
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/BG_final_0_0.jpg?itok=db9wLgZH
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/Jocelyn_0.jpg?itok=ibbpN9YR
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/Hien_Bio_Sketch_Picture_0.jpg?itok=oe-IIDAQ
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/Yang_0.jpg?itok=Xu9IIj8z
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/brian_0.jpg?itok=aRvSr0Zo
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/claire_0.jpg?itok=aDALRHvT
https://cehi.rice.edu/sites/g/files/bxs176/f/styles/sdefault/public/Joshua-bw-nowb.jpg?itok=9zeGulpB
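Note that the first line of the output above is a relative path, while the rest are absolute URLs. The issue does not show how the script extracts the image addresses, so the following is only a minimal sketch of one way to do it with the Python standard library, assuming the pictures are plain `<img>` tags. The names `ImageSrcParser` and `image_urls` are hypothetical and not taken from the repository. Resolving each `src` against the page address with `urljoin` would turn the relative path into a full URL:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_urls(html, base_url):
    """Return the image URLs found in html, resolved against base_url."""
    parser = ImageSrcParser()
    parser.feed(html)
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    return [urljoin(base_url, src) for src in parser.sources]


if __name__ == "__main__":
    # Sample markup using the relative path from the output above.
    sample = '<img src="/sites/g/files/bxs176/themes/site/images/rice-logo.jpg">'
    for url in image_urls(sample, "https://cehi.rice.edu/about/staff"):
        print(url)
```

The sketch works on an HTML string, so downloading the page (for example with `urllib.request`) is left out.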
Links are a fundamental element of recursive web crawling.
Currently, the script is written as a CLI (command-line interface). If you want to try the script, open a terminal and run it as follows:
python arg_clawer.py -url http://xxxxx
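The issue does not show how the script reads its arguments. As a rough sketch only, a single-dash `-url` option like the one in the command above could be declared with `argparse` from the standard library. The function names `build_parser` and `main` are hypothetical, and the fetching step is left as a placeholder:

```python
import argparse


def build_parser():
    """Build a parser that accepts the single-dash -url option."""
    parser = argparse.ArgumentParser(
        description="Print the URLs of the pictures found on a web page."
    )
    # argparse derives the attribute name "url" from the option string "-url".
    parser.add_argument("-url", help="address of the page to scan")
    return parser


def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)  # argv=None means "read sys.argv[1:]"
    if args.url is None:
        parser.print_usage()
        return 1
    # Placeholder: a real script would fetch args.url and list its images here.
    print(f"Would scan: {args.url}")
    return 0


if __name__ == "__main__":
    main()
```

Passing `argv` explicitly, as in `main(["-url", "http://xxxxx"])`, makes the argument handling easy to try without a terminal.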
For example:
The output will be like: