akamhy / waybackpy

Wayback Machine API interface & a command-line tool
https://pypi.org/project/waybackpy/
MIT License

Not able to run getting error #37

Closed h1hakz closed 3 years ago

h1hakz commented 3 years ago

Hi, when running the tool I am getting errors like the one below. Kindly help.

[Screenshot: Screen Shot 2020-11-25 at 4 42 12 PM]

akamhy commented 3 years ago

I replaced urllib with the requests package and enabled threading (urllib handles server-side redirects poorly). Install the updated version with pip install git+https://github.com/akamhy/waybackpy.git -U.

waybackpy --url yahoo.com --user_agent "my-user-agent" --known_urls --subdomain --alive will fetch 137K URLs (https://web.archive.org/cdx/search/cdx?url=*.yahoo.com/*&output=json&fl=original&collapse=urlkey) and test all of them. Running this command may take a long time, and your IP may end up blocked by yahoo.com for sending too many requests.

h1hakz commented 3 years ago

Is there any alternative to avoid getting blocked, such as delaying the requests?

akamhy commented 3 years ago

Is there any alternative to avoid getting blocked, such as delaying the requests?

You can use proxychains (https://github.com/haad/proxychains) or Tor (https://www.torproject.org/) to rotate among a bunch of IPs, which makes it much harder for them to block or rate-limit you. Delaying the requests using a queue would take too much time for 137K URLs: even with no added delay, checking all the URLs at a rate of 1 URL/sec would take about 38 hours (roughly 1.6 days).
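The time estimate above is simple arithmetic, and a small helper makes it easy to redo for other request rates. This is only an illustrative sketch; the function name is made up for this example and is not part of waybackpy.

```python
def sequential_check_hours(url_count: int, urls_per_second: float = 1.0) -> float:
    """Hours needed to check url_count URLs one after another at a fixed rate."""
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    return url_count / urls_per_second / 3600  # 3600 seconds per hour


if __name__ == "__main__":
    hours = sequential_check_hours(137_000)
    print(f"{hours:.1f} hours, or {hours / 24:.1f} days")
```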

A better solution to reduce the time:

import waybackpy

URL = "yahoo.com"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.15"

waybackpy_url_instance = waybackpy.Url(url=URL, user_agent=USER_AGENT)
known_urls = waybackpy_url_instance.known_urls(alive=False, subdomain=True)  # alive=False: checking every URL is too time-consuming on a single machine

known_urls is a Python list containing all 137K URLs. You can break the list into smaller chunks and process the chunks on different machines. On GCP, you can use https://cloud.google.com/compute/docs/instances/moving-instance-across-zones to avoid getting rate limited.
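One possible way to break the list into chunks, as a minimal sketch: split_into_chunks is a hypothetical helper written for this example (it is not part of waybackpy), and the sample URLs stand in for the real known_urls list.

```python
from typing import List, Sequence


def split_into_chunks(items: Sequence[str], num_chunks: int) -> List[List[str]]:
    """Split items into num_chunks contiguous lists of nearly equal size."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(len(items), num_chunks)
    chunks = []
    start = 0
    for index in range(num_chunks):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if index < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


if __name__ == "__main__":
    # Placeholder data; in practice pass the known_urls list here.
    sample_urls = [f"https://example.com/page{i}" for i in range(10)]
    for index, chunk in enumerate(split_into_chunks(sample_urls, 3)):
        print(index, len(chunk))
```

Each chunk can then be saved to its own file and handed to a separate machine.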

h1hakz commented 3 years ago

Hi Akash,

Thanks for your reply.

I tried using proxychains and Tor, but I am not able to rotate the IPs. Can you help me out?

akamhy commented 3 years ago

If you are having trouble with proxychains, watch https://youtu.be/qsA8zREbt6g?t=215 (you don't have to watch the whole 13-minute video).

After you change the conf, it's as simple as: proxychains python3 your-program-name.py

Use as many proxies as you can, and write the output right after checking each URL's availability; otherwise you may end up wasting too much memory.

akamhy commented 3 years ago

Use random mode in the proxychains conf.
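For reference, a proxychains.conf fragment with random mode enabled might look roughly like this. Treat it as a sketch: the file location and defaults vary between proxychains versions, the second proxy entry is a placeholder address, and only the first entry (Tor's default local SOCKS port) refers to something real.

```text
# Pick proxies from the list at random for each connection.
# Make sure strict_chain and dynamic_chain are commented out.
random_chain
# Number of proxies to chain together per connection.
chain_len = 1

[ProxyList]
# Tor's default local SOCKS port
socks5 127.0.0.1 9050
# Placeholder entry: replace with a proxy you actually control
socks5 192.0.2.10 1080
```

Random mode only rotates IPs when the list contains more than one working proxy.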