TheLethalCode / Artemis-arrow

Speed up anime list scraping #33

Closed TheLethalCode closed 5 years ago

TheLethalCode commented 5 years ago

Right now, scraping the details of every anime will take around 10 hours, as a rough estimate. We have to speed it up. One possible option is to use multiprocessing.

prashantramnani commented 5 years ago

With multiprocessing you can't increase the speed by more than about 2 times. Using more than 2 processes makes the HTTP requests fail with a 429 error, i.e. too many requests at the same time.
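A minimal sketch of how a scraper could live with that limit: keep the pool at 2 workers and back off whenever the server answers 429. It uses `multiprocessing.pool.ThreadPool` (threads rather than processes, since the work is network-bound). `fetch` is a hypothetical callable returning `(status, body)`, and the retry count and delays are assumptions, not values from this project.

```python
import time
from multiprocessing.pool import ThreadPool


def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry with exponential backoff while it returns 429.

    `fetch` is a hypothetical callable returning a (status, body) tuple.
    """
    status, body = None, None
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        # Too many requests: wait 1x, 2x, 4x, ... base_delay before retrying.
        sleep(base_delay * 2 ** attempt)
    # Give up and hand back the last response seen.
    return status, body


def scrape_all(fetch, urls, workers=2):
    """Fetch every URL with a small pool; results come back in input order."""
    with ThreadPool(workers) as pool:
        return pool.map(lambda u: fetch_with_backoff(fetch, u), urls)
```

With a real HTTP client, `fetch` would wrap the request call and return its status code and body. Raising `workers` only helps as long as the server does not start throttling.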

TheLethalCode commented 5 years ago

Are you sure? I am working on a project that uses multiprocessing to brute-force a password. See this for reference: http://blog.adnansiddiqi.me/how-to-speed-up-your-python-web-scraper-by-using-multiprocessing/ .

prashantramnani commented 5 years ago

I tried that as well, but it still says too many requests. [screenshot from 2018-12-13 07-37-24]

TheLethalCode commented 5 years ago

Dude, I am not even sure what you did, but I just got it working on my laptop, so I know it is possible. You might have overlapping arguments.

prashantramnani commented 5 years ago

What do you mean by overlapping arguments? And what did you do exactly?

TheLethalCode commented 5 years ago

My bad. The server is throttling the connection. The only way around it is to change the IP, which is too much work. If you want to try it, have a look. But since this is a one-time script, we will just let it run for the 10 hours.

prashantramnani commented 5 years ago

I can do it, but it would be difficult to make it work on any of the campus Wi-Fi networks. I can make it work over mobile data. Should I do it?

TheLethalCode commented 5 years ago

As I said, it's overkill for a script that only runs once.

prashantramnani commented 5 years ago

Umm, making it work over mobile data isn't that much work. Anyway, running it for 10 hours is fine. I guess there will be errors, though: the script uses "soup.find()", and in a few cases there is no description for the anime, so that might raise an error. Should I fix it? It only needs a try/except.
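A minimal sketch of that guard, assuming the failure comes from `soup.find()` returning `None` when nothing matches, so reading text from the result raises `AttributeError`. `text_or_default` is a hypothetical helper, not part of this project; it works with any object exposing `find`, such as a BeautifulSoup object.

```python
def text_or_default(soup, name, default="", **attrs):
    """Return the stripped text of the first matching tag, or `default` if none."""
    tag = soup.find(name, **attrs)
    if tag is None:
        # find() found nothing, e.g. an anime page without a description.
        return default
    return tag.get_text(strip=True)
```

Something like `text_or_default(soup, "p", default="N/A")` would then replace a bare `soup.find("p").text`. The tag name here is only an example. Wrapping the original call in `try`/`except AttributeError` has the same effect.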

TheLethalCode commented 5 years ago

Nah, it is actually more work than you think. If you go that route, you have to keep changing your IP before every request, and the change itself takes time. And anyway, there are no errors up to 10000; I have handled them.

prashantramnani commented 5 years ago

Yeah, there was a pretty nice article on rotating your IPs, but fine anyway. Cool, then I guess this one is done.