Open CannotTouch opened 1 year ago
I'm still getting 429 errors with max threads set to 5.
Setting max threads to 1 works for me, although it's really slow.
Suggestion: it should probably use the metadata to avoid re-scraping older content that was already scraped before, or at least not go all the way back to the start every time.
A little update from some testing: I did a fresh install with commit f60d2614fa9553c96ba7c2b39c30d044e155903f and everything works correctly with max threads set to 5. I only get the 429 error when it scrapes the subscription list; it probably doesn't apply the limit there. But after resetting my IP it finished scraping the list, and I then scraped a model with over 2,000 files without hitting the 429 error.
Well, the rate limit seems to trigger on metadata scans; can we limit that to x threads?
We're waiting for the owner to find the time and a way to solve it. At the moment it seems fairly random to me... sometimes I get the error, sometimes not. OF has probably set variable limits based on their traffic... I don't know...
As long as you have the previous response, it should be possible to avoid a lot of scraping. I'm actually going to add this to my fork very soon; just progressing through all the different post types.
https://github.com/excludedBittern8/ofscraper
Edit: caching may not be possible. The downloads have a policy key, and it seems to change frequently, possibly every day. Without it you cannot download.
@DIGITALCRIMINALS any news about a workaround? :p
Sorry, yes I have fixed it on my end and I know what's wrong. It's to do with the script throwing exceptions in the network manager.
Every uncaught exception it throws takes up a semaphore/thread and causes it to hang (or loop) forever. I've already handled it on my end but the commit relies on another unfinished commit that changes the way downloads are handled.
I don't mind pushing the commit that handles the exception, but the script won't be able to report that a download failed.
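The failure mode described above (each uncaught exception permanently consuming a semaphore slot until the pool hangs) is the classic reason to release in a `finally` block. A minimal sketch of the pattern, assuming a concurrency limit of 5; this is illustrative, not the actual session_manager code:

```python
import asyncio

sem = asyncio.Semaphore(5)  # max concurrent requests (assumed limit)

async def guarded_request(fetch, url):
    """Acquire a slot and release it even if the request raises."""
    await sem.acquire()
    try:
        return await fetch(url)
    except Exception:
        # Without this handler (or the finally below), a raised
        # exception would leave the semaphore slot held forever,
        # and the pool would eventually hang with zero free slots.
        return None
    finally:
        sem.release()
```

Returning `None` on failure is also why, as noted above, a quick fix like this can't report *which* download failed without further changes.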
Take your time, no worries. I'm happy to hear this news. Thanks for the support :)
I probably fixed it in the latest commit. https://github.com/DIGITALCRIMINALS/UltimaScraperAPI/blob/93b7fd08ab7153e583cfa1c5ae50aab7878c8dab/ultima_scraper_api/managers/session_manager.py#L176
Script will detect 429 (rate limit) and automatically resolve itself by checking every 5 seconds. https://github.com/DIGITALCRIMINALS/UltimaScraperAPI/blob/93b7fd08ab7153e583cfa1c5ae50aab7878c8dab/ultima_scraper_api/managers/session_manager.py#L149
Personally I found that they only allow 1K requests per IP every 5 minutes. OF resets the rate limit every 5 minutes. You can still batch thousands of requests before the OF rate limiter kicks in.
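The retry-on-429 behaviour described (poll every 5 seconds until the rate limit clears) can be sketched like this. The function and the generic `fetch` callable are illustrative, not the actual UltimaScraperAPI code linked above:

```python
import asyncio

RATE_LIMIT_STATUS = 429
RETRY_DELAY = 5  # seconds between checks, per the comment above

async def request_until_ok(fetch, url, delay=RETRY_DELAY):
    """Retry while the server returns 429; the ban lifts within ~5 min."""
    while True:
        status, body = await fetch(url)
        if status != RATE_LIMIT_STATUS:
            return body
        # Rate-limited: wait and try again.
        await asyncio.sleep(delay)
```

Since (per the observation above) the limit resets every 5 minutes, the worst case is simply idling until the window rolls over, rather than crashing or hanging.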
session_manager.py is not included in the latest UltimaScraper commit. Does it get added when updating, or should we manually add UltimaScraperAPI?
Right now my setup is working, slow but I think good enough, so I want to be sure and not mess something up by replacing files I shouldn't.
I'm currently processing a user with over 3,300 posts, and the script has been downloading at over 150 Mbps for over 24 hours and is still going at it...so the rate limiting definitely seems to be solved with the latest commit, although I'm not sure if that runtime is normal.
Depends on how many threads you've set it at
@DIGITALCRIMINALS thanks for the fix, but at the moment I cannot test it because sadly I've stumbled upon another error: https://github.com/DIGITALCRIMINALS/UltimaScraper/issues/953
session_manager.py is not included in the latest UltimaScraper commit, does it get added when updating? or should we manually add the UltimaScraperAPI?
@DIGITALCRIMINALS
It isn't in the same directory but is downloaded, just run the update command to keep it updated at the latest version.
I think OF has changed something, so now scraping gets stuck. If you try to load the site in a browser you get HTTP ERROR 429, i.e. a temporary ban for too many requests (to get around it, just change your IP; the timer resets and the script starts working correctly again).
How can we configure it better to avoid this? (If possible, insert some delay between requests.)
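One client-side way to stay under a limit like the "1,000 requests per 5 minutes" observed earlier in the thread is to space requests out with a shared throttle. A minimal sketch; the numbers come from that observation, not from any documented OF limit:

```python
import asyncio
import time

class Throttle:
    """Enforce a minimum interval between requests across all tasks."""

    def __init__(self, max_requests: int, per_seconds: float):
        # e.g. 1000 requests / 300 s -> one request every 0.3 s
        self.interval = per_seconds / max_requests
        self._lock = asyncio.Lock()
        self._last = 0.0

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            wait_for = self._last + self.interval - now
            if wait_for > 0:
                await asyncio.sleep(wait_for)
            self._last = time.monotonic()

# Shared instance: call `await throttle.wait()` before each request.
throttle = Throttle(max_requests=1000, per_seconds=300)
```

This trades raw speed for never hitting the 429 ban at all, which may come out faster overall than bursting and then waiting out the 5-minute reset.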