Open bigximik opened 2 years ago
We are hitting a speed limit when scraping GitHub, at least for repo homepages. From one IP address it is around 2 repos per second, but it is only 2-3 times faster from 20 different IP addresses (all from the same datacenter, toolkit). There are a lot of status code 429 (rate limiting) responses. I wonder whether this is general GitHub policy or whether our datacenter just got lucky? Experiment code is here: https://github.com/bigcode-project/bigcode-analysis/blob/github_scraping_test/data_analysis/github_scraping_test/github_scrapping_test.ipynb
Could anyone run this experiment on their Ray cluster, or repeat the test some other way from their own range of IP addresses?
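If you want to repeat the test without the notebook, something like the following should be enough for a rough single-machine number. This is only a minimal sketch and not the notebook's code: `fetch_status` and `measure` are hypothetical helper names, the `User-Agent` string is arbitrary, and requests are sent sequentially with the standard library only, so it measures one IP address at a time.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for a GET request to `url`."""
    # The User-Agent value is an arbitrary placeholder (assumption).
    req = urllib.request.Request(url, headers={"User-Agent": "rate-limit-test"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses, including 429.
        return err.code


def measure(urls, fetch=fetch_status, clock=time.monotonic):
    """Fetch each URL once, sequentially; report throughput and 429 count."""
    start = clock()
    counts = Counter(fetch(u) for u in urls)
    elapsed = clock() - start
    return {
        "requests": sum(counts.values()),
        "ok": counts[200],
        "rate_limited": counts[429],
        # Successful pages per second over the whole run.
        "ok_per_second": counts[200] / elapsed if elapsed > 0 else float("nan"),
        "status_counts": dict(counts),
    }
```

Call `measure(list_of_repo_urls)` with your own list of repo homepage URLs and compare `ok_per_second` and `rate_limited` between one machine and several. The `fetch` and `clock` parameters are there only so the function can be checked without touching the network.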