bigcode-project / bigcode-analysis

Repository for analysis and experiments in the BigCode project.
Apache License 2.0

github scraping speed limit #15

Open bigximik opened 1 year ago

bigximik commented 1 year ago

We are hitting a speed limit when scraping GitHub, at least for repo homepages. From one IP address the rate is around 2 repos per second, but from 20 different IP addresses (all from the same datacenter, toolkit) it is only 2-3 times faster. We get a lot of status code 429 (rate limiting) responses. I wonder whether this is general GitHub policy or our datacenter just got lucky? The experiment code is here: https://github.com/bigcode-project/bigcode-analysis/blob/github_scraping_test/data_analysis/github_scraping_test/github_scrapping_test.ipynb

Could anyone run this experiment on their Ray cluster, or just repeat the test some other way from their own range of IP addresses?
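
For anyone who wants to repeat the measurement without the notebook, here is a minimal single-machine sketch. It is not the code from the linked notebook; the names `fetch_status` and `measure_throughput` and the `User-Agent` string are made up for illustration. It fetches each URL once, sequentially, and counts how many responses were 200 versus 429, so the successful rate per second can be compared across IP ranges.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code of a GET request to `url` (hypothetical helper)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "scraping-speed-test"}  # placeholder value
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status (e.g. 429) is on the exception.
        return err.code


def measure_throughput(urls, fetch=fetch_status, clock=time.perf_counter):
    """Fetch every URL once, in order, and summarize status codes and speed."""
    statuses = Counter()
    start = clock()
    for url in urls:
        statuses[fetch(url)] += 1
    elapsed = clock() - start
    ok = statuses[200]
    return {
        "requests": len(urls),
        "ok": ok,
        "rate_limited": statuses[429],  # HTTP 429 Too Many Requests
        "elapsed_s": elapsed,
        "ok_per_s": ok / elapsed if elapsed > 0 else 0.0,
    }
```

Calling `measure_throughput` with a list of repo homepage URLs returns a dict whose `ok_per_s` and `rate_limited` fields are the two numbers discussed above. The `fetch` and `clock` parameters exist only so the function can be exercised without network access; connection errors other than HTTP error responses are not handled in this sketch.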