Open nickrsan opened 7 years ago
Wrote and launched a crawler that spits out CSVs by year for the listed site:
https://gist.github.com/detrout/828c73cb47cc9998da95d01d69d275b5
Given that you only get one row per HTTP request, I'm not sure how long it's going to take to download all ~200 million rows.
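For a rough sense of scale, here is a back-of-the-envelope sketch of the download time when every request returns a single row. The ~200 million row count is the figure quoted above; the request rates are assumptions picked purely for illustration, not measurements from the crawler.

```python
def estimated_days(total_rows, requests_per_second):
    """Days needed when each HTTP request returns exactly one row."""
    seconds = total_rows / requests_per_second
    return seconds / 86_400  # seconds in a day


# ~200 million rows is from the thread; both rates below are assumed.
for rate in (5, 50):
    print(f"{rate} req/s -> {estimated_days(200_000_000, rate):.0f} days")
```

Under these assumed rates the job takes roughly 463 days at 5 requests per second and roughly 46 days at 50, so sustained throughput matters far more than anything else in the crawler.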
Cool! Maybe you could partition the list and run multiple processes?
I'm using grequests with an async pool to make up to 50 simultaneous requests. I'm not sure I want to stress my server out with more than that.
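The idea of capping the number of requests in flight can be sketched with the standard library alone. This is not the grequests code from the gist: it swaps in a `ThreadPoolExecutor` so it runs without extra packages, and `fetch_all`, `fake_fetch`, and the `example.invalid` URLs are hypothetical names made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_in_flight=50):
    """Call fetch(url) for every URL with at most max_in_flight running at once.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real HTTP GET so the sketch runs offline.
    return f"row for {url}"


urls = [f"https://example.invalid/row/{i}" for i in range(200)]
rows = fetch_all(urls, fake_fetch)
print(len(rows))
```

`Executor.map` queues every task up front, so for a list in the hundreds of millions it would probably be better to feed it in batches rather than all at once.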
You're way ahead of me. I should have read the gist more closely :)
The gist version of the program had a bug and generated a number of warnings.
I made a new, stronger version (https://github.com/detrout/climate-scrapers) that 1) uses more generators and fewer lists and 2) lets you specify the years to start downloading, which makes it easier to parallelize.
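A minimal sketch of how those two ideas could fit together, not taken from the climate-scrapers repository: one generator splits a span of years into contiguous chunks (one per process), and another lazily yields the work items for a chunk instead of building a list. The names `year_chunks` and `requests_for`, the 1984 to 2010 span, and the one-request-per-(year, month) unit are all assumptions for illustration.

```python
def year_chunks(first_year, last_year, workers):
    """Yield one contiguous (start, end) year range per worker, inclusive."""
    years = last_year - first_year + 1
    size, extra = divmod(years, workers)
    start = first_year
    for i in range(workers):
        # The first `extra` workers take one additional year each.
        length = size + (1 if i < extra else 0)
        if length == 0:
            break  # more workers than years
        yield start, start + length - 1
        start += length


def requests_for(start_year, end_year):
    """Lazily yield (year, month) pairs so no full list is held in memory."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield year, month


for start, end in year_chunks(1984, 2010, 4):
    print(start, end, sum(1 for _ in requests_for(start, end)))
```

Each (start, end) pair could then be handed to a separate process as its start and stop years, so the processes never overlap.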
Currently I have 84M of 1984 data (possibly incomplete) and 11M of 2010 data.
@detrout I just wanted to check in on the status of your copy to see how it's going. :)
Name: Climatological Rankings
Organization: NOAA
Description URL: https://www.ncdc.noaa.gov/temp-and-precip/climatological-rankings/
Download URL: https://www.ncdc.noaa.gov/temp-and-precip/climatological-rankings/download.csv?parameter=tavg&state=110&div=0&month=11&periods[]=1&year=2013
File Types:
Size:
Status:
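Since the download URL carries its selection in query parameters, a crawler can derive further URLs by swapping those values. The sketch below only manipulates the example URL given above with `urllib.parse`; the helper name `with_params` is hypothetical, and whether the server accepts any particular combination of values (for example `year=1984`) is an assumption that would need checking against the site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

EXAMPLE = (
    "https://www.ncdc.noaa.gov/temp-and-precip/climatological-rankings/"
    "download.csv?parameter=tavg&state=110&div=0&month=11&periods[]=1&year=2013"
)


def with_params(url, **overrides):
    """Return url with the selected query parameters replaced."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # values are lists, e.g. {"year": ["2013"]}
    for key, value in overrides.items():
        params[key] = [str(value)]
    # safe="[]" keeps the literal "periods[]" key instead of percent-encoding it.
    query = urlencode(params, doseq=True, safe="[]")
    return urlunsplit(parts._replace(query=query))


print(with_params(EXAMPLE, year=1984, month=1))
```

Calling `with_params(EXAMPLE)` with no overrides returns the original URL unchanged, which is a quick way to confirm that the round trip does not mangle the query string.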