climate-mirror / datasets

For tracking data mirroring progress

Climatological Rankings #22

Open nickrsan opened 7 years ago

nickrsan commented 7 years ago

Name: Climatological Rankings
Organization: NOAA
Description URL: https://www.ncdc.noaa.gov/temp-and-precip/climatological-rankings/
Download URL: https://www.ncdc.noaa.gov/temp-and-precip/climatological-rankings/download.csv?parameter=tavg&state=110&div=0&month=11&periods[]=1&year=2013
File Types:
Size:
Status:

detrout commented 7 years ago

Wrote and launched a crawler that spits out CSVs by year for the listed site:

https://gist.github.com/detrout/828c73cb47cc9998da95d01d69d275b5

Given that you only get one row per HTTP request, I'm not sure how long it's going to take to fetch all ~200 million rows.
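For readers following along, here is a minimal sketch of how one such request URL could be assembled from the query parameters visible in the Download URL above. The function name `build_url` and the `BASE_URL` constant are hypothetical, and the thread does not document what each parameter means or which values are valid, so the values below are simply the ones from the issue.

```python
from urllib.parse import urlencode

# Endpoint copied from the Download URL in the issue description.
BASE_URL = (
    "https://www.ncdc.noaa.gov/temp-and-precip/"
    "climatological-rankings/download.csv"
)


def build_url(parameter, state, div, month, period, year):
    """Return the download URL for one combination of query parameters.

    Assumption: each combination yields one row, as reported in the thread.
    """
    query = {
        "parameter": parameter,
        "state": state,
        "div": div,
        "month": month,
        "periods[]": period,
        "year": year,
    }
    # safe="[]" keeps the brackets in "periods[]" unescaped, matching the
    # URL as it appears in the issue.
    return BASE_URL + "?" + urlencode(query, safe="[]")


url = build_url("tavg", 110, 0, 11, 1, 2013)
print(url)
```

With the values from the issue, this reproduces the Download URL listed in the description.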

bkirkbri commented 7 years ago

Cool! Maybe you could partition the list and run multiple processes?

detrout commented 7 years ago

I'm using grequests with an async pool to make up to 50 simultaneous requests. I'm not sure I want to stress my server out with more than that.
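As an illustration of capping the number of in-flight requests, here is a rough sketch that uses the standard library's thread pool rather than grequests, so it is not the crawler's actual code. The names `fetch`, `fetch_all`, `batched`, and the batch size of 1000 are assumptions made for this example; only the limit of 50 comes from the comment above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from urllib.request import urlopen

MAX_IN_FLIGHT = 50  # the cap mentioned in the thread


def fetch(url, timeout=30):
    """Download one URL and return the response body as bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def batched(iterable, size):
    """Yield lists of at most `size` items, so a huge URL stream stays lazy."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def fetch_all(urls, fetch_one=fetch, max_workers=MAX_IN_FLIGHT, batch_size=1000):
    """Yield results in input order with at most `max_workers` concurrent calls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in batched(urls, batch_size):
            # pool.map submits the whole chunk at once, which is why the
            # input is fed in bounded batches instead of all at once.
            yield from pool.map(fetch_one, chunk)
```

The pool size bounds how many requests run at the same time, and `fetch_one` can be swapped for a stub when testing without network access. As a purely illustrative figure, if each request took one second, 50 concurrent requests would cover 200 million rows in about 4 million seconds, or roughly 46 days.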

bkirkbri commented 7 years ago

You're way ahead of me. I should have read the gist more closely :)

detrout commented 7 years ago

The gist version of the program had a bug and generated a number of warnings.

I made a new, more robust version, https://github.com/detrout/climate-scrapers, that 1) uses more generators and fewer lists and 2) lets you specify which years to start downloading, to make it easier to parallelize.

Currently I have 84M of 1984 data (possibly incomplete) and 11M of 2010 data.
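To make the two ideas above concrete, here is a small sketch of splitting a range of years across several processes and producing the per-year work lazily with a generator. The function names `partition_years` and `iter_jobs`, the round-robin split, and the per-month job shape are all assumptions for illustration; they are not taken from the climate-scrapers repository.

```python
from itertools import product


def partition_years(start, stop, workers):
    """Split the inclusive range start..stop into `workers` round-robin slices."""
    years = range(start, stop + 1)
    return [list(years[i::workers]) for i in range(workers)]


def iter_jobs(years, months=range(1, 13)):
    """Lazily yield (year, month) pairs instead of building one large list."""
    for year, month in product(years, months):
        yield year, month


slices = partition_years(1984, 1991, 3)
print(slices)
first_jobs = list(iter_jobs(slices[0]))[:2]
print(first_jobs)
```

Each slice could then be handed to a separate process, which walks its own `iter_jobs` stream without holding every pending request in memory.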

siennathesane commented 7 years ago

@detrout I just wanted to check in on the status of your copy to see how it's going. :)