Open moptis opened 1 year ago
Hello, I'm trying to download a 2-year timeseries of historical wind speed forecasts from RAP. Every time I get through about 4 months of hourly data, the process is killed (see below). Are there download thresholds in place that I can adjust?
That's interesting. Herbie doesn't limit downloads, but Amazon might impose limits I'm not aware of, or your server is timing out. You might need to download the data in chunks. If you are downloading lots of data, you might have better luck with a different tool like rclone.
If you go that route, I have a tutorial here: https://github.com/blaylockbk/pyBKB_v3/blob/master/rclone_howto.md
You could also be running out of memory if all the files are being loaded as they are downloaded. The "Killed" in Python output is a telltale sign of memory starvation.
@blaylockbk thanks I'll check it out!
@coliveir-aer Great catch! I'm slowly watching my RAM decrease as they download!
@blaylockbk Is there a way to close out a Herbie object?
This is great feedback.
Can you share more about how you are doing the download? I'm not sure where the memory is going when creating Herbie objects. Actually, now that I think about it, Herbie does cache the inventory DataFrame with each object if it is used.
If you are downloading full files, I don't think the inventory dataframe is cached
from herbie import Herbie
H = Herbie('2022-01-01', model='rap')
H.download()
But if you are subsetting the file, then the inventory file is loaded and cached.
from herbie import Herbie
H = Herbie('2022-01-01', model='rap')
H.download("TMP:2 m")
If that cached inventory file is where your memory is going, then perhaps you need to delete the cached property
del H.index_as_dataframe
An H.close() method might be a useful feature to free up memory and uncache everything when downloading lots of files.
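In the meantime, a rough workaround (just a sketch, assuming the inventory is cached on the object under the index_as_dataframe attribute as described above) is to clear that cache and drop each Herbie object inside a long download loop:

import pandas as pd
from herbie import Herbie

dates = pd.date_range("2022-01-01", "2022-01-02", freq="1H")

for date in dates:
    H = Herbie(date, model="rap")
    H.download("TMP:2 m")                    # subset download reads the inventory
    if "index_as_dataframe" in H.__dict__:
        del H.index_as_dataframe             # uncache the inventory DataFrame
    del H                                    # drop the object itself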
Thanks for the detailed response. I'm grabbing and storing timeseries data from a single lat/lon like this:
# (method from a larger class; requires: import os, import pandas as pd, from herbie import Herbie)
def download_data(
    self,
    coords=(54.307, -130.914),
    save_dir="/home/ec2-user/rap",
    start_date='2021-03-08',
    end_date='2023-03-08',
    freq='1H',
    horizon=6,
    fields='10m wind speed',
):
    '''
    Download a timeseries of 10 m wind components at a point and save it to CSV.

    Args:
        coords(:obj:'tuple'): Lat/lon coordinates of target
        save_dir(:obj:'str'): Location to save timeseries data
        start_date(:obj:'str'): First date to download
        end_date(:obj:'str'): End date to download
        freq(:obj:'str'): Frequency of download data in datetime format
        horizon(:obj:'int'): Forecast horizon to download
        fields(:obj:'str'): Field to download (**NOT YET IMPLEMENTED)

    Returns:
        None
    '''
    date_range = pd.date_range(start_date, end_date, freq=freq)
    df = pd.DataFrame(index=date_range)
    for i, d in enumerate(date_range):
        if i % 50 == 0:
            df.to_csv(os.path.join(save_dir, "%s_%s_%s.csv" % (self.product, coords[0], coords[1])))
        try:
            H = Herbie(str(d), model=self.product, product="awp130pgrb", fxx=horizon)
            ds = H.xarray(searchString=":(?:U|V)GRD:10 m above ground")
            u = ds.herbie.nearest_points(points=(coords[1], coords[0]))['u10'].values[0]
            v = ds.herbie.nearest_points(points=(coords[1], coords[0]))['v10'].values[0]
            print(u, v)
            df.loc[d, 'u_10m'] = u
            df.loc[d, 'v_10m'] = v
        except Exception:
            print("missing data for %s" % str(d))
I can't delete the cached property using the code you provided, but even a simple "del H" doesn't solve the memory issue.
If you are doing a loop to get the data, you could print with

import os
import psutil

process = psutil.Process(os.getpid())
print(f"Memory is {process.memory_info().rss / 10 ** 6} MB")

and see if your memory is increasing (potentially towards your limit).
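For example (a sketch that slots into your download loop; the log_memory helper name is just for illustration):

import os
import psutil

def log_memory(label=""):
    """Print the resident memory of the current process in MB."""
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 10**6
    print(f"{label} memory: {rss_mb:.1f} MB")

for i, d in enumerate(date_range):
    if i % 50 == 0:
        log_memory(label=str(d))
    # ... download and extract values as before ...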
@moptis
if i % 50 == 0:
df.to_csv(os.path.join(save_dir, "%s_%s_%s.csv" % (self.product, coords[0], coords[1],)))
Are you intending to chunk the data into separate CSV files? If not, you'll need mode='a'. If so, then you'll have to parameterize the filename with the index, or something similar.
Also, the loop just keeps adding to the same DataFrame instance. It is never cleared after being written to the file (either via .drop or by creating a new DataFrame). That's most likely why the process is running out of memory.
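For example, a minimal sketch of that change (reusing the variable names from the download_data method above), appending each chunk of rows to one CSV and clearing the in-memory DataFrame after every write:

df = pd.DataFrame()  # start empty; rows are added as they are downloaded

for i, d in enumerate(date_range):
    try:
        H = Herbie(str(d), model=self.product, product='awp130pgrb', fxx=horizon)
        ds = H.xarray(searchString=':(?:U|V)GRD:10 m above ground')
        df.loc[d, 'u_10m'] = ds.herbie.nearest_points(points=(coords[1], coords[0]))['u10'].values[0]
        df.loc[d, 'v_10m'] = ds.herbie.nearest_points(points=(coords[1], coords[0]))['v10'].values[0]
    except Exception:
        print('missing data for %s' % str(d))

    if (i + 1) % 50 == 0:
        out_path = os.path.join(save_dir, '%s_%s_%s.csv' % (self.product, coords[0], coords[1]))
        # append the rows collected so far; write the header only for the first chunk
        df.to_csv(out_path, mode='a', header=(i + 1 == 50))
        df = pd.DataFrame()  # drop rows already written to disk so memory stays flat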
@moptis Did you find a good solution here? I'm trying to do something similar with GEFS and (maybe) HRRR.
I'm using FastHerbie() and getting errors like:
Exception has occured : HTTPSConnectionPool(host='noaa-gefs-pds.s3.amazonaws.com', port=443): Read timed out. (read timeout=None)
and
Exception has occured : ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
Exception has occured : HTTPSConnectionPool(host='noaa-gefs-pds.s3.amazonaws.com', port=443): Max retries exceeded with url: /gefs.20210220/06/atmos/pgrb2ap5/gep23.t06z.pgrb2a.0p50.f039.idx (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')))
Which I think is a different issue.
I had similar memory issues where I would be using the .xarray method from FastHerbie for multiple days + forecast hours in a loop. I'd watch the memory on my machine slowly go to zero despite variables being overwritten. I haven't done a deep look into where in FastHerbie this is occurring, but I set up a loop with a garbage collector that seems to have resolved the issue:
import gc
import itertools

import pandas as pd
from herbie import FastHerbie

# df: a pre-existing DataFrame (defined elsewhere) with 'datetime', 'latitude', and 'longitude' columns


# convert lists of lats and lons into a DataFrame that pick_points() can use
def generate_points(latitude, longitude):
    if type(latitude) is not list:
        latitude = [latitude]
    if type(longitude) is not list:
        longitude = [longitude]
    points = pd.DataFrame(
        {
            "longitude": longitude,
            "latitude": latitude,
        }
    )
    return points


# create a 00z/12z date range object
hrrr_date_range = pd.date_range(df['datetime'].min(), df['datetime'].max(), freq='12H')

# pull relevant variables
variables_wgrib2 = (
    ":TMP:2 m|GRD:10 m|DSWRF"
)

points = generate_points(list(df['latitude'].unique()), list(df['longitude'].unique()))

large_hrrr_df = pd.DataFrame()
for date in hrrr_date_range:
    print(date)
    fh_hrrr_object = FastHerbie([date], fxx=range(0, 48), model='hrrr')
    hrrrs = [
        H.xarray(variables_wgrib2, remove_grib=False, max_threads=48)
        for H in fh_hrrr_object.file_exists
    ]
    ds_hrrrs = list(itertools.chain(*hrrrs))
    hrrr_df = pd.concat(
        [n.herbie.pick_points(points).to_dataframe() for n in ds_hrrrs]
    )
    hrrr_df = hrrr_df.groupby(['latitude', 'longitude', 'step']).max().reset_index()
    large_hrrr_df = pd.concat([large_hrrr_df, hrrr_df])
    gc.collect()
In my experience with xarray and Python, memory leaks are common, and sometimes gc doesn't resolve them, so it's nice to see there's at least one way to bandage the situation.
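One more mitigation that may be worth trying (an assumption on my part, not verified against FastHerbie internals) is to explicitly close each dataset once its values have been extracted, before calling gc.collect():

# inside the loop above, after building hrrr_df from ds_hrrrs
for ds in ds_hrrrs:
    ds.close()            # release file handles and lazily loaded data held by xarray
del hrrrs, ds_hrrrs
gc.collect()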
Thanks for sharing this. Yes, FastHerbie could use some improvements.