blaylockbk / Herbie

Download numerical weather prediction datasets (HRRR, RAP, GFS, IFS, etc.) from NOMADS, NODD partners (Amazon, Google, Microsoft), ECMWF open data, and the University of Utah Pando Archive System.
https://herbie.readthedocs.io/
MIT License

Download "killed" after about 4 months #175

Open moptis opened 1 year ago

moptis commented 1 year ago

Hello, I'm trying to download a 2 year timeseries of historical wind speed forecasts from RAP. Every time I get through about 4 months of hourly data, the process is killed (see below). Are there download thresholds put in place that I can adjust?

[screenshot: console output ending with the process being killed]

blaylockbk commented 1 year ago

That's interesting. Herbie doesn't limit downloads, but Amazon might impose limits I'm not aware of, or your server could be timing out. You might need to download the data in chunks. If you are downloading lots of data, you might have better luck with a different tool like rclone.

If you go that route, I have a tutorial here: https://github.com/blaylockbk/pyBKB_v3/blob/master/rclone_howto.md
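
If you want to stay in Herbie, one option is to break the request into smaller batches so each batch of objects can be released before the next one starts. A rough, untested sketch (the date range, model, and fxx are only placeholders for your own request):

import pandas as pd
from herbie import Herbie

dates = pd.date_range("2021-03-08", "2023-03-08", freq="1H")

# Work through the range in batches of ~1 month (720 hours) so each
# batch of Herbie objects can go out of scope before the next starts.
chunk_size = 720
for start in range(0, len(dates), chunk_size):
    for date in dates[start:start + chunk_size]:
        H = Herbie(date, model="rap", fxx=6)
        H.download()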

coliveir-aer commented 1 year ago

You could also be running out of memory if all the files are being loaded as they are downloaded. The "Killed" in Python output is a telltale sign of memory starvation.

moptis commented 1 year ago

@blaylockbk thanks I'll check it out!

@coliveir-aer Great catch! I'm slowly watching my RAM decrease as they download!

@blaylockbk Is there a way to close out a Herbie object?

blaylockbk commented 1 year ago

This is great feedback.

Can you share more details about how you are doing the download? I'm not sure where the memory is going when creating Herbie objects. Actually, now that I think about it, Herbie does cache the inventory DataFrame with each object if it is used.

If you are downloading full files, I don't think the inventory DataFrame is cached:

from herbie import Herbie
H = Herbie('2022-01-01', model='rap')
H.download()

But if you are subsetting the file, then the inventory file is loaded and cached.

from herbie import Herbie
H = Herbie('2022-01-01', model='rap')
H.download("TMP:2 m")

If that cached inventory DataFrame is where your memory is going, then perhaps you need to delete the cached property:

del H.index_as_dataframe

An H.close() method might be a useful feature to free up memory and uncache everything when downloading lots of files.
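
In the meantime, a possible workaround is to pop the cached entry after each file, assuming index_as_dataframe is a functools.cached_property (which stores its value in the instance's __dict__). A rough sketch:

import pandas as pd
from herbie import Herbie

for date in pd.date_range("2022-01-01", "2022-01-07", freq="1H"):
    H = Herbie(date, model="rap")
    H.download("TMP:2 m")
    # Drop the cached inventory DataFrame (if it was ever populated) so it
    # can be garbage-collected along with the Herbie object.
    H.__dict__.pop("index_as_dataframe", None)
    del H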

moptis commented 1 year ago

Thanks for the detailed response. I'm grabbing and storing timeseries data from a single lat/lon like this:

# (method on a download class; imports needed by this snippet shown here)
import os

import pandas as pd
from herbie import Herbie


def download_data(
    self,
    coords = (54.307, -130.914),
    save_dir = "/home/ec2-user/rap",
    start_date = '2021-03-08',
    end_date = '2023-03-08',
    freq = '1H',
    horizon = 6,
    fields = '10m wind speed'
):
        '''
        Download a timeseries of 10 m wind components at a point and save it to CSV

        Args:
          coords(:obj:'tuple'): Lat/lon coordinates of target
          save_dir(:obj:'string'): Location to save timeseries data
          start_date(:obj:'string'): First date to download
          end_date(:obj:'string'): End date to download
          freq(:obj:'str'): Frequency of download data in datetime format
          horizon(:obj:'int'): Forecast horizon to download
          fields(:obj:'string'): Field to download (**NOT YET IMPLEMENTED)

        Returns:
          None
        '''
        date_range = pd.date_range(start_date, end_date, freq = freq)

        df = pd.DataFrame(index = date_range)
        for i, d in enumerate(date_range):
            if i % 50 == 0:
                df.to_csv(os.path.join(save_dir, "%s_%s_%s.csv" % (self.product, coords[0], coords[1])))
            try:
                H = Herbie(str(d), model = self.product, product = "awp130pgrb", fxx = horizon)
                ds = H.xarray(searchString=":(?:U|V)GRD:10 m above ground")
                u = ds.herbie.nearest_points(points=(coords[1], coords[0]))['u10'].values[0]
                v = ds.herbie.nearest_points(points=(coords[1], coords[0]))['v10'].values[0]
                print(u, v)
                df.loc[d, 'u_10m'] = u
                df.loc[d, 'v_10m'] = v
            except Exception:
                print("missing data for %s" % str(d))

moptis commented 1 year ago

I can't delete the cached property based on the code you provided, and even a simple "del H" command doesn't solve the memory issue.

[screenshot]

peterdudfield commented 1 year ago

If you are doing a loop to get the data, you could print the process's memory usage with

import os
import psutil

process = psutil.Process(os.getpid())
print(f"Memory is {process.memory_info().rss / 10 ** 6} MB")

and see if your memory is increasing (potentially towards your limit).

rodinia814 commented 1 year ago

@moptis

if i % 50 == 0:
        df.to_csv(os.path.join(save_dir, "%s_%s_%s.csv" % (self.product, coords[0], coords[1],)))

Are you intending to chunk the data into separate csv files? If not, you'll need mode='a'. If so, then you'll have to parameterize the filename with the index, or something similar.

Also, the loop just keeps adding to the DataFrame instance. It never clears it after it has been written to the file (either via .drop or by creating a new DataFrame). That's most likely why the process is running out of memory.
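
Something along these lines inside your loop might work (a rough, untested sketch that reuses the variables from your function; the filename pattern and column names are just examples):

chunk_rows = []
for i, d in enumerate(date_range):
    try:
        H = Herbie(str(d), model=self.product, product="awp130pgrb", fxx=horizon)
        ds = H.xarray(searchString=":(?:U|V)GRD:10 m above ground")
        u = ds.herbie.nearest_points(points=(coords[1], coords[0]))['u10'].values[0]
        v = ds.herbie.nearest_points(points=(coords[1], coords[0]))['v10'].values[0]
        chunk_rows.append({"datetime": d, "u_10m": u, "v_10m": v})
    except Exception:
        print("missing data for %s" % str(d))
    if chunk_rows and (i + 1) % 50 == 0:
        fname = "%s_%s_%s_part%04d.csv" % (self.product, coords[0], coords[1], i // 50)
        pd.DataFrame(chunk_rows).to_csv(os.path.join(save_dir, fname), index=False)
        chunk_rows = []  # drop rows already written so memory stays flat
# (a final write after the loop would catch any remaining rows)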

williamhobbs commented 1 year ago

@moptis Did you find a good solution here? I'm trying to do something similar with GEFS and (maybe) HRRR.

I'm using FastHerbie() and getting errors like:

Exception has occured : HTTPSConnectionPool(host='noaa-gefs-pds.s3.amazonaws.com', port=443): Read timed out. (read timeout=None)

and

Exception has occured : ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
Exception has occured : HTTPSConnectionPool(host='noaa-gefs-pds.s3.amazonaws.com', port=443): Max retries exceeded with url: /gefs.20210220/06/atmos/pgrb2ap5/gep23.t06z.pgrb2a.0p50.f039.idx (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')))

Which I think is a different issue.

aaTman commented 4 months ago

I had similar memory issues when using the .xarray method from FastHerbie for multiple days and forecast hours in a loop. I'd watch the free memory on my machine slowly go to zero despite variables being overwritten. I haven't taken a deep look into where in FastHerbie this is occurring, but I set up a loop with explicit garbage collection that seems to have resolved the issue:


# imports needed by this snippet; `df` below is my own DataFrame with
# 'datetime', 'latitude', and 'longitude' columns for the target sites
import gc
import itertools

import pandas as pd
from herbie import FastHerbie

# convert lists of lats and lons into a DataFrame that pick_points() can use
def generate_points(latitude,longitude):
    if type(latitude) is not list:
        latitude = [latitude]
    if type(longitude) is not list:
        longitude = [longitude]
    points = pd.DataFrame(
        {
            "longitude": longitude,
            "latitude": latitude,
        }
    )
    return points

# create a 00z/12z date range object
hrrr_date_range = pd.date_range(df['datetime'].min(), df['datetime'].max(), freq='12H')

# pull relevant variables
variables_wgrib2 = (
    ":TMP:2 m|GRD:10 m|DSWRF" 
)

points = generate_points(list(df['latitude'].unique()),list(df['longitude'].unique()))

large_hrrr_df = pd.DataFrame()
for date in hrrr_date_range:
    print(date)
    fh_hrrr_object = FastHerbie([date], fxx=range(0, 48), model='hrrr')
    hrrrs = [
        H.xarray(variables_wgrib2, remove_grib=False, max_threads=48)
        for H in fh_hrrr_object.file_exists
    ]
    ds_hrrrs = list(itertools.chain(*hrrrs))
    hrrr_df = pd.concat(
        [n.herbie.pick_points(points).to_dataframe() for n in ds_hrrrs]
    )
    hrrr_df = hrrr_df.groupby(['latitude','longitude','step']).max().reset_index()
    large_hrrr_df = pd.concat([large_hrrr_df, hrrr_df])
    gc.collect()

Knowing xarray and Python, memory leaks are common, and sometimes gc doesn't resolve them, so it's nice to see there's at least one way here to bandage the situation.

blaylockbk commented 4 months ago

Thanks for sharing this. Yes, FastHerbie could use some improvements.