awslabs / open-data-docs

Additional documentation for publicly available datasets on AWS

Update README: Naming convention change starting 2016/06/02 #20

Closed msteckle closed 1 year ago

msteckle commented 1 year ago

While parsing object keys in the AWS bucket noaa-nexrad-level2, I found that file names on and after 2016/06/02 no longer end in .gz. The README says that all file names end in .gz, so this is an inconsistency worth documenting. The README could also mention that the bucket contains .tar files as well, not just .gz files, and that they follow a different naming structure.

Below is an example of my script running into one of those undocumented .tar files. If I filter on the .gz suffix instead, I hit the 2016/06/02 issue: file names from that date onward are never matched.

My script:

# Find file names that fit a time range requirement.
# Assumes `bucket` is a boto3 S3 Bucket resource for noaa-nexrad-level2
# and that numpy is imported as np.
def find_station_data(rng, date, station, torn_id):

    # Key prefix: YYYY/MM/DD/STATION/
    filtr = f'{date.year}/{date.month:02d}/{date.day:02d}/{station}/'
    # Event time as an HHMMSS integer
    etime = int(f'{date.hour:02d}{date.minute:02d}{date.second:02d}')
    files = []

    for obj in bucket.objects.filter(Prefix=filtr):
        name = str(obj.key)
        # if name.endswith('.gz'): <- moot after 2016/06/02
        # Characters 29-34 of the key are expected to hold HHMMSS
        time = int(name[29:35])
        stime = etime - rng
        if stime <= time <= etime:
            files.append(name)
        print(name, end='\r')

    #print(date, end='\r')

    if not files:
        return np.nan, torn_id
    else:
        return files[-1], torn_id # get most recent data file name

My error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/l1/4qdwx4895x16389z9fjd6y1r0000gn/T/ipykernel_43173/110904081.py in <module>
      1 # make a df of station file names and tornado ID
----> 2 radar_files = torns_oi.apply(lambda x: find_station_data(360,
      3                                                          x['mo_dy_yr_time'],
      4                                                          x['siteID'],
      5                                                          x['torn_id']),

~/Anaconda/anaconda3/envs/stormclass/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
   9553             kwargs=kwargs,
   9554         )
-> 9555         return op.apply().__finalize__(self, method="apply")
   9556 
   9557     def applymap(

~/Anaconda/anaconda3/envs/stormclass/lib/python3.9/site-packages/pandas/core/apply.py in apply(self)
    744             return self.apply_raw()
    745 
--> 746         return self.apply_standard()
    747 
    748     def agg(self):

~/Anaconda/anaconda3/envs/stormclass/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
    871 
    872     def apply_standard(self):
--> 873         results, res_index = self.apply_series_generator()
    874 
    875         # wrap results

~/Anaconda/anaconda3/envs/stormclass/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
    887             for i, v in enumerate(series_gen):
    888                 # ignore SettingWithCopy here in case the user mutates
--> 889                 results[i] = self.f(v)
    890                 if isinstance(results[i], ABCSeries):
    891                     # If we have a view on v, we need to make a copy because

/var/folders/l1/4qdwx4895x16389z9fjd6y1r0000gn/T/ipykernel_43173/110904081.py in <lambda>(x)
      1 # make a df of station file names and tornado ID
----> 2 radar_files = torns_oi.apply(lambda x: find_station_data(360,
      3                                                          x['mo_dy_yr_time'],
      4                                                          x['siteID'],
      5                                                          x['torn_id']),

/var/folders/l1/4qdwx4895x16389z9fjd6y1r0000gn/T/ipykernel_43173/127108330.py in find_station_data(rng, date, station, torn_id)
      7     for obj in bucket.objects.filter(Prefix=filtr):
      8         name = str(obj.key)
----> 9         time = int(name[29:35])
     10         stime = etime - rng
     11         if stime <= time <= etime:

ValueError: invalid literal for int() with base 10: 'L2LG_K'
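
For anyone hitting the same ValueError: the slice name[29:35] only works for keys shaped like YYYY/MM/DD/SSSS/SSSSYYYYMMDD_HHMMSS..., so a .tar key with a different layout yields text such as 'L2LG_K' instead of digits. Separately, subtracting rng from an HHMMSS integer is not a subtraction in seconds (120000 - 360 gives 119640, which is not a valid time). Below is a minimal sketch of one possible workaround, not the README's or the maintainers' method. It matches the file name against a pattern and skips anything that does not fit, with or without .gz, and it compares real datetimes. The names KEY_RE, parse_scan_time, and latest_key_in_window are hypothetical, and the sample keys are illustrative strings built to follow the naming discussed in this issue, not keys copied from the bucket.

```python
import re
from datetime import datetime, timedelta

# Assumed file-name pattern: <4-char station><YYYYMMDD>_<HHMMSS>, optionally
# followed by a suffix such as _V06 and/or .gz. Anything else (for example
# the .tar files) does not match and is skipped.
KEY_RE = re.compile(r'^(?P<station>[A-Z0-9]{4})(?P<stamp>\d{8}_\d{6})')


def parse_scan_time(key):
    """Return the datetime encoded in a key, or None if the key does not match."""
    basename = key.rsplit('/', 1)[-1]
    match = KEY_RE.match(basename)
    if match is None:
        return None
    return datetime.strptime(match.group('stamp'), '%Y%m%d_%H%M%S')


def latest_key_in_window(keys, end, window_seconds):
    """Return the most recent key whose time lies in [end - window, end], else None."""
    start = end - timedelta(seconds=window_seconds)
    hits = []
    for key in keys:
        scan_time = parse_scan_time(key)
        if scan_time is not None and start <= scan_time <= end:
            hits.append((scan_time, key))
    return max(hits)[1] if hits else None


# Illustrative keys only (hypothetical, not listed from the bucket).
sample_keys = [
    '2016/06/01/KABR/KABR20160601_235500_V06.gz',
    '2016/06/02/KABR/KABR20160602_000512_V06',
    '2016/06/02/KABR/KABR20160602_001130_V06',
    '2016/06/02/KABR/NWS_NEXRAD_NXL2LG_KABR_20160602000000_20160602005959.tar',
]

print(latest_key_in_window(sample_keys, datetime(2016, 6, 2, 0, 10, 0), 360))
```

In the original script, the keys would come from bucket.objects.filter(Prefix=filtr). A window that reaches back past midnight would also need the previous day's prefix listed, since each prefix covers a single date.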

msteckle commented 1 year ago

The README in question: open-data-docs/docs/noaa/noaa-nexrad/README.md

Patrick-Keown commented 1 year ago

Thank you for pointing this out. I will update the README to make the data easier to access.

cstner commented 1 year ago

README updated by Patrick: https://github.com/awslabs/open-data-docs/pull/21

msteckle commented 1 year ago

Thank you!