awslabs / open-data-registry

A registry of publicly available datasets on AWS
https://registry.opendata.aws
Apache License 2.0

GEFS Re-forecast files missing #1994

Open chiaral opened 9 months ago

chiaral commented 9 months ago

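This is not necessarily an exhaustive list of missing files, but 2004102400/p01/Days:1-10 is missing a lot of files. The idx files are there, but not the actual grib files. Here (https://noaa-gefs-retrospective.s3.amazonaws.com/index.html#GEFSv12/reforecast/2004/2004102400/p01/Days:1-10/) I have 76 items; here (https://noaa-gefs-retrospective.s3.amazonaws.com/index.html#GEFSv12/reforecast/2004/2004102400/c00/Days:1-10/) instead I have 122.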

Thanks!
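
A quick way to spot this pattern in a folder listing is to look for index files whose GRIB counterpart is absent. The sketch below is only an illustration: it assumes each index file is named after its GRIB file with an extra `.idx` suffix (the thread does not state the naming), and `find_orphan_idx` and the sample keys are hypothetical.

```python
def find_orphan_idx(keys, idx_suffix=".idx"):
    """Return GRIB names that have an index file but no data file.

    Assumption: an index file is named '<grib name>' + idx_suffix.
    """
    present = set(keys)
    missing = []
    for key in sorted(present):
        if key.endswith(idx_suffix):
            grib_name = key[: -len(idx_suffix)]
            if grib_name not in present:
                missing.append(grib_name)
    return missing


# Hypothetical listing of one Days:1-10 folder
sample = [
    "apcp_sfc_2004102400_p01.grib2.idx",
    "tmp_2m_2004102400_p01.grib2",
    "tmp_2m_2004102400_p01.grib2.idx",
]
print(find_orphan_idx(sample))
```

The same function can be fed the key names returned by any S3 listing tool, one folder at a time.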

Patrick-Keown commented 9 months ago

Hi Chiara,

We are currently looking into this.

chiaral commented 9 months ago

Adding more as I go through the data. I found two other issues:

The easier one: in s3://noaa-gefs-retrospective/GEFSv12/reforecast/2004/2004101700/p04/Days:1-10/ the apcp file is missing; only the idx file is there.

But the true easter egg 🤣 is the following one:

s3://noaa-gefs-retrospective/GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/ugrd_hgt_2006033000_c00.grib2 and s3://noaa-gefs-retrospective/GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/vgrd_hgt_2006033000_c00.grib2

have the wrong valid_time coordinates

!aws s3 cp s3://noaa-gefs-retrospective/GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/ugrd_hgt_2006033000_c00.grib2 ufromaws.grib2
!wgrib2 -v ufromaws.grib2
1:0:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:3 hour fcst:ENS=low-res ctl
2:806524:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:3 hour fcst:ENS=low-res ctl
3:1623430:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:6 hour fcst:ENS=low-res ctl
4:2428635:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:6 hour fcst:ENS=low-res ctl
5:3247497:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:9 hour fcst:ENS=low-res ctl
6:4058812:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:9 hour fcst:ENS=low-res ctl
7:4883679:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:12 hour fcst:ENS=low-res ctl
8:5705130:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:12 hour fcst:ENS=low-res ctl
9:6537556:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:15 hour fcst:ENS=low-res ctl
10:7364475:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:15 hour fcst:ENS=low-res ctl
11:8199647:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:18 hour fcst:ENS=low-res ctl
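
The mismatch can be checked mechanically by comparing the `d=` reference date in each inventory line with the date embedded in the filename. A minimal sketch, assuming inventory lines shaped like the output above and filenames of the form `<variable>_<YYYYMMDDHH>_<member>.grib2`; the helper names are made up for illustration.

```python
import re


def filename_init(filename):
    """Extract the 10-digit YYYYMMDDHH token from a reforecast filename."""
    match = re.search(r"_(\d{10})_", filename)
    return match.group(1) if match else None


def inventory_dates(inventory_text):
    """Collect the distinct d=YYYYMMDDHH values from wgrib2 inventory text."""
    return sorted(set(re.findall(r":d=(\d{10}):", inventory_text)))


# Two lines copied from the wgrib2 output in this comment
inventory = """\
1:0:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:3 hour fcst:ENS=low-res ctl
2:806524:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:3 hour fcst:ENS=low-res ctl
"""

expected = filename_init("ugrd_hgt_2006033000_c00.grib2")
found = inventory_dates(inventory)
# A healthy file would give found == [expected]
print(expected, found, found == [expected])
```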

Also with xarray/cfgrib

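import cfgrib  # provides the "cfgrib" engine used below
import xarray as xr

# Open only the 10 m u-wind messages
u = xr.open_dataset(
    'ufromaws.grib2',
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"shortName": "10u"}},
)
u.valid_time.values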
array(['2004-03-30T03:00:00.000000000', '2004-03-30T06:00:00.000000000',
       '2004-03-30T09:00:00.000000000', '2004-03-30T12:00:00.000000000',
       '2004-03-30T15:00:00.000000000', '2004-03-30T18:00:00.000000000',
       '2004-03-30T21:00:00.000000000', '2004-03-31T00:00:00.000000000',
       '2004-03-31T03:00:00.000000000', '2004-03-31T06:00:00.000000000',
       '2004-03-31T09:00:00.000000000', '2004-03-31T12:00:00.000000000',
       '2004-03-31T15:00:00.000000000', '2004-03-31T18:00:00.000000000',
       '2004-03-31T21:00:00.000000000', '2004-04-01T00:00:00.000000000',
       '2004-04-01T03:00:00.000000000', '2004-04-01T06:00:00.000000000',
       '2004-04-01T09:00:00.000000000', '2004-04-01T12:00:00.000000000',
       '2004-04-01T15:00:00.000000000', '2004-04-01T18:00:00.000000000',
       '2004-04-01T21:00:00.000000000', '2004-04-02T00:00:00.000000000',
       '2004-04-02T03:00:00.000000000', '2004-04-02T06:00:00.000000000',
       '2004-04-02T09:00:00.000000000', '2004-04-02T12:00:00.000000000',
       '2004-04-02T15:00:00.000000000', '2004-04-02T18:00:00.000000000',
       '2004-04-02T21:00:00.000000000', '2004-04-03T00:00:00.000000000',

Patrick-Keown commented 9 months ago

Thank you for the additional information. We have a scientist looking into this. Our team will reach back out once we have a resolution.

Thank you

chiaral commented 9 months ago

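Hello! I found something that is not missing but erroneous in the precipitation (it appears in both tp and acpcp) for one month so far. I have not done an exhaustive analysis; I bumped into this by pure luck.

In the following figures, each column is one of the 5 ensemble members ('c00' to 'p04') and each row is a 3-hourly interval counted from the start of the run (i.e. 00z).

For May 30th 2006, all good (this is precipitation truncated to 10 mm for the first 3-hourly steps: 0-3, 0-6, 6-9, and so on): [image] https://user-images.githubusercontent.com/8453445/269979189-51cd0be8-4b07-4ea4-aa42-d60ea2857d7c.png

For June 1st 2006 🙃: [image] https://user-images.githubusercontent.com/8453445/269979467-6f6ab94a-9d38-4c62-b8c7-aaf8b478f99b.png

For June 10th: [image] https://user-images.githubusercontent.com/8453445/269979522-c75f39c3-e720-4098-9d08-b6b00a4cae5b.png

Then July 1st goes back to normal: [image] https://user-images.githubusercontent.com/8453445/269979591-a993c284-4440-46af-86a8-72a54bdc9288.png

For the whole month of June 2006, tp and acpcp are off for the first 2 time steps (0-3 and 0-6). The issue, though, is only in the 0-3 step, because if I do 0-6 minus 0-3 I get: [image] https://user-images.githubusercontent.com/8453445/269980732-152f43db-cb57-4959-875b-e0c2366cb092.png

I have looked at a handful of other variables and they all seem ok, but in all honesty I have not looked at all of them. Also, I picked 2006-06 by chance, so I am not sure how pervasive this is. I will do a few more random checks, but maybe you are aware of this issue?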
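
The "0-6 minus 0-3" check generalizes to the whole run. The sketch below is only an illustration and assumes the accumulation windows keep alternating between 3 and 6 hours (0-3, 0-6, 6-9, 6-12, ...), which this comment states only for the first three steps; `three_hourly_amounts` is a hypothetical helper that works on a plain list of values for one grid point.

```python
def three_hourly_amounts(accumulated):
    """Turn alternating 3 h / 6 h accumulations into 3-hourly amounts.

    Assumed layout: even positions (0, 2, ...) hold 3 h totals and odd
    positions (1, 3, ...) hold 6 h totals that include the previous step.
    """
    amounts = []
    for i, value in enumerate(accumulated):
        if i % 2 == 1:
            # 6 h total minus the preceding 3 h total
            amounts.append(value - accumulated[i - 1])
        else:
            amounts.append(value)
    return amounts


# Made-up values for the windows 0-3, 0-6, 6-9, 6-12
print(three_hourly_amounts([1.0, 3.0, 2.0, 5.0]))
```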

Patrick-Keown commented 9 months ago

Hi Chiara,

Can you reach out to me at @.***? We can loop in the data scientist to address this.

Thanks,

Patrick-Keown commented 9 months ago

Thank you for bringing these data issues to our attention. We are working on fixing the issues you brought up on GitHub (https://github.com/awslabs/open-data-registry/issues/1994). My coworker is fixing and sending the data for 2004102400, 2004101700 and 2006033000 to AWS. I believe she has mostly completed this process, but I will confirm with her when she returns from vacation.

Meanwhile, we are verifying and sending this data to our FTP server (ftp://ftp.emc.ncep.noaa.gov/GEFSv12). Please note that this FTP data cannot be accessed through any modern internet browser, but it can be publicly accessed using tools such as the ftp command (e.g. ftp ftp.emc.ncep.noaa.gov).

  1. The missing data from 2004102400 is now available on this FTP: ftp://ftp.emc.ncep.noaa.gov/GEFSv12/reforecast/2004/10/24/
  2. We are working on fixing 20041017 and 2006033000.
  3. Regarding the erroneous precipitation values for f03 and f06, this is a known issue and a fix has been applied to most of the cases in the reforecast dataset. We are looking into June 2006 and will work on fixing this.

EricSinsky-NOAA commented 9 months ago

Hi Chiara,

We are continuing to fix the data issues that you have found.

  1. The missing data from 2004101700 is now available on the EMC FTP: ftp://ftp.emc.ncep.noaa.gov/GEFSv12/reforecast/2004/10/17/. I believe this data is also on AWS, but I will confirm with my coworker that she finished processing this data when she returns from vacation.
  2. Thanks to my coworker, the time coordinates for 2006033000 have been corrected for ugrd_hgt and vgrd_hgt. These can be found on AWS: https://noaa-gefs-retrospective.s3.amazonaws.com/index.html#GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/
  3. We are working on making those corrections to f03 and f06 for June 2006.

Thank you.

EricSinsky-NOAA commented 9 months ago

Hi Chiara,

For June 1 2006, it looks like the f03 and f06 data have already been fixed on the EMC FTP: ftp://ftp.emc.ncep.noaa.gov/GEFSv12/reforecast/2006/06/01/ If you see the same issue in the June 2006 data on the EMC FTP as well, please feel free to let me know. The fixes for June 2006 may not all have been carried over to AWS. We will work on bringing these f03 and f06 fixes to AWS.

Thank you.
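
For anyone who wants to script access to the EMC FTP mentioned in this thread, here is a rough sketch using Python's standard `ftplib`. It assumes anonymous login is accepted and that every date follows the `/GEFSv12/reforecast/YYYY/MM/DD/` layout seen in the links above; `ftp_dir_for` and `list_reforecast_dir` are names invented for illustration.

```python
from ftplib import FTP

FTP_HOST = "ftp.emc.ncep.noaa.gov"


def ftp_dir_for(init):
    """Map an init string such as '2004102400' to its FTP directory."""
    year, month, day = init[:4], init[4:6], init[6:8]
    return f"/GEFSv12/reforecast/{year}/{month}/{day}/"


def list_reforecast_dir(init, host=FTP_HOST):
    """Return the entry names in the FTP directory for one init date.

    Needs network access; assumes anonymous login is allowed.
    """
    with FTP(host) as ftp:
        ftp.login()  # no arguments means anonymous login
        ftp.cwd(ftp_dir_for(init))
        return ftp.nlst()


# Only the path helper is exercised here; listing requires a connection
print(ftp_dir_for("2004102400"))
```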

chiaral commented 9 months ago

Thanks for the update!! I only access them through aws so I will wait for that for sure!

chiaral commented 7 months ago

Hello! Adding a new small issue.

For some days (so far I have identified only one, 2001 11 15, for all ensemble members, i.e. this folder) the files appear twice, and one copy has a missing digit in the date of the filename:

(correct date 2001 11 15) acpcp_sfc_2001111500_p01.grib2 | 3 years ago | 2021-02-17 11:19:41 | 28 MB
(wrong date 2001 11 5) acpcp_sfc_200111500_p01.grib2 | 3 years ago | 2021-02-17 11:19:26 | 30 MB

The problem is that they differ in the first two time steps for some variables.

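This is for accumulated precip:

import cfgrib

# Same variable and member: correct vs. malformed date in the filename
one = cfgrib.open_dataset('acpcp_sfc_2001111500_c00.grib2')
two = cfgrib.open_dataset('acpcp_sfc_200111500_c00.grib2')

# Difference summed over the grid, one value per forecast step
(two.acpcp - one.acpcp).sum(dim=['latitude', 'longitude'])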

array([997719.06, 997308.06,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ], dtype=float32)

with

(one.isel(step=slice(0,2))-two.isel(step=slice(0,2))).acpcp.plot(col='step')

image

Surface pressure seems to be identical in both files, and helicity too.

I can't check them all, so I was wondering if you have any guidance. In the case of acpcp, the file with the wrong date is much wetter (probably the first step is problematic and the second one carries the value over in the accumulation). But I just thought I'd let you know.
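
One way to hunt for more of these duplicates is to flag filenames whose date token is not exactly 10 digits. A minimal sketch, assuming names of the form `<variable>_<date>_<member>.grib2` with members such as `c00` or `p01`; the helper name is invented for illustration.

```python
import re

# <variable>_<date digits>_<member>.grib2, e.g. acpcp_sfc_2001111500_p01.grib2
NAME_RE = re.compile(r"^(?P<var>.+)_(?P<date>\d+)_(?P<member>[cp]\d{2})\.grib2$")


def malformed_date_names(names):
    """Return names whose date token is not a 10-digit YYYYMMDDHH string."""
    bad = []
    for name in names:
        match = NAME_RE.match(name)
        if match and len(match.group("date")) != 10:
            bad.append(name)
    return bad


names = [
    "acpcp_sfc_2001111500_p01.grib2",
    "acpcp_sfc_200111500_p01.grib2",
]
print(malformed_date_names(names))
```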

EricSinsky-NOAA commented 7 months ago

@chiaral Thank you for bringing this to our attention. We are investigating 2001111500.

EricSinsky-NOAA commented 7 months ago

The f03 and f06 fixes for June 2006 have recently been sent to AWS.

EricSinsky-NOAA commented 7 months ago

The files with the incorrect date ("200111500") in the filename have been removed from AWS for 20011115. The corrected f03 and f06 data has also been sent to AWS. Many thanks to my co-worker for managing the data on AWS.

chiaral commented 5 months ago

Hello!

The wrong valid_time (2004 instead of 2006) that I had identified for ugrd_hgt_2006033000_c00 and vgrd_hgt_2006033000_c00 also shows up in cape_sfc and spfh_2m (same date and ensemble member).

EricSinsky-NOAA commented 5 months ago

Hi @chiaral, we are working on correcting the valid_time for cape_sfc and spfh_2m.

chiaral commented 5 months ago

(EDITED) After more hiccups here and there, I realized that all the other ensemble members, and not just c00, have the same issue of using the wrong year (2004 instead of 2006) that I found for ugrd_hgt, vgrd_hgt, cape_sfc, and spfh_2m. I also found that the u/vgrd_pres_abv700mb_2006033000 files have it. So I'd probably check other variables as well.

EricSinsky-NOAA commented 5 months ago

@chiaral Thank you for bringing this to our attention. We are investigating and correcting the incorrect valid times for 2006033000.

EricSinsky-NOAA commented 5 months ago

@chiaral The issue regarding the incorrect valid times in the 2006033000 grib2 files has been resolved. After further investigation, we found that this issue occurred because 2004033000 data was being mislabeled as "2006033000" in the grib2 filename for days 1-10. The correct 2006033000 data is now being used in the 2006033000 grib2 files. The actual 2006033000 data, however, contains incomplete records in the "abv" files for days 1-10. Unfortunately, we are unable to recover this missing 2006033000 data in the "abv" files for days 1-10.

chiaral commented 5 months ago

Thanks. Just to understand better: should I update only the 20060330 data, or should I also refresh the 20040330 data? It's unclear to me. And is this being propagated to AWS, or only to the FTP? It's ok about the missing data. Thanks.

EricSinsky-NOAA commented 5 months ago

@chiaral The changes that were explained in my previous message have been propagated to AWS. You should update the 2006033000 data only. Previously, the data labelled as "2006033000" in the filename was actually 2004033000 data for days 1-10. There is no need to update the 2004033000 data.

chiaral commented 5 months ago

Hello

I am now looking at the files after 2010. The file apcp_sfc_2012051700_c00 has two different start times (I think this is true for multiple variables, because my processing was failing across many of them). In particular:

import cfgrib

il = 'apcp_sfc_2012051700_c00.grib2'
# open_datasets returns a list of datasets; stepRange is added as an extra coordinate
dclist = cfgrib.open_datasets(
    il, backend_kwargs={"extra_coords": {"stepRange": "step"}}
)
dclist

[<xarray.Dataset>
 Dimensions:     (time: 2, step: 80, latitude: 721, longitude: 1440)
 Coordinates:
     number      int64 0
   * time        (time) datetime64[ns] 2008-05-17 2012-05-17
   * step        (step) timedelta64[ns] 0 days 03:00:00 ... 10 days 00:00:00
     surface     float64 0.0
   * latitude    (latitude) float64 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
   * longitude   (longitude) float64 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
     valid_time  (time, step) datetime64[ns] dask.array<chunksize=(2, 80), meta=np.ndarray>
     stepRange   (step) <U7 dask.array<chunksize=(80,), meta=np.ndarray>
 Data variables:
     tp          (time, step, latitude, longitude) float32 dask.array<chunksize=(2, 80, 721, 1440), meta=np.ndarray>
 Attributes:
     GRIB_edition:            2
     GRIB_centre:             kwbc
     GRIB_centreDescription:  US National Weather Service - NCEP
     GRIB_subCentre:          2
     Conventions:             CF-1.7
     institution:             US National Weather Service - NCEP]

The problem is

   * time        (time) datetime64[ns] 2008-05-17 2012-05-17

If your pipeline takes the date from the filename, it won't have issues; if it assumes that the file holds a single time value, it will break.
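
A pipeline can guard against this by deriving the expected initialization time from the filename and selecting it explicitly. The sketch below is an illustration, not a tested fix for these files: `expected_init_time` and `select_run` are hypothetical helpers, and `select_run` assumes an xarray Dataset with a `time` dimension like the one printed above.

```python
import re
from datetime import datetime


def expected_init_time(filename):
    """Parse the YYYYMMDDHH token of a reforecast filename into a datetime."""
    match = re.search(r"_(\d{10})_", filename)
    if match is None:
        raise ValueError(f"no 10-digit date in {filename!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H")


def select_run(ds, filename):
    """Keep only the initialization time named in the file.

    `ds` is assumed to be an xarray Dataset with a `time` dimension.
    """
    return ds.sel(time=expected_init_time(filename))


print(expected_init_time("apcp_sfc_2012051700_c00.grib2"))
```

With the dataset shown above, `select_run(dclist[0], il)` would be the intended call; it has not been run against the actual file.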

EricSinsky-NOAA commented 5 months ago

Hi @chiaral, Thank you for providing details regarding this issue. We are investigating.

EricSinsky-NOAA commented 4 months ago

Hi @chiaral, the issue that you found in 2012051700 has been corrected and sent to AWS.

chiaral commented 4 months ago

Fantastic! thanks so much for your work!