Open chiaral opened 1 year ago
Hi Chiara,
We are currently looking into this.
Patrick Keown
Program Manager, NOAA Open Data Dissemination (NODD)
Office of the Chief Information Officer (OCIO)
National Oceanic & Atmospheric Administration
(615) 319-5906 | @.***
"Be sure when you step, step with care and great tact" - Dr. Seuss
Schedule a Meeting with Me! https://calendar.app.google/BWJGjd9f9JRLRzwC9
On Wed, Sep 13, 2023 at 10:47 PM Chiara Lepore @.***> wrote:
This is not necessarily an exhaustive list of missing files, but 2004102400/p01/Days:1-10 is missing a lot of files. The idx files are there, but not the actual grib files. Here https://noaa-gefs-retrospective.s3.amazonaws.com/index.html#GEFSv12/reforecast/2004/2004102400/p01/Days:1-10/ I have 76 items; here https://noaa-gefs-retrospective.s3.amazonaws.com/index.html#GEFSv12/reforecast/2004/2004102400/c00/Days:1-10/ I have 122 instead.
Thanks!
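The check described above (index files present, GRIB files absent) can be run on any folder listing. Here is a minimal sketch, assuming each index file is named after its GRIB file with an `.idx` suffix appended; that naming, the function name `orphan_index_files`, and the sample listing are assumptions for illustration, not something confirmed in this thread.

```python
def orphan_index_files(keys, index_suffix=".idx"):
    """Return index keys whose companion data file is not in `keys`.

    Assumes (hypothetically) that the index for `x.grib2` is `x.grib2.idx`.
    """
    present = set(keys)
    return sorted(
        key
        for key in present
        if key.endswith(index_suffix) and key[: -len(index_suffix)] not in present
    )


# Made-up listing for one member folder; in practice the names would come
# from listing the S3 prefix.
listing = [
    "apcp_sfc_2004102400_p01.grib2",
    "apcp_sfc_2004102400_p01.grib2.idx",
    "cape_sfc_2004102400_p01.grib2.idx",  # data file missing
]
print(orphan_index_files(listing))
```

Running the same function on the c00 and p01 listings would show which variables account for a difference in item counts.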
Continuing to add more as I go through the data. I found 2 other issues.
The easier one: in
s3://noaa-gefs-retrospective/GEFSv12/reforecast/2004/2004101700/p04/Days:1-10/
we are missing the apcp file; we only have the idx file.
But the true easter egg 🤣 is the following one:
s3://noaa-gefs-retrospective/GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/ugrd_hgt_2006033000_c00.grib2
and
s3://noaa-gefs-retrospective/GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/vgrd_hgt_2006033000_c00.grib2
have the wrong valid_time coordinates:
!aws s3 cp s3://noaa-gefs-retrospective/GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/ugrd_hgt_2006033000_c00.grib2 ufromaws.grib2
!wgrib2 -v ufromaws.grib2
1:0:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:3 hour fcst:ENS=low-res ctl
2:806524:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:3 hour fcst:ENS=low-res ctl
3:1623430:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:6 hour fcst:ENS=low-res ctl
4:2428635:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:6 hour fcst:ENS=low-res ctl
5:3247497:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:9 hour fcst:ENS=low-res ctl
6:4058812:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:9 hour fcst:ENS=low-res ctl
7:4883679:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:12 hour fcst:ENS=low-res ctl
8:5705130:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:12 hour fcst:ENS=low-res ctl
9:6537556:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:15 hour fcst:ENS=low-res ctl
10:7364475:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:15 hour fcst:ENS=low-res ctl
11:8199647:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:18 hour fcst:ENS=low-res ctl
Also with xarray/cfgrib:
import cfgrib
import xarray as xr
# Open only the 10 m U-wind messages
u = xr.open_dataset(
    'ufromaws.grib2',
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"shortName": "10u"}},
)
u.valid_time.values
array(['2004-03-30T03:00:00.000000000', '2004-03-30T06:00:00.000000000',
'2004-03-30T09:00:00.000000000', '2004-03-30T12:00:00.000000000',
'2004-03-30T15:00:00.000000000', '2004-03-30T18:00:00.000000000',
'2004-03-30T21:00:00.000000000', '2004-03-31T00:00:00.000000000',
'2004-03-31T03:00:00.000000000', '2004-03-31T06:00:00.000000000',
'2004-03-31T09:00:00.000000000', '2004-03-31T12:00:00.000000000',
'2004-03-31T15:00:00.000000000', '2004-03-31T18:00:00.000000000',
'2004-03-31T21:00:00.000000000', '2004-04-01T00:00:00.000000000',
'2004-04-01T03:00:00.000000000', '2004-04-01T06:00:00.000000000',
'2004-04-01T09:00:00.000000000', '2004-04-01T12:00:00.000000000',
'2004-04-01T15:00:00.000000000', '2004-04-01T18:00:00.000000000',
'2004-04-01T21:00:00.000000000', '2004-04-02T00:00:00.000000000',
'2004-04-02T03:00:00.000000000', '2004-04-02T06:00:00.000000000',
'2004-04-02T09:00:00.000000000', '2004-04-02T12:00:00.000000000',
'2004-04-02T15:00:00.000000000', '2004-04-02T18:00:00.000000000',
'2004-04-02T21:00:00.000000000', '2004-04-03T00:00:00.000000000',
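The mismatch visible in the wgrib2 inventory above (reference date `d=2004033000` inside a file named for 2006033000) can be checked mechanically. A rough sketch, assuming the inventory lines have the colon-separated layout shown in this thread and that the filename carries a 10-digit YYYYMMDDHH token between underscores; the helper names are hypothetical.

```python
import re


def filename_cycle(filename):
    """Extract the 10-digit YYYYMMDDHH token from a name like
    ugrd_hgt_2006033000_c00.grib2."""
    match = re.search(r"(?<=_)\d{10}(?=_)", filename)
    if match is None:
        raise ValueError(f"no YYYYMMDDHH token in {filename!r}")
    return match.group(0)


def mismatched_records(filename, inventory_lines):
    """Return (record_number, reference_date) pairs whose d= field
    differs from the date in the filename."""
    expected = filename_cycle(filename)
    bad = []
    for line in inventory_lines:
        fields = line.split(":")
        # Layout assumed: record number, byte offset, d=<reference date>, ...
        if len(fields) < 3 or not fields[2].startswith("d="):
            continue
        found = fields[2][2:]
        if found != expected:
            bad.append((int(fields[0]), found))
    return bad


inventory = [
    "1:0:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:3 hour fcst:ENS=low-res ctl",
    "2:806524:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:3 hour fcst:ENS=low-res ctl",
]
print(mismatched_records("ugrd_hgt_2006033000_c00.grib2", inventory))
```

Feeding it the captured output of `wgrib2 -v` for each file would flag every record whose reference date disagrees with the filename.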
Thank you for the additional information. We have a scientist looking into this. Our team will reach back out once we have a resolution.
Thank you
Hello! I found something not missing but erroneous in the precipitation (it appears both in tp and acpcp) for one month so far. I have not done an exhaustive analysis, I bumped into this by pure luck.
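In the following figures I have the 5 ensemble members, one per column ('c00' to 'p04'); each row is a 3-hourly interval starting from the start of the run (i.e. 00z).
For May 30th 2006, all good (this is precipitation truncated to 10 mm for the first 3-hourly steps: 0-3, 0-6, 6-9, and so on):
[image: image] https://user-images.githubusercontent.com/8453445/269979189-51cd0be8-4b07-4ea4-aa42-d60ea2857d7c.png
For June 1st 2006 🙃
[image: image] https://user-images.githubusercontent.com/8453445/269979467-6f6ab94a-9d38-4c62-b8c7-aaf8b478f99b.png
For June 10th
[image: image] https://user-images.githubusercontent.com/8453445/269979522-c75f39c3-e720-4098-9d08-b6b00a4cae5b.png
then July 1st goes back to normal
[image: image] https://user-images.githubusercontent.com/8453445/269979591-a993c284-4440-46af-86a8-72a54bdc9288.png
For the whole month of June 2006, tp and acpcp are off for the first 2 time steps (0-3 and 0-6). The issue, though, is only in the 0-3 step, because if I do 0-6 minus 0-3 I get
[image: image] https://user-images.githubusercontent.com/8453445/269980732-152f43db-cb57-4959-875b-e0c2366cb092.png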
I have looked at a handful of other variables and they all seem ok, but in all honesty I have not looked at all of them. Also, I picked 2006-06 by chance, so I am not sure how pervasive this is. I will do a few more random checks, but maybe you are aware of this issue?
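For anyone repeating the "0-6 minus 0-3" check, here is a small sketch of the arithmetic on plain lists. The sample values are made up, and treating a negative 3-6 h amount as the warning sign is my own assumption; the thread does not say what the difference looked like.

```python
def window_3_to_6(acc_0_3, acc_0_6):
    """Precipitation in the 3-6 h window, from the 0-3 h and 0-6 h
    accumulations at the same grid points."""
    return [six - three for three, six in zip(acc_0_3, acc_0_6)]


def suspicious_points(acc_0_3, acc_0_6, tolerance=1e-6):
    """Indices where the 0-3 h total exceeds the 0-6 h total, i.e. the
    3-6 h amount would be negative (assumed criterion, not from the thread)."""
    diffs = window_3_to_6(acc_0_3, acc_0_6)
    return [i for i, value in enumerate(diffs) if value < -tolerance]


# Made-up accumulations in mm at three grid points, for illustration only.
acc_0_3 = [0.0, 1.5, 4.0]
acc_0_6 = [0.5, 2.0, 3.0]
print(window_3_to_6(acc_0_3, acc_0_6))
print(suspicious_points(acc_0_3, acc_0_6))
```

With xarray the same subtraction can be done on the first two steps of the tp or acpcp variable directly.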
Hi Chiara,
Can you reach out to me at @.***? We can loop in the data scientist to address this.
Thanks,
Patrick Keown
Thank you for bringing these data issues to our attention. We are working on fixing the issues you brought up on github (https://github.com/awslabs/open-data-registry/issues/1994). My coworker is fixing and sending the data for 2004102400, 2006033000 and 2006033000 to AWS. I believe she has mostly completed this process, but I will confirm with her when she returns from vacation.
Meanwhile, we are verifying and sending this data to our FTP server (ftp://ftp.emc.ncep.noaa.gov/GEFSv12). Please note that this FTP data cannot be accessed through any modern internet browser, but it can be publicly accessed using tools such as the ftp command (e.g. ftp ftp.emc.ncep.noaa.gov).
Hi Chiara,
We are continuing to fix the data issues that you have found.
Thank you.
Hi Chiara,
For June 1 2006, it looks like the f03 and f06 data has already been fixed on the EMC FTP: ftp://ftp.emc.ncep.noaa.gov/GEFSv12/reforecast/2006/06/01/ If you also see this same issue with the data on the EMC FTP for June 2006, please feel free to let me know. The fixes for June 2006 may not all have been carried over to AWS. We will work on bringing these f03 and f06 fixes to AWS.
Thank you.
Thanks for the update!! I only access them through aws so I will wait for that for sure!
Hello! Adding a new small issue.
On some days (for now I have identified only one, 2001 11 15, for all ensemble members, i.e. this folder) the files are repeated twice, but one copy has a missing digit in the date in the filename:
(correct date 2001 11 15) acpcp_sfc_2001111500_p01.grib2 | 3 years ago | 2021-02-17 11:19:41 | 28 MB
(wrong date 2001 11 5) acpcp_sfc_200111500_p01.grib2 | 3 years ago | 2021-02-17 11:19:26 | 30 MB
The problem is that they differ in the first two time steps for some variables.
This is for accumulated precip:
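import cfgrib
one = cfgrib.open_dataset('acpcp_sfc_2001111500_c00.grib2')  # correctly named file
two = cfgrib.open_dataset('acpcp_sfc_200111500_c00.grib2')  # duplicate with the malformed date
# Domain-summed difference for each forecast step
(two.acpcp - one.acpcp).sum(dim=['latitude', 'longitude'])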
array([997719.06, 997308.06, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. ], dtype=float32)
with
(one.isel(step=slice(0,2))-two.isel(step=slice(0,2))).acpcp.plot(col='step')
Surface pressure seems to be identical in both files, and helicity too.
I can't check them all, so I was wondering if you have any guidance. In the case of acpcp, the file with the wrong date is much wetter (probably the first step is problematic and the second one carries the value forward in the accumulation). But I just thought I would let you know.
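Since I can't check every folder by eye, duplicates like this could be found by scanning filenames for a date token of the wrong length. A minimal sketch, assuming every valid name carries exactly one 10-digit YYYYMMDDHH token between underscores; the function names are hypothetical.

```python
import re


def has_valid_cycle(name, digits=10):
    """True if the name contains an underscore-delimited run of exactly
    `digits` digits, e.g. the 2001111500 in acpcp_sfc_2001111500_p01.grib2."""
    tokens = re.findall(r"(?<=_)\d+(?=_)", name)
    return any(len(token) == digits for token in tokens)


def malformed_names(names):
    """Return the names that lack a well-formed YYYYMMDDHH token."""
    return [name for name in names if not has_valid_cycle(name)]


names = [
    "acpcp_sfc_2001111500_p01.grib2",
    "acpcp_sfc_200111500_p01.grib2",  # missing digit in the date
]
print(malformed_names(names))
```

This only catches malformed names; whether the contents differ still needs a comparison like the one above.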
@chiaral Thank you for bringing this to our attention. We are investigating 2001111500.
The f03 and f06 fixes for June 2006 have recently been sent to AWS.
The files with the incorrect date ("200111500") in the filename have been removed from AWS for 20011115. The corrected f03 and f06 data has also been sent to AWS. Many thanks to my co-worker for managing the data on AWS.
Hello!
the wrong valid_time (april vs june) that I had identified for ugrd_hgt_2006033000_c0 and vgrd_hgt_2006033000_c0, I found it for cape_sfc and spfh_2m as well (same date and ensemble).
Hi @chiaral, we are working on correcting the valid_time for cape_sfc and spfh_2m.
(EDITED) After more hiccups here and there, I realized that all the other ensemble members, not just c00, also have the same issue of using the wrong year (2004 instead of 2006) that I found for ugrd_hgt, vgrd_hgt, cape_sfc, and spfh_2m. I also found that the u/vgrd_pres_abv700mb_2006033000 files have it. So I'd probably check other variables as well.
@chiaral Thank you for bringing this to our attention. We are investigating and correcting the incorrect valid times for 2006033000.
@chiaral The issue regarding the incorrect valid times in the 2006033000 grib2 files has been resolved. After further investigation, we found that this issue occurred because 2004033000 data was being mislabeled as "2006033000" in the grib2 filename for days 1-10. The correct 2006033000 data is now being used in the 2006033000 grib2 files. The actual 2006033000 data, however, contains incomplete records in the "abv" files for days 1-10. Unfortunately, we are unable to recover this missing 2006033000 data in the "abv" files for days 1-10.
Thanks - so just to understand better, should I update only the 20060330 data or should I also refresh 20040330 data? It's unclear to me. And is this being propagated to AWS or only on ftp? It's ok about the missing data. thanks.
@chiaral The changes that were explained in my previous message have been propagated to AWS. You should update the 2006033000 data only. Previously, the data labelled as "2006033000" in the filename was actually 2004033000 data for days 1-10. There is no need to update the 2004033000 data.
Hello
I am now looking at the files after 2010. The file apcp_sfc_2012051700_c00 has two different start times (I think this is true for multiple variables, because it was failing across many of them). In particular:
import cfgrib
il = 'apcp_sfc_2012051700_c00.grib2'
dclist = cfgrib.open_datasets(
    il, backend_kwargs={"extra_coords": {"stepRange": "step"}}
)
dclist
[<xarray.Dataset>
Dimensions: (time: 2, step: 80, latitude: 721, longitude: 1440)
Coordinates:
number int64 0
* time (time) datetime64[ns] 2008-05-17 2012-05-17
* step (step) timedelta64[ns] 0 days 03:00:00 ... 10 days 00:00:00
surface float64 0.0
* latitude (latitude) float64 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
* longitude (longitude) float64 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
valid_time (time, step) datetime64[ns] dask.array<chunksize=(2, 80), meta=np.ndarray>
stepRange (step) <U7 dask.array<chunksize=(80,), meta=np.ndarray>
Data variables:
tp (time, step, latitude, longitude) float32 dask.array<chunksize=(2, 80, 721, 1440), meta=np.ndarray>
Attributes:
GRIB_edition: 2
GRIB_centre: kwbc
GRIB_centreDescription: US National Weather Service - NCEP
GRIB_subCentre: 2
Conventions: CF-1.7
institution: US National Weather Service - NCEP]
The problem is
* time (time) datetime64[ns] 2008-05-17 2012-05-17
If your pipeline takes the date from the filename, it won't have issues; if it assumes that the file contains a single time value, it will break.
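A pipeline could guard against this by comparing the initialisation times decoded from the file with the date in the filename. A rough sketch using only the standard library; `decoded` stands in for the `time` coordinate values converted to `datetime` objects, and the helper names are hypothetical.

```python
import re
from datetime import datetime


def expected_init_time(filename):
    """Parse the YYYYMMDDHH token of a name like apcp_sfc_2012051700_c00.grib2."""
    match = re.search(r"(?<=_)\d{10}(?=_)", filename)
    if match is None:
        raise ValueError(f"no YYYYMMDDHH token in {filename!r}")
    return datetime.strptime(match.group(0), "%Y%m%d%H")


def unexpected_init_times(filename, decoded_times):
    """Return the decoded initialisation times that do not match the filename."""
    expected = expected_init_time(filename)
    return sorted(t for t in set(decoded_times) if t != expected)


# The two time values reported for this file in the dataset repr above.
decoded = [datetime(2008, 5, 17), datetime(2012, 5, 17)]
print(unexpected_init_times("apcp_sfc_2012051700_c00.grib2", decoded))
```

An empty result means the file holds only the expected run; anything else is worth reporting before the data goes further downstream.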
Hi @chiaral, Thank you for providing details regarding this issue. We are investigating.
Hi @chiaral, the issue that you found in 2012051700 has been corrected and sent to AWS.
Fantastic! thanks so much for your work!