MIT-AI-Accelerator / eie-sevir

Code for working with the SEVIR weather dataset

Min/Max Values in CATALOG don't match the actual data #2

Open markveillette opened 4 years ago

markveillette commented 4 years ago

There are rare cases in the dataset where the data_min and data_max columns in the catalog don't match the min/max measured from the actual (decoded) images.

For example, consider event R19011212048075 with img_type='ir069'. Its entry in CATALOG.csv is:

id                                                  R19011212048075
file_name         ir069/2019/SEVIR_IR069_RANDOMEVENTS_2019_0101_...
file_index                                                      821
img_type                                                      ir069
time_utc                                        2019-01-12 12:00:00
minute_offsets    -120:-115:-110:-105:-100:-95:-90:-85:-80:-75:-...
episode_id                                                      NaN
event_id                                                        NaN
event_type                                                      NaN
llcrnrlat                                                   38.9436
llcrnrlon                                                  -92.3178
urcrnrlat                                                   42.0725
urcrnrlon                                                  -87.3715
proj              +proj=laea +lat_0=38 +lon_0=-98 +units=m +a=63...
size_x                                                          192
size_y                                                          192
height_m                                                     384000
width_m                                                      384000
!data_min                                                   -23540.1
!data_max                                                     22.877
pct_missing                                                       0
Name: 39505, dtype: object

The minimum value in this case is -23540.1 degrees C, which is a strange value. And if we look at the minimum of the image actually stored in SEVIR, we see a raw value of -18312, which decodes to -183.12. That differs from what is reported above.

Explanation

Looking at the data, this happens when there are a few bad pixels in the image, typically in very high and thick clouds:

[Image: bad_ir069]

Data is converted to int16 before being written to .h5; however, the min/max values entered in the CATALOG are recorded before this cast is done. In cases of bad pixels, these values get very large in magnitude (as happened in this case), and the true minimum of the data causes an int16 overflow when scaled. So the pixel value stored for these bad pixels in SEVIR is garbage (as is the value stored in the CATALOG).
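
To illustrate the overflow, here is a minimal sketch, not the code that actually produced SEVIR. It assumes the ir069 encoding is "multiply by 100, round, cast to int16" (the factor 100 is inferred from -18312 decoding to -183.12), and the helper names `fits_int16` and `encode_wraparound` are hypothetical. The cast goes through int64 first so that the wrap-around is deterministic; the exact casting behavior of the original pipeline is not known.

```python
import numpy as np

INT16_MIN = np.iinfo(np.int16).min  # -32768
INT16_MAX = np.iinfo(np.int16).max  # 32767


def fits_int16(values_c, scale=100):
    """Return True where a value (degrees C) survives scaling to int16."""
    scaled = np.round(np.asarray(values_c, dtype=np.float64) * scale)
    return (scaled >= INT16_MIN) & (scaled <= INT16_MAX)


def encode_wraparound(values_c, scale=100):
    """Scale, round, and cast to int16, keeping only the low 16 bits.

    Assumption: this mimics a wrap-around cast for illustration only.
    """
    scaled = np.round(np.asarray(values_c, dtype=np.float64) * scale)
    # int64 -> int16 discards the high bits, so out-of-range values wrap.
    return scaled.astype(np.int64).astype(np.int16)


values = np.array([-183.12, -23540.1])  # one representable value, one bad pixel
print(fits_int16(values))
print(encode_wraparound(values))
```

The value -23540.1 scales to roughly -2.35 million, far outside the int16 range, so whatever ends up stored for that pixel has no relation to the value recorded in the CATALOG.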

Unfortunately, this cannot be fixed easily without recreating the whole dataset. A good practice is to clip pixels during preprocessing to a physically reasonable range, computed after filtering out outliers like this one.
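
One possible way to do that clipping is sketched below. The percentile thresholds, the 0.01 decode factor, and the helper names `robust_bounds` and `clip_decoded` are assumptions chosen for illustration, not part of the SEVIR code. The idea is to estimate the range from the bulk of the catalog's data_min/data_max values so that a handful of bad entries do not set the bounds.

```python
import numpy as np


def robust_bounds(catalog_mins, catalog_maxs, lo_pct=1.0, hi_pct=99.0):
    """Estimate a clipping range from catalog columns, ignoring outliers.

    Assumption: a percentile cut is good enough to drop the rare bad entries.
    """
    lo = float(np.nanpercentile(catalog_mins, lo_pct))
    hi = float(np.nanpercentile(catalog_maxs, hi_pct))
    return lo, hi


def clip_decoded(raw, lo, hi, scale=0.01):
    """Decode raw int16 pixels to physical units, then clip to [lo, hi]."""
    decoded = np.asarray(raw).astype(np.float64) * scale
    return np.clip(decoded, lo, hi)


# Synthetic stand-ins for the data_min / data_max columns of one img_type,
# with the bad entry from this issue mixed in.
mins = np.append(np.linspace(-80.0, -60.0, 1000), -23540.1)
maxs = np.append(np.linspace(10.0, 30.0, 1000), 22.877)
lo, hi = robust_bounds(mins, maxs)

raw = np.array([[-18312, -7000, 2000]], dtype=np.int16)  # first pixel is bad
clipped = clip_decoded(raw, lo, hi)
print(lo, hi)
print(clipped)
```

In practice the bounds would be computed once per img_type from the real CATALOG and reused for every image of that type.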