
Dealing with missing data #48

Open jsignell opened 1 month ago

jsignell commented 1 month ago

Context

I am running an experiment: processing CONUS for the whole year of 2023, one t at a time. The code failed at t=[2023, 2, 25, "AM"] because the dataframe for that timestep was empty.

Problem statement

While it is not invalid for a particular region+t to have no new fire pixels (#47 ensures the code treats this as valid), it is surprising that a t would have no fire pixels at all for a region as large as CONUS.

Investigation

This led me to look back at the original daily input file:

```python
from fireatlas import FireIO
import hvplot.pandas  # registers the .hvplot accessor on pandas objects

# Read the daily NRT active-fire file for day-of-year 056 (2023-02-25)
df = FireIO.read_VNP14IMGTDL(
    "/projects/shared-buckets/gsfc_landslides/FEDSinput/VIIRS/VNP14IMGTDL/SUOMI_VIIRS_C2_Global_VNP14IMGTDL_NRT_2023056.txt"
)
# Label each pixel as belonging to the AM or PM overpass
df = FireIO.AFP_setampm(df)
df.hvplot.points(y="Lat", x="Lon", c="ampm", geo=True, coastline=True, alpha=0.2, frame_width=800)
```

[Figure: global map of 2023-02-25 daily NRT fire pixels colored by AM/PM; no AM pixels over the Americas and a conspicuous gap over West Africa]

This image clearly shows that there are in fact no AM pixels for the Americas in that dataset. However, the total gap in data over West Africa looked very suspicious, so I suspected some data was missing. To check that assumption, we can look at the monthly data for the same day:

```python
from fireatlas import FireIO
import hvplot.pandas  # registers the .hvplot accessor on pandas objects

# Read the archival monthly file and subset it to the same day (2023-02-25)
df = FireIO.read_VNP14IMGML(
    "/projects/shared-buckets/gsfc_landslides/FEDSinput/VIIRS/VNP14IMGML/VNP14IMGML.202302.C2.01.txt"
)
df = df[(df.datetime >= "2023-02-25") & (df.datetime < "2023-02-26")]
df = FireIO.AFP_setampm(df)
df.hvplot.points(y="Lat", x="Lon", c="ampm", geo=True, coastline=True, alpha=0.2, frame_width=800)
```

[Figure: global map of 2023-02-25 pixels from the monthly file colored by AM/PM; AM pixels are present over the Americas]

Here you can see that there ought to be some AM pixels after all.

How to tell when data is missing

The difficulty with telling when data is missing is that a file doesn't have to be fully empty to be missing points. Greg suggested that if we want to go ahead and preprocess several years of data, we could do a global analysis of the expected number of pixels globally on any given day of year. That could work. We could probably also establish a rough mapping between pixel count and file size, so that we don't even have to open the files to determine whether they are likely missing data. A sketch of both checks follows.
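As a rough sketch of what both checks could look like (everything below is illustrative: the baseline construction, the 50% threshold, and the helper names are assumptions, not existing fireatlas API; only `read_VNP14IMGML`/`read_VNP14IMGTDL` come from the code above):

```python
from pathlib import Path

import pandas as pd

from fireatlas import FireIO

def expected_counts(monthly_paths):
    """Median global pixel count per day of year, built from archival
    monthly files (hypothetical helper)."""
    per_file = []
    for p in monthly_paths:
        df = FireIO.read_VNP14IMGML(str(p))
        per_file.append(df.groupby(df.datetime.dt.dayofyear).size())
    # Align on day of year across files/years and take the median count
    return pd.concat(per_file, axis=1).median(axis=1)

def looks_incomplete(daily_path, doy, baseline, frac=0.5):
    """Flag a daily NRT file whose pixel count falls well below the
    day-of-year baseline; frac=0.5 is an arbitrary starting threshold."""
    df = FireIO.read_VNP14IMGTDL(str(daily_path))
    return len(df) < frac * baseline.loc[doy]

def size_looks_small(daily_path, doy, size_baseline, frac=0.5):
    """Same idea without even opening the file: compare bytes on disk
    against a day-of-year file-size baseline (hypothetical helper)."""
    return Path(daily_path).stat().st_size < frac * size_baseline.loc[doy]
```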

How to correct missing data

Once we know we have missing data, we need to check whether there is better data upstream, download it, and reprocess anything that has already run.

There are a few ways we can try to get better data (see the sketch after this list):

1) Are there new versions of the NRT files that need to be cloned into the input bucket?
2) Are there monthly files that need to be downloaded?
3) If there are monthly files, are those the ones that were used in the preprocessing?
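As a hedged sketch of that decision flow (the directory layout mirrors the paths used earlier in this issue; the helper itself and its return convention are assumptions, not existing fireatlas code):

```python
from pathlib import Path

# Input bucket layout as used in the plots above
INPUT = Path("/projects/shared-buckets/gsfc_landslides/FEDSinput/VIIRS")

def best_source_for(year: int, month: int):
    """Prefer the archival monthly file when it exists; otherwise fall
    back to the daily NRT directory (hypothetical helper).

    Returns (path, used_monthly) so callers can record whether
    preprocessing ran against the better source -- answering question 3.
    """
    monthly = INPUT / "VNP14IMGML" / f"VNP14IMGML.{year}{month:02d}.C2.01.txt"
    if monthly.exists():
        return monthly, True
    # Question 1 -- re-cloning newer NRT file versions into the bucket --
    # would happen upstream, before this fallback is taken.
    return INPUT / "VNP14IMGTDL", False
```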

If we can't get better data, then ideally we tag that timestep in some way to indicate that it is suspect.

If we can get better data, then we need to reprocess. Right now this means (sketched below):

1) rerun preprocess_t with force=True, uploading to s3
2) rerun preprocess_region_t with force=True, uploading to s3
3) rerun Fire_Forward. There isn't really a good way to force Fire_Forward to rehydrate from only a certain t onward, but that would be straightforward to add.
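Spelled out as a sketch (the function names and force=True come from the steps above, but the module locations, argument names, and region shape are assumptions about the current fireatlas layout):

```python
from fireatlas import FireMain, preprocess

t = [2023, 2, 25, "AM"]   # the timestep found to be missing data
region = ["CONUS", None]  # region id plus geometry; exact shape assumed

# 1) Regenerate the preprocessed global file for this t, overwriting on s3
preprocess.preprocess_t(t, force=True)

# 2) Regenerate the clipped per-region file for this t
preprocess.preprocess_region_t(t, region, force=True)

# 3) Rerun Fire_Forward over a window containing t. As noted above, there
#    is no way yet to rehydrate from a specific t, so this replays the
#    whole remaining span.
FireMain.Fire_Forward(tst=t, ted=[2023, 12, 31, "PM"], region=region)
```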