Context
I am doing an experiment running CONUS for the whole year of 2023, one t at a time. The code failed at t=[2023, 2, 25, "AM"] because that dataframe was empty.
Problem statement
While it is not invalid for there to be no new fire pixels for a particular region+t (#47 ensures that the code agrees that this is valid), it is surprising that there would be a t with no fire pixels for a region as large as CONUS.
Investigation
This led me to look back at the original daily input file:
This image clearly shows that there are in fact no AM pixels for the Americas in that dataset. However, the total gap in data over West Africa looked very suspicious to me, so I suspected some data was missing. To check that assumption, we can look at the monthly data for the same day:
Here you can see that there ought to be some AM pixels after all.
How to tell when data is missing
The difficulty with detecting missing data is that a file doesn't have to be completely empty to be missing points. Greg suggested that if we want to go ahead and preprocess several years of data, we could do a global analysis of the expected number of pixels globally on any given day of year. That could work. We could probably also do a rough translation between number of pixels and filesize, so that we don't even have to open the files to determine whether they are likely to be missing data.
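The filesize heuristic above could be sketched roughly like this. All of the names here are hypothetical, and the assumption is that for a given day of year, historical daily files have fairly stable sizes, so a file far below the historical mean is probably missing data:

```python
from pathlib import Path
from statistics import mean, stdev

def expected_sizes(historical_files):
    """Group historical daily-file sizes by day of year.

    `historical_files` is assumed to be a mapping of
    (year, day_of_year) -> Path; this layout is hypothetical.
    """
    by_doy = {}
    for (year, doy), path in historical_files.items():
        by_doy.setdefault(doy, []).append(path.stat().st_size)
    return by_doy

def looks_incomplete(path, doy, by_doy, n_sigma=3.0):
    """Flag a file whose size is far below the historical mean for
    that day of year -- a cheap proxy for pixel count that avoids
    opening the file at all."""
    sizes = by_doy.get(doy, [])
    if len(sizes) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(sizes), stdev(sizes)
    return path.stat().st_size < mu - n_sigma * sigma
```

A pixel-count version would look the same, just with counts from the global analysis in place of `st_size`.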
How to correct missing data
Once we know we have missing data, we need to check whether there is better data upstream, download it, and reprocess anything that has already happened.
There are a few ways we can try to get better data:
1) Are there new versions of the NRT files that need to be cloned into the input bucket?
2) Are there monthly files that need to be downloaded?
3) If there are monthly files, are those the ones that were used in the preprocessing?
If we can't get better data, then ideally we tag that timestep in some way to indicate that it is suspect.
If we can get better data then we need to reprocess. Right now this means:
1) rerun preprocess_t with force=True and upload to s3.
2) rerun preprocess_region_t with force=True and upload to s3.
3) rerun Fire_Forward. There isn't currently a good way to force Fire_Forward to rehydrate only from a certain t onward, but that would be straightforward to add.
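The three steps above could be wired together like this. This is only a sketch: the callables are stand-ins for preprocess_t, preprocess_region_t, and Fire_Forward (their real signatures may differ), and the `start` parameter is the proposed "rehydrate from a certain t" hook, which does not exist yet:

```python
def reprocess_from(t_start, ts, preprocess_t, preprocess_region_t, fire_forward):
    """Redo everything at or after t_start.

    `ts` is the ordered list of timesteps for the run, where each t
    sorts chronologically (e.g. (year, month, day, "AM"/"PM") tuples,
    since "AM" < "PM"). force=True makes the preprocess steps
    overwrite the stale outputs already in s3.
    """
    to_redo = [t for t in ts if t >= t_start]
    for t in to_redo:
        preprocess_t(t, force=True)          # step 1: rebuild the global file
        preprocess_region_t(t, force=True)   # step 2: rebuild the regional file
    # step 3: rerun Fire_Forward from t_start onward (proposed hook)
    fire_forward(start=t_start)
    return to_redo
```

With the hook in place, correcting a bad day like 2023-02-25 AM would only touch timesteps from that point forward instead of replaying the whole year.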