Earth-Information-System / fireatlas


Data discrepancies between V3 and V2 data in API: Also API not updating past 7/31/2024 #127

Open mccabete opened 1 month ago

mccabete commented 1 month ago

There are some discrepancies in the API between the V2 and V3 perimeters:

This is the Park fire, as represented by the V2 perimeters. The start date matches the date reported by the news:

image

Now, when we query that same area in the V3 perimeters we get two fires that seem to represent the "edges" of the fire, and "start" later:

image

image

image

I think the full history of fires is getting chopped off somehow by the V2 -> V3 merge. I also think that the fire perimeters were generated starting from 7/28, without the history of the fire.
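
For reference, a rough sketch of the kind of query I'm describing against the features API (the V3 collection name is from the API link later in this thread; the bbox is approximate and the V2 collection name would need to be filled in):

import requests

base = "https://firenrt.delta-backend.com/collections/{}/items"
park_bbox = "-122.5,39.5,-121.0,40.5"  # rough Park fire area, illustrative only

for collection in [
    "public.eis_fire_snapshot_perimeter_nrt",   # V3 snapshot perimeters
    # "<v2 perimeter collection>",              # substitute the V2 collection here to compare
]:
    resp = requests.get(
        base.format(collection),
        params={"bbox": park_bbox, "limit": 200, "f": "json"},
    )
    resp.raise_for_status()
    times = sorted(f["properties"]["t"] for f in resp.json()["features"])
    print(collection, "earliest t:", times[0] if times else "no features returned")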

ranchodeluxe commented 1 month ago

Yeah, let's look into this.

  1. we should debug if it's just the API. It might be an issue with how we are exporting layers or deciding on the `t`.
  2. I see data for 2024-08-30 in the CONUS API: https://firenrt.delta-backend.com/collections/public.eis_fire_snapshot_perimeter_nrt/items?limit=100&sortby=-t&f=json&region=CONUS and when I check the backing GDF store it corresponds with what's in there for the most recent date (a quick programmatic version of that check is sketched below)
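
A quick sketch of that API check done programmatically (same collection and query parameters as the link above):

import requests

url = (
    "https://firenrt.delta-backend.com/collections/"
    "public.eis_fire_snapshot_perimeter_nrt/items"
)
resp = requests.get(url, params={"limit": 1, "sortby": "-t", "f": "json", "region": "CONUS"})
resp.raise_for_status()
features = resp.json()["features"]
print("most recent t in the API:", features[0]["properties"]["t"] if features else None)
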
ranchodeluxe commented 1 month ago
1. we should debug if it's just the API. It might be an issue with how we are exporting layers or deciding on the `t`.

This doesn't seem to be an export issue or API related. The backing AllFires parquet files (even for a different `ted` in this script) all show the same thing you are describing. Please double-check the work:

from fireatlas import FireIO, preprocess, postprocess
import branca.colormap as cm
import folium
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

tst = [2024, 7, 23, "AM"]  # unused except for the year
ted = [2024, 8, 4, "AM"]  # this has to match a date for an allfires_<>.parquet file in the output

# read the CONUS region definition and the backing allfires GeoDataFrame from s3
region = ("CONUS", [])
region = preprocess.read_region(region, location="s3")
gdf = postprocess.read_allfires_gdf(tst, ted, region, location="s3").reset_index()

# only map date ranges we care about
gdf[gdf['t'].dt.date >= pd.to_datetime('2024-07-25').date()].head(3)

# bounding box around the Park fire, reprojected to match the allfires CRS
minx, miny, maxx, maxy = -123.662109, 38.623710, -120.053101, 41.045787
bbox = box(minx, miny, maxx, maxy)
bbox_gdf = gpd.GeoDataFrame([[bbox]], columns=['geometry'], crs="EPSG:4326")
bbox_gdf = bbox_gdf.to_crs('EPSG:9311')

# clip the fire hulls to the bbox and inspect them interactively
hull_gdf = gdf['hull']
intersected_gdf = gpd.clip(hull_gdf, bbox_gdf)
intersected_gdf.explore()
Screen Shot 2024-08-08 at 6 59 05 AM
ranchodeluxe commented 1 month ago

I think the next step for debugging is to run the whole shebang locally for the range 2024-07-20 to 2024-08-08 and see if we get better results.
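
Roughly something like this, assuming FireMain.Fire_Forward is still the forward-run driver with the tst/ted/region arguments used elsewhere in this thread (the real v3 entry point and signature may differ, and the small-bbox test would use a custom region instead of CONUS):

from fireatlas import FireMain, preprocess

tst = [2024, 7, 20, "AM"]
ted = [2024, 8, 8, "PM"]

# region definition for the run
region = preprocess.read_region(("CONUS", []), location="s3")

# signature assumed from v2-era usage; the v3 coordinator may wrap this differently
result = FireMain.Fire_Forward(tst=tst, ted=ted, region=region)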

mccabete commented 1 month ago

Hmmm, I think you are right --- it's a start time issue. I think the run we are seeing isn't reading in data from before 7/28 for some reason -- which, yeah, suggests that something is up with the code, not the API.

ranchodeluxe commented 1 month ago

Good news! A full run for that small bbox for range 2024-07-20 to 2024-08-08 worked!

Screen Shot 2024-08-08 at 10 21 58 AM
mccabete commented 1 month ago

Huh. That is good news! Ok, so the code fundamentally works, the API ingest works. Where is this bug coming from? -- Is there any way an incorrect start date might have made it into the workflow?

ranchodeluxe commented 1 month ago

Huh. That is good news! Ok, so the code fundamentally works, the API ingest works. Where is this bug coming from? -- Is there any way an incorrect start date might have made it into the workflow?

Well, we have to clarify the following things:

  1. maybe your start dates are still wrong in this recent output
  2. the CONUS runs do things in ~4 month batches to beat the 24-hour time limit. So first we ran from 2024-03-25 to 2024-05-25 (there is an allfires parquet for this) and then from 2024-05-25 to 2024-08-04 (there is an allfires parquet for this). So maybe the pickup logic does something strange that we need to weed out (see the sketch below)
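
One way to sanity-check 2, reusing the read_allfires_gdf helper from earlier in this thread (the fire-id column name is assumed):

from fireatlas import preprocess, postprocess

region = preprocess.read_region(("CONUS", []), location="s3")

# the two CONUS batches mentioned above
batch1 = postprocess.read_allfires_gdf([2024, 3, 25, "AM"], [2024, 5, 25, "AM"], region, location="s3").reset_index()
batch2 = postprocess.read_allfires_gdf([2024, 5, 25, "AM"], [2024, 8, 4, "AM"], region, location="s3").reset_index()

# if the pickup logic preserves history, fires spanning the 2024-05-25 boundary should keep
# their original earliest t in batch2 instead of "starting" at the pickup date
fid_col = "fireID"  # assumed; adjust to the actual fire-id column in the allfires gdf
print(batch2.groupby(fid_col)["t"].min().sort_values().head(10))
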
jsignell commented 1 month ago

Ok so here is something. There is data in the NOAA20 input file for July 25th over CONUS, but that file kept being updated up until midnight ET on the 25th, while the preprocessed data was last updated at midnight ET on the 24th. So understandably the preprocessed file has almost nothing in it.

image
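
A sketch of how to confirm this from timestamps (the bucket and key names here are placeholders, not the real layout):

import s3fs

fs = s3fs.S3FileSystem()

# hypothetical keys -- substitute the real raw-input and preprocessed locations
raw_key = "s3://example-bucket/raw/NOAA20/2024-07-25.txt"
pre_key = "s3://example-bucket/FEDSpreprocessed/CONUS/NOAA20/2024-07-25.txt"

for key in (raw_key, pre_key):
    info = fs.info(key)
    print(key, "last modified:", info.get("LastModified"))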

mccabete commented 1 month ago

So this is a "We downloaded the data before it was complete, then kept it and didn't re-download it" problem.

mccabete commented 1 month ago

So if we delete the data and re-download it, everything "should" work. Is there any reason not to do that (i.e., would it prevent us from debugging further)? We could archive the output where we found the holes.

mccabete commented 1 month ago

Copying from slack

I think it's totally possible that time zones got us on this one. If they make a file before the satellite has completed its orbit (reasonable, lots of places need the data fast), and we download it then and then carefully avoid re-downloading it, then we will see the same errors.
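
To illustrate the failure mode (this is not the actual fireatlas download code), the fix is basically to treat any file within a refresh window as stale:

import datetime
import os

def is_stale(local_path: str, file_date: datetime.date, refresh_window_days: int = 2) -> bool:
    """Treat a cached daily file as stale if it's missing or recent enough that the
    upstream daily file could still be getting appended to."""
    if not os.path.exists(local_path):
        return True
    today_utc = datetime.datetime.now(datetime.timezone.utc).date()
    return (today_utc - file_date).days < refresh_window_days

# e.g. the July 25th file cached at midnight ET on the 24th would get re-downloaded
print(is_stale("NOAA20_2024-07-25.txt", datetime.date(2024, 7, 25)))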

I'm unclear on why this problem started cropping up after 7/31, because if my "timezone" theory is correct, then our timing would have had to change. There are two commits that I see as potential "gotchas":

My commit 7/31 switching from nrt3 to nrt4: https://github.com/Earth-Information-System/fireatlas/commit/f93b7381ac67739c01e55947832de0d6408200d0

If the LAADS update schedule is different between servers, then maybe the change introduced new timing.

The other commit is the change to the primary key workflows the next day. I can't find the line, but I think it's possible that this changed the cron timing of our data update checker, and then we started getting data from before the CONUS data was available.

mccabete commented 1 month ago

We were able to get v2 outputs on August 1st though. To me, that says this is a scheduling issue on our end.

ranchodeluxe commented 1 month ago

So if we delete the data and re-download it, everything "should" work. Is there any reason not to do that?

I kicked off a new run for CONUS late on Friday, but by Saturday I had somehow corrupted the outputs with my investigation 😞 I kicked off a new run for CONUS on Sunday that will not be done until mid-Monday.

We were able to get v2 outputs on August 1st though. To me, that says this is a scheduling issue on our end.

Yes, scheduling is a big part of it. But there are two things to think about with the schedules:

  1. "How" the raw NOAA20 inputs get downloaded is the first schedule to think about. v2 was re-downloading the raw satellite data every four hours with a separate job for the current day. v3 tries not to do that for the current day but given that they update a single daily file in batches throughout the day we have to go back to a flow that always reprocesses at least the current day's raw NOAA20 files. This also means the preprocessed cache (/FEDSpreprocessed/NOAA/ and /FEDSpreprocessed/<region>/NOAA20/) will also have to be updated for each of those runs.

  2. When the algorithm runs for the current day is the most important schedule to think about. v2 and v3 algorithms pick up from where they left off and only run forward. So it's really crucial, per region, to have a "best guess" of when to run the algorithm for the current day so that it will find the most NOAA20 input pixels. Unfortunately this is tough b/c it's not static and the satellite orbits change.

Regarding 2, in an effort to find reasonable scheduled times without basing them off of NOAA20 orbit predictions (yet), and to weed out when data we request is not available/offline, I wrote a script to download the current day's data every hour. The output files are available in the pangeo-shared-dps-outputs workspace on the ADE under /projects/NOAA20/. Then I graphed and output some basic info about each hourly update with the code below (graph timesteps are ordered left to right and wrap from there):

Screen Shot 2024-08-12 at 2 10 59 AM
from fireatlas import FireIO
from functools import reduce
import hvplot.pandas
import pandas as pd
import geopandas as gpd
import datetime
from shapely.geometry import Point
from shapely.geometry import box

# CONUS bounding box used to count how many downloaded pixels fall over CONUS
minx, miny, maxx, maxy = -126.401171875, 24.071240929282325, -61.36210937500001, 49.40003415463647
bbox = box(minx, miny, maxx, maxy)
bbox_conus = gpd.GeoDataFrame([[bbox]], columns=['geometry'], crs="EPSG:4326")

plots = []
start_date = datetime.datetime(2024, 8, 11, 14)
end_date = datetime.datetime(2024, 8, 12, 9)
current_date = start_date
while current_date < end_date:
    # each hourly download was saved as file_<YYYYMMDD_HH>.txt by the download script
    try:
        df = FireIO.read_VNP14IMGTDL(f"/projects/NOAA20/file_{current_date.strftime('%Y%m%d_%H')}.txt")
        df['datetimestr'] = df['datetime'].dt.strftime('%Y-%m-%d %H:%M:%S')
        df = FireIO.AFP_setampm(df)
    except pd.errors.EmptyDataError:
        print(f"NOAA20 VIIRS download @ {current_date.strftime('%Y%m%d_%H')} file empty 404")
        current_date = current_date + datetime.timedelta(hours=1)
        continue

    # count AM/PM pixels globally and how many land inside the CONUS bbox
    geometry = [Point(xy) for xy in zip(df['Lon'], df['Lat'])]
    gdf = gpd.GeoDataFrame({}, geometry=geometry)
    gdf.set_crs(epsg=4326, inplace=True)
    clip = gpd.clip(gdf, bbox_conus)
    print(f"NOAA20 VIIRS download @ {current_date.strftime('%Y%m%d_%H')} with counts GLOBAL_AM={len(df[df['ampm']=='AM'])}, GLOBAL_PM={len(df[df['ampm']=='PM'])}, CONUS_ALL={len(clip['geometry'])}")

    plots.append(df.hvplot.points(y="Lat", x="Lon", c="ampm", hover_cols=['Lat', 'Lon', 'datetimestr', 'ampm'], geo=True, coastline=True, alpha=.2, frame_width=800))
    current_date = current_date + datetime.timedelta(hours=1)

# overlay all the hourly plots into a single layout
combined_plot = reduce(lambda x, y: x + y, plots)
combined_plot.opts(
    width=800,
    height=400,
    title="",
    legend_position='top_left'
)
NOAA20 VIIRS download @ 20240811_14 with counts GLOBAL_AM=34069, GLOBAL_PM=28024, CONUS_ALL=1978
NOAA20 VIIRS download @ 20240811_15 with counts GLOBAL_AM=34069, GLOBAL_PM=28024, CONUS_ALL=1978
NOAA20 VIIRS download @ 20240811_16 with counts GLOBAL_AM=40841, GLOBAL_PM=45645, CONUS_ALL=1980
NOAA20 VIIRS download @ 20240811_17 with counts GLOBAL_AM=40852, GLOBAL_PM=50139, CONUS_ALL=1980
NOAA20 VIIRS download @ 20240811_18 with counts GLOBAL_AM=41060, GLOBAL_PM=50146, CONUS_ALL=1980
NOAA20 VIIRS download @ 20240811_19 with counts GLOBAL_AM=41060, GLOBAL_PM=50146, CONUS_ALL=1980
NOAA20 VIIRS download @ 20240811_20 with counts GLOBAL_AM=41237, GLOBAL_PM=51200, CONUS_ALL=1980
NOAA20 VIIRS download @ 20240811_21 with counts GLOBAL_AM=42116, GLOBAL_PM=57132, CONUS_ALL=1980
NOAA20 VIIRS download @ 20240811_22 with counts GLOBAL_AM=42406, GLOBAL_PM=59406, CONUS_ALL=2034
NOAA20 VIIRS download @ 20240811_23 with counts GLOBAL_AM=45892, GLOBAL_PM=67589, CONUS_ALL=2430
NOAA20 VIIRS download @ 20240812_00 file empty 404
NOAA20 VIIRS download @ 20240812_01 file empty 404
NOAA20 VIIRS download @ 20240812_02 file empty 404
NOAA20 VIIRS download @ 20240812_03 file empty 404
NOAA20 VIIRS download @ 20240812_04 file empty 404
NOAA20 VIIRS download @ 20240812_05 with counts GLOBAL_AM=71, GLOBAL_PM=6, CONUS_ALL=0
NOAA20 VIIRS download @ 20240812_06 with counts GLOBAL_AM=1049, GLOBAL_PM=42, CONUS_ALL=0
NOAA20 VIIRS download @ 20240812_07 with counts GLOBAL_AM=5756, GLOBAL_PM=4525, CONUS_ALL=0
NOAA20 VIIRS download @ 20240812_08 with counts GLOBAL_AM=5756, GLOBAL_PM=4525, CONUS_ALL=0

conclusion

For CONUS and BorealNA it looks like AM runs should kick off at 15:00 UTC and PM runs at 23:00 UTC

For RussiaEast it looks like PM runs should kick off at 11:00 UTC and AM runs at 23:00 UTC.
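
Captured as a simple lookup for reference (UTC hours, straight from the numbers above; how these get wired into the scheduler is TBD):

KICKOFF_UTC_HOUR = {
    "CONUS":      {"AM": 15, "PM": 23},
    "BorealNA":   {"AM": 15, "PM": 23},
    "RussiaEast": {"AM": 23, "PM": 11},
}

def kickoff_hour(region: str, ampm: str) -> int:
    """UTC hour at which the given region's AM/PM run should be scheduled."""
    return KICKOFF_UTC_HOUR[region][ampm]

print(kickoff_hour("CONUS", "AM"))  # 15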

mccabete commented 1 month ago

This is great experimentation. It seems like there is a plan for NOAA-20.

Re:

"How" the raw NOAA20 inputs get downloaded is the first schedule to think about. v2 was re-downloading the raw satellite data every four hours with a separate job for the current day. v3 tries not to do that for the current day but given that they update a single daily file in batches throughout the day we have to go back to a flow that always reprocesses at least the current day's raw NOAA20 files. This also means the preprocessed cache (/FEDSpreprocessed/NOAA/ and /FEDSpreprocessed//NOAA20/) will also have to be updated for each of those runs.

I agree. Some daily reprocessing seems unavoidable, especially as we add in NOAA-21 or if SNPP comes back, because the alternative is "we wait until the perfect moment when all data is available from satellites with different overpass times," which, as you point out, is a moving target that depends on changing orbits and processing + downlink time.

Second, high-latitude regions get WAY more overpasses a day, so for high lats the wishlist is FEDS but, instead of having two perimeters a day, however many perimeters are possible given the number of overpasses. That to me says we preprocess the current day (+/- a buffer for time zone headaches) and then use the preprocessed files.