HARPgroup / model_meteorology

0 stars 0 forks source link

Bug: issue with NLDAS not generating precip in small/coastal areas #62

Closed rburghol closed 1 month ago

rburghol commented 4 months ago

Trying to track down the issue with NLDAS not generating precip in the command: ( _Originally posted by @COBrogan in #57_)

/opt/model/meta_model/run_model raster_met PRISM usgs_ws_01660400 auto geo download

The error keeps saying that all pixels in the area have the NoData value. NOTICE: All pixels of band have the NODATA value. But, this does not appear to be the case. If we look at a specific point within the watershed on a day in which we know there is rain in Aquia Creek for usgs_ws_01660400, we can see that NLDAS should have data for this watershed based on the full NLDAS rasters for that day. However, when we clip the raster to just the gage, the clipped raster no longer has data:

SQL ``` select met.featureid, to_timestamp(met.tstime) as obs_date, extract(month from to_timestamp(met.tstime)) as mo, ST_Value(met.rast, ST_SetSRID(ST_Point(-77.5, 38.5),4326)) as fullRastPoint, ST_Value(st_clip(met.rast, 1, fgeo.dh_geofield_geom,9999), ST_SetSRID(ST_Point(-77.5, 38.5),4326)) as clipRastPoint from dh_feature as f left outer join field_data_dh_geofield as fgeo on ( fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature' ) left outer join dh_variabledefinition as v on ( v.varkey = 'nldas2_obs_hourly' ) left outer join dh_feature as mcov on ( mcov.hydrocode = 'cbp6_met_coverage' ) left outer join dh_timeseries_weather as met on ( mcov.hydroid = met.featureid and met.varid = v.hydroid ) where f.hydrocode = 'usgs_ws_01660400' AND extract(year from to_timestamp(met.tsendtime)) = 2023 AND extract(month from to_timestamp(met.tsendtime)) = 1 AND extract(day from to_timestamp(met.tsendtime)) IN (17) order by met.tstime; ```
featureid obs_date mo fullrastpoint cliprastpoint
617850 2023-01-16 23:00:00-05 1 0
617850 2023-01-17 00:00:00-05 1 0
617850 2023-01-17 01:00:00-05 1 0
617850 2023-01-17 02:00:00-05 1 0
617850 2023-01-17 03:00:00-05 1 0
617850 2023-01-17 04:00:00-05 1 0.0222
617850 2023-01-17 05:00:00-05 1 0.0428
617850 2023-01-17 06:00:00-05 1 0.1352

This is NOT the case for Strausberg. At hydrocode usgs_ws_01634000, we can see that a specific point in the watershed shows equal rainfall in both the full and clipped rasters. This makes me think something has gone wrong with our ST_clip command, but I haven't tracked it down yet

SQL ``` select met.featureid, to_timestamp(met.tstime) as obs_date, extract(month from to_timestamp(met.tstime)) as mo, ST_Value(met.rast, ST_SetSRID(ST_Point(-78.7, 38.8),4326)) as fullRastPoint, ST_Value(st_clip(met.rast, 1, fgeo.dh_geofield_geom,9999), ST_SetSRID(ST_Point(-78.7, 38.8),4326)) as clipRastPoint from dh_feature as f left outer join field_data_dh_geofield as fgeo on ( fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature' ) left outer join dh_variabledefinition as v on ( v.varkey = 'nldas2_obs_hourly' ) left outer join dh_feature as mcov on ( mcov.hydrocode = 'cbp6_met_coverage' ) left outer join dh_timeseries_weather as met on ( mcov.hydroid = met.featureid and met.varid = v.hydroid ) where f.hydrocode = 'usgs_ws_01634000' AND extract(year from to_timestamp(met.tsendtime)) = 2023 AND extract(month from to_timestamp(met.tsendtime)) = 1 AND extract(day from to_timestamp(met.tsendtime)) IN (17) order by met.tstime; ```
featureid obs_date mo fullrastpoint cliprastpoint
617850 2023-01-16 23:00:00-05 1 0 0
617850 2023-01-17 00:00:00-05 1 0 0
617850 2023-01-17 01:00:00-05 1 0.0002 0.0002
617850 2023-01-17 02:00:00-05 1 0.2112 0.2112
617850 2023-01-17 03:00:00-05 1 0.3976 0.3976
617850 2023-01-17 04:00:00-05 1 0.4244 0.4244
617850 2023-01-17 05:00:00-05 1 0.212 0.212
617850 2023-01-17 06:00:00-05 1 0.2228 0.2228

It's clear that our clip at the very least generates a raster of the shape that we'd generally expect:

POLYGON((-77.6255 38.6255,-77.3755 38.6255,-77.3755 38.3755,-77.6255 38.3755,-77.6255 38.6255))
SQL ``` select ST_AsText(ST_Envelope(met.rast)), ST_AsText(ST_Envelope(st_clip(met.rast, 1, fgeo.dh_geofield_geom,9999,TRUE))) from dh_feature as f left outer join field_data_dh_geofield as fgeo on ( fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature' ) left outer join dh_variabledefinition as v on ( v.varkey = 'nldas2_obs_hourly' ) left outer join dh_feature as mcov on ( mcov.hydrocode = 'cbp6_met_coverage' ) left outer join dh_timeseries_weather as met on ( mcov.hydroid = met.featureid and met.varid = v.hydroid ) where f.hydrocode = 'usgs_ws_01660400' AND extract(year from to_timestamp(met.tsendtime)) = 2023 AND extract(month from to_timestamp(met.tsendtime)) = 1 AND extract(day from to_timestamp(met.tsendtime)) IN (17) order by met.tstime LIMIT 1; ```

With some more research, my guess is that ST_clip doesn't grab partial cells from the raster - that is, it only grabs cells that are contained in the file or at least that the centroid is in. See here. My gage is in between several cells. The below image shows only the extent of the clipped raster from above on January 17th and shows that our site doesn't contain any full pixels:

image

If I use ST_intersection I start to see values at our site. I imagine I would get similar results with ST_DumpAsPolygons but I'm guessing that will be computationally slow. ST_intersection is already pretty slow based on my testing:

SQL ``` select met.featureid, to_timestamp(met.tstime) as obs_date, extract(month from to_timestamp(met.tstime)) as mo, (ST_Intersection(met.rast, fgeo.dh_geofield_geom)).val AS clipVal FROM dh_feature as f LEFT JOIN field_data_dh_geofield as fgeo ON ( fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature' ) LEFT JOIN dh_variabledefinition as v ON ( v.varkey = 'nldas2_obs_hourly' ) LEFT JOIN dh_feature as mcov ON ( mcov.hydrocode = 'cbp6_met_coverage' ) LEFT JOIN dh_timeseries_weather as met ON ( mcov.hydroid = met.featureid and met.varid = v.hydroid ) WHERE f.hydrocode = 'usgs_ws_01660400' AND extract(year from to_timestamp(met.tsendtime)) = 2023 AND extract(month from to_timestamp(met.tsendtime)) = 1 AND extract(day from to_timestamp(met.tsendtime)) IN (17) ORDER BY met.tstime; ```
featureid obs_date mo clipval
617850 2023-01-16 23:00:00-05 1 0
617850 2023-01-17 00:00:00-05 1 0
617850 2023-01-17 01:00:00-05 1 0
617850 2023-01-17 02:00:00-05 1 0
617850 2023-01-17 03:00:00-05 1 0
617850 2023-01-17 04:00:00-05 1 0.1597999930381775
617850 2023-01-17 04:00:00-05 1 0.04280000180006027
617850 2023-01-17 04:00:00-05 1 0.07980000227689743
617850 2023-01-17 04:00:00-05 1 0.022199999541044235
617850 2023-01-17 05:00:00-05 1 0.20419999957084656
617850 2023-01-17 05:00:00-05 1 0.06920000165700912
617850 2023-01-17 05:00:00-05 1 0.09019999951124191
617850 2023-01-17 05:00:00-05 1 0.04280000180006027
617850 2023-01-17 06:00:00-05 1 0.42640000581741333
rburghol commented 4 months ago

@Cobrogan the docs on st_clip, to my read (I could be wrong) suggest that intersection is implicit in st_clip: https://postgis.net/docs/RT_ST_Clip.html

image

Either way, we can control that behavior in how we call st_clip, though still, the st_clip vs st_intersects could be an important distinction, but I am troubled by the image you showed.

Since the Culpepper gage contains the Ruckersville gage, I would expect there to be many more cells overlapping than what you're showing -- I might just be getting confused by the image blocks tho.

That said, at the every least I would expect the Culpepper gage to be less likely than the Ruckersville to have no data problems. I suspect maybe I goofed something up in the query during some recent tidy sessions.

Which makes me wonder: what do the daymet and prism containments look like?

rburghol commented 4 months ago

From @COBrogan I'll draft up some figures showing daymet and prism containments. I've tried messing around with ST_clip, but setting crop to TRUE or FALSE yield the same results. My understanding is it just affects the extent of the raster that is returned, with "NoData" value being used to populate cells outside of the intersection. ST_clip is the more efficient version of ST_intersection, but I still haven't found how to get the bordering cells to be included.

Interestingly, resampling the raster to the PRISM resolution also yielded no results. So, these figures showing the containments should help identify if something is going wrong in ST_clip or if we are facing a resolution problem:

ST_Resample SQL Details ``` WITH prism AS ( select met.rast FROM dh_feature as f LEFT JOIN field_data_dh_geofield as fgeo ON (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature') LEFT JOIN dh_variabledefinition as v ON (v.varkey = 'prism_mod_daily') LEFT JOIN dh_feature as mcov ON (mcov.hydrocode = 'cbp6_met_coverage') LEFT JOIN dh_timeseries_weather as met ON (mcov.hydroid = met.featureid and met.varid = v.hydroid) LIMIT 1 ) select met.featureid, to_timestamp(met.tstime) as obs_date, extract(month from to_timestamp(met.tstime)) as mo, ST_Value(met.rast, ST_SetSRID(ST_Point(-77.6, 38.55),4326)) as fullRastPoint, ST_Value(st_clip(ST_Resample(met.rast,(SELECT rast FROM prism)), 1, fgeo.dh_geofield_geom,9999,TRUE), ST_SetSRID(ST_Point(-77.6, 38.55),4326)) as clipRastPoint FROM dh_feature as f LEFT JOIN field_data_dh_geofield as fgeo ON (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature') LEFT JOIN dh_variabledefinition as v ON (v.varkey = 'nldas2_obs_hourly') LEFT JOIN dh_feature as mcov ON (mcov.hydrocode = 'cbp6_met_coverage') LEFT JOIN dh_timeseries_weather as met ON (mcov.hydroid = met.featureid and met.varid = v.hydroid) WHERE f.hydrocode = 'usgs_ws_01660400' AND extract(year from to_timestamp(met.tsendtime)) = 2023 AND extract(month from to_timestamp(met.tsendtime)) = 1 AND extract(day from to_timestamp(met.tsendtime)) IN (17) ORDER BY met.tstime; ```
rburghol commented 4 months ago

OK - resampling is a good idea, but I am still unclear why this is happening at Culpepper but not Ruckersville. Wonder if something is wrong with the geometry? Like self-intersection or something? The intersection algorithm is very inclusive, it literally just says there is some overlap.

Oh, one question: does "fullrasterpoint" actually just mean a single raster cell value intersecting the point you specified?

Plus @COBrogan I really think it is impossible that Culpepper contains no full pixels -- it's a drainage area of 468sqmi, it has to contain some 15x15km cells... maybe our projection is off, below is USGS map: image

rburghol commented 4 months ago

From @COBrogan Well that's one mystery solved. I've been looking at 01660400, which is Aquia Creek Near Garrisonville. And has a DA of only 35 sq mi. See here for its NWIS page. So it is NOT the Culpepper gage that is giving us trouble, but something much smaller. My bad on the Wild Goose chase there.

As for the potential conflicts in ST_clip for smaller watersheds, see below. Dashed blue is daymet, red is PRISM, and solid black is NLDAS. We can see that there are no NLDAS cells fully contained in the watershed so ST_clip() will not return values for NLDAS. test2

rburghol commented 4 months ago

OK cool, thanks for sorting that out!! @COBrogan

The thing is though, fully contained is not the way it is supposed to be happening. Intersect literally just means they're boundaries cross.

But, re-sampling to the smallest grid we have should do the trick.

Still troubling, however, is that @ilonah22 reported getting no data values for Fredericksburg…

rburghol commented 4 months ago

I agree with your assertion, @rburghol . Maybe something else IS going on. First, while I've seen other users on SO with the same problem, I've yet to see official documentation of ST_clip() that implies this is the intended behavior. Second, the resampling failed to bring results. And, switching to daymet below, my query is returns logical results which indicates that my testing method works:

SQL Code for Daymet single point extraction before/after clip ``` select met.featureid, to_timestamp(met.tstime) as obs_date, extract(month from to_timestamp(met.tstime)) as mo, ST_Value(met.rast, ST_SetSRID(ST_Point(-77.5 38.5),4326)) as fullRastPoint, ST_Value(st_clip(met.rast, 1, fgeo.dh_geofield_geom,9999,FALSE), ST_SetSRID(ST_Point(-77.5 38.5),4326)) as clipRastPoint FROM dh_feature as f LEFT JOIN field_data_dh_geofield as fgeo ON ( fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature' ) LEFT JOIN dh_variabledefinition as v ON ( v.varkey = 'daymet_mod_daily' ) LEFT JOIN dh_feature as mcov ON ( mcov.hydrocode = 'cbp6_met_coverage' ) LEFT JOIN dh_timeseries_weather as met ON ( mcov.hydroid = met.featureid and met.varid = v.hydroid ) WHERE f.hydrocode = 'usgs_ws_01660400' AND extract(year from to_timestamp(met.tsendtime)) = 2023 AND extract(month from to_timestamp(met.tsendtime)) = 1 AND extract(day from to_timestamp(met.tsendtime)) IN (17) ORDER BY met.tstime; ```
featureid obs_date mo fullrastpoint cliprastpoint
617850 2023-01-16 07:00:00-05 1 1.0297999382019043 1.0297999382019043
rburghol commented 4 months ago

@rburghol ......I was using an incorrect set of coordinates. Resampling the NLDAS raster to PRISM results in me getting the same results at a given point. WIthout resampling, I get no data.

Details ``` WITH refRast AS ( select met.rast FROM dh_feature as f LEFT JOIN field_data_dh_geofield as fgeo ON (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature') LEFT JOIN dh_variabledefinition as v ON (v.varkey = 'prism_mod_daily') LEFT JOIN dh_feature as mcov ON (mcov.hydrocode = 'cbp6_met_coverage') LEFT JOIN dh_timeseries_weather as met ON (mcov.hydroid = met.featureid and met.varid = v.hydroid) LIMIT 1 ) select met.featureid, to_timestamp(met.tstime) as obs_date, extract(month from to_timestamp(met.tstime)) as mo, ST_Value(met.rast, ST_SetSRID(ST_Point(-77.5, 38.5),4326)) as fullRastPoint, ST_Value(st_clip(ST_Resample(met.rast,(SELECT rast FROM refRast)), 1, fgeo.dh_geofield_geom,9999,TRUE), ST_SetSRID(ST_Point(-77.5, 38.5),4326)) as resampleClipRast, ST_Value(st_clip(met.rast, 1, fgeo.dh_geofield_geom,9999,TRUE), ST_SetSRID(ST_Point(-77.5, 38.5),4326)) as noResampleClipRast FROM dh_feature as f LEFT JOIN field_data_dh_geofield as fgeo ON (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature') LEFT JOIN dh_variabledefinition as v ON (v.varkey = 'nldas2_obs_hourly') LEFT JOIN dh_feature as mcov ON (mcov.hydrocode = 'cbp6_met_coverage') LEFT JOIN dh_timeseries_weather as met ON (mcov.hydroid = met.featureid and met.varid = v.hydroid) WHERE f.hydrocode = 'usgs_ws_01660400' AND extract(year from to_timestamp(met.tsendtime)) = 2023 AND extract(month from to_timestamp(met.tsendtime)) = 1 AND extract(day from to_timestamp(met.tsendtime)) IN (17) ORDER BY met.tstime; ```
featureid obs_date mo fullrastpoint resamplecliprast noresamplecliprast
617850 2023-01-16 23:00:00-05 1 0 0
617850 2023-01-17 00:00:00-05 1 0 0
617850 2023-01-17 01:00:00-05 1 0 0
617850 2023-01-17 02:00:00-05 1 0 0
617850 2023-01-17 03:00:00-05 1 0 0
617850 2023-01-17 04:00:00-05 1 0.0222 0.0222
617850 2023-01-17 05:00:00-05 1 0.0428 0.0428
617850 2023-01-17 06:00:00-05 1 0.1352 0.1352

Taking this a step further, we can successfully get mean precipitation from the resampled raster:

Details ``` WITH refRast AS ( select met.rast FROM dh_feature as f LEFT JOIN field_data_dh_geofield as fgeo ON (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature') LEFT JOIN dh_variabledefinition as v ON (v.varkey = 'prism_mod_daily') LEFT JOIN dh_feature as mcov ON (mcov.hydrocode = 'cbp6_met_coverage') LEFT JOIN dh_timeseries_weather as met ON (mcov.hydroid = met.featureid and met.varid = v.hydroid) LIMIT 1 ) select met.featureid, to_timestamp(met.tstime) as obs_date, extract(month from to_timestamp(met.tstime)) as mo, (ST_summarystats(st_clip(ST_Resample(met.rast,(SELECT rast FROM refRast)), fgeo.dh_geofield_geom), 1, TRUE)).mean as precip_kgm3 from dh_feature as f left outer join field_data_dh_geofield as fgeo on ( fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature' ) left outer join dh_variabledefinition as v on ( v.varkey = 'nldas2_obs_hourly' ) left outer join dh_feature as mcov on ( mcov.hydrocode = 'cbp6_met_coverage' ) left outer join dh_timeseries_weather as met on ( mcov.hydroid = met.featureid and met.varid = v.hydroid ) where f.hydrocode = 'usgs_ws_01660400' AND extract(year from to_timestamp(met.tsendtime)) = 2023 AND extract(month from to_timestamp(met.tsendtime)) = 1 AND extract(day from to_timestamp(met.tsendtime)) IN (16,17,19,20,21) order by met.tstime; ```
featureid obs_date mo precip_kgm3
617850 2023-01-17 00:00:00-05 1 0
617850 2023-01-17 01:00:00-05 1 0
617850 2023-01-17 02:00:00-05 1 0
617850 2023-01-17 03:00:00-05 1 0
617850 2023-01-17 04:00:00-05 1 0.07275999999999999
617850 2023-01-17 05:00:00-05 1 0.09404
617850 2023-01-17 06:00:00-05 1 0.22464
617850 2023-01-17 07:00:00-05 1 0.4828799999999999
617850 2023-01-17 08:00:00-05 1 0.6930399999999999

Looking at a second watershed that has previously given us trouble brings us to usgs_ws_01669000 PISCATAWAY CREEK NEAR TAPPAHANNOCK, VA: test2

Once again, our typical NLDAS summary method using ST_clip returns no data. But, it does as soon as we resample it to the PRISM resolution:

Gage 01669000 Coverage NLDAS Precip Summary ``` WITH refRast AS ( select met.rast FROM dh_feature as f LEFT JOIN field_data_dh_geofield as fgeo ON (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature') LEFT JOIN dh_variabledefinition as v ON (v.varkey = 'prism_mod_daily') LEFT JOIN dh_feature as mcov ON (mcov.hydrocode = 'cbp6_met_coverage') LEFT JOIN dh_timeseries_weather as met ON (mcov.hydroid = met.featureid and met.varid = v.hydroid) LIMIT 1 ) select met.featureid, to_timestamp(met.tstime) as obs_date, extract(month from to_timestamp(met.tstime)) as mo, (ST_summarystats(st_clip(ST_Resample(met.rast,(SELECT rast FROM refRast)), fgeo.dh_geofield_geom), 1, TRUE)).mean as resampleMeanPrecip, (ST_summarystats(st_clip(met.rast, fgeo.dh_geofield_geom), 1, TRUE)).mean as noResampleMeanPrecip from dh_feature as f left outer join field_data_dh_geofield as fgeo on ( fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature' ) left outer join dh_variabledefinition as v on ( v.varkey = 'nldas2_obs_hourly' ) left outer join dh_feature as mcov on ( mcov.hydrocode = 'cbp6_met_coverage' ) left outer join dh_timeseries_weather as met on ( mcov.hydroid = met.featureid and met.varid = v.hydroid ) where f.hydrocode = 'usgs_ws_01669000' AND extract(year from to_timestamp(met.tsendtime)) = 2023 AND extract(month from to_timestamp(met.tsendtime)) = 1 AND extract(day from to_timestamp(met.tsendtime)) IN (16,17,19,20,21) order by met.tstime; ```
featureid obs_date mo resamplemeanprecip noresamplemeanprecip
617850 2023-01-17 05:00:00-05 1 0
617850 2023-01-17 06:00:00-05 1 0.012
617850 2023-01-17 07:00:00-05 1 0
617850 2023-01-17 08:00:00-05 1 0.1956
617850 2023-01-17 09:00:00-05 1 0.03
617850 2023-01-17 10:00:00-05 1 0.884
617850 2023-01-17 11:00:00-05 1 0.559
617850 2023-01-17 12:00:00-05 1 0.0688
617850 2023-01-17 13:00:00-05 1 0.1256
rburghol commented 4 months ago

Awesome @COBrogan -- that looks really encouraging!! Also we have to resample, anyway, so this is doubly great.

I am curious, though, could you run your NLDS to query again and include a SummaryStatsAgg(st_clip(...).count function? I want to see if it is returning 0 points, meaning the intersection doesn't work as I think it should, or if it is the combination of null and non-null values in an intersection that is resolving as null.

rburghol commented 4 months ago

Sure, rerunning the above query for Piscataway Creek using count, I get the same warning that NOTICE: All pixels of band have the NODATA value. Looking at the results below, we see the count for NLDAS prior to resampling is 0. Looking at the above figure, the 4 count for PRISM also makes it seem like ST_Clip is only getting cells that have a center point that intersects with the boundary.

SQL for Count ``` WITH refRast AS ( select met.rast FROM dh_feature as f LEFT JOIN field_data_dh_geofield as fgeo ON (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature') LEFT JOIN dh_variabledefinition as v ON (v.varkey = 'prism_mod_daily') LEFT JOIN dh_feature as mcov ON (mcov.hydrocode = 'cbp6_met_coverage') LEFT JOIN dh_timeseries_weather as met ON (mcov.hydroid = met.featureid and met.varid = v.hydroid) LIMIT 1 ) select met.featureid, to_timestamp(met.tstime) as obs_date, extract(month from to_timestamp(met.tstime)) as mo, (ST_summarystats(st_clip(ST_Resample(met.rast,(SELECT rast FROM refRast)), fgeo.dh_geofield_geom), 1, TRUE)).count as resampleMeanPrecip, (ST_summarystats(st_clip(met.rast, fgeo.dh_geofield_geom), 1, TRUE)).count as noResampleMeanPrecip from dh_feature as f left outer join field_data_dh_geofield as fgeo on ( fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature' ) left outer join dh_variabledefinition as v on ( v.varkey = 'nldas2_obs_hourly' ) left outer join dh_feature as mcov on ( mcov.hydrocode = 'cbp6_met_coverage' ) left outer join dh_timeseries_weather as met on ( mcov.hydroid = met.featureid and met.varid = v.hydroid ) where f.hydrocode = 'usgs_ws_01669000' AND extract(year from to_timestamp(met.tsendtime)) = 2023 AND extract(month from to_timestamp(met.tsendtime)) = 1 AND extract(day from to_timestamp(met.tsendtime)) IN (16,17,19,20,21) order by met.tstime; ```
featureid obs_date mo resamplemeanprecip noresamplemeanprecip
617850 2023-01-15 23:00:00-05 1 4 0
617850 2023-01-16 00:00:00-05 1 4 0
617850 2023-01-16 01:00:00-05 1 4 0
617850 2023-01-16 02:00:00-05 1 4 0

One potential fix could be convert the rasters into polygons that represent the cell size and resolution. We would only need one copy of this polygon file for each data source. We then run all intersections/clips on the polygons and then use the extent of the polygon intersection as a mask with which to clip our rasters. I'm going to see if this gives us more logical results.

rburghol commented 4 months ago

ST_Clip is only getting cells that have a center point that intersects with the boundary

@COBrogan I think the behavior is contained by the boundary not intersects with - do you concur?

rburghol commented 4 months ago

@COBrogan thanks for doing the ST_summarystats() version, looks like a clear case of small coverage as you had hypothesized, too small to contain the centroid of any of the raster cells, and evidence to suggest that it is indeed the centroid of the raster cell that counts in these summary queries. Let's figure out where in the workflow we need to blow thes rasters out to the daymet resolution and proceed.

COBrogan commented 4 months ago

Sounds good. I think we just need to update the calc_raster_ts file for the geo download step. I'm thinking of maybe doing the following, which uses WITH to create a pre-clipped raster from which we can resample all subsequent NLDAS rasters:

\set band '1'
\set varkey 'nldas2_obs_hourly'
\set hydrocode  'usgs_ws_01668000'

copy (  
WITH refRast AS (
    select st_clip(met.rast,fgeo.dh_geofield_geom)
    FROM dh_feature as f
    LEFT JOIN field_data_dh_geofield as fgeo
        ON (fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature')
    LEFT JOIN dh_variabledefinition as v
        ON (v.varkey = 'prism_mod_daily')
    LEFT JOIN dh_feature as mcov
        ON (mcov.hydrocode = 'cbp6_met_coverage')
    LEFT JOIN dh_timeseries_weather as met
        ON (mcov.hydroid = met.featureid and met.varid = v.hydroid)
    LIMIT 1
)
select met.featureid, to_timestamp(met.tsendtime) as obs_date, 
    extract(month from to_timestamp(met.tsendtime)) as mo,
    (ST_summarystats(ST_Resample(met.rast,(SELECT rast FROM refRast)), :band, TRUE)).mean as precip_mm
  from dh_feature as f
  left outer join field_data_dh_geofield as fgeo
  on (
    fgeo.entity_id = f.hydroid
    and fgeo.entity_type = 'dh_feature'
  )
  left outer join dh_variabledefinition as v
  on (
    v.varkey = :'varkey'
  )
  left outer join dh_feature as mcov
  on (
    mcov.hydrocode = 'cbp6_met_coverage'
  )
  left outer join dh_timeseries_weather as met
  on (
    mcov.hydroid = met.featureid and met.varid = v.hydroid
  )
  where f.hydrocode = :'hydrocode'
  order by met.tsendtime
) to '/tmp/tryNewNLDAS.csv' WITH HEADER CSV;
rburghol commented 3 months ago

@COBrogan Thanks a bunch for keeping this on our radar. I just reviewed this in light of our bug discussion today and your remarks about the processing times. I realize I scooted before you could finish up and so I never found out how long it took for the Acquia Creek subwatershed resample + intersection.

Can you annotate how long that took, and also did you use the USING approach that you outlined above (looked like a promising one to me). Also, in case we need alternatives that are more time efficient I am thinking this might be worth consideration:

COBrogan commented 3 months ago

@rburghol I actually had to cancel my query last night. It had taken over 20 minutes and had yet to produce values. I just reran it now using only 2022 and it took about 6 minutes to run. It's just a lot of raster manipulation, given the hourly nature of NLDAS and it seems this resample function is pretty slow (or at least how I'm using it). I can try running the full dataset again and see if its like 1 or 2 hours or like even more but just projecting out, if 1 year takes 6 minutes than the full dataset is likely to take closer to 4 hours.

Create an entry in the dh_timeseries_weather table with the NLDAS2 (or other) raster clipped to some bounding box that is sufficiently large to insure we get at least 4 cells (there are buffer() commands like any spatial GIS to increase the size, or we can try to create a rectangle around the centroid of a given size).

The bounding box for these entries to dh_timeseries_weather would be specific coverages, right? Like, in the geo workflow, these would be cropped to a buffered USGS gage drainage area? I think this may be the better way to go given how slow my method appears to be (will update this later with the full run-time). I think we could just buffer the gage drainage area by 20 km in all directions. That should capture the relevant cells, right?

rburghol commented 3 months ago

@COBrogan 20km would cover it, but I'd prefer to use something determinate, rather than arbitrarily large. There are 4 functions in postgis that get at the cell size of a given raster:

I believe that if they are not already in the same projection as our coverage, we can reproject the raster before calling st_pixelwidth(), so we can say:

select st_buffer(st_centroid(st_transform(coverage.dh_geofield_geom,4326)), 4.0 * st_pixelwidth(rast)) ...

Alternatively, we can set the buffer distance as a variable in our configuration, and make sure that we choose wisely for each data set.

COBrogan commented 3 months ago

@rburghol That makes sense to me. This methodology using the pixel or scale dimensions are certainly preferable because we may want to do the same thing for PRISM, which would not need a 20 km buffer. Looking at the example you posted regarding the buffer, is there an actual need to use st_centroid? I think st_buffer can take a geometry directly based on the documentation select st_buffer(st_centroid(st_transform(coverage.dh_geofield_geom,4326)), 4.0 * st_pixelwidth(rast)) ...

If you want to keep focusing on flushing out the geo workflow for simple_lm, I can try to get something working for this. But, I'm not sure where this would go in our met workflow. If we are applying this to specific coverages, how would met know where to find those coverages? I'm assuming these will also be stored under the same variable keys right e.g. prism_mod_daily, daymet_mod_daily, nldas_obs_hourly? Using a different key may help distinguish them as resampled, but I think it would be confusing overall and would mess with our geo workflow

Adding for documentation as I was unfamiliar with raster skew: https://en.wikipedia.org/wiki/World_file Difference between st_PixelHeight and st_ScaleY is that st_PixelHeight uses the skew/rotation of the raster to solve for the hypotenuse of the skew and the scale, leading the slightly longer true pixel height.

rburghol commented 3 months ago

@COBrogan I like your division of labor. Responses to your Qs:

rburghol commented 3 months ago

Hey @COBrogan I just had a thought: how about rather than altering calc_raster_ts, we make this a separate data source, like nldas_1km, and include the reprojection step there. For the base NLDAS2 data, if a small watershed like this has no data, then we can just include logic to disregard null values when analyzing, or set the rating to 0.0 for every timestep in the simulation?

If we do it that way, then later, we can come back and use the nldas_1km data source (or perhaps it is an analytical method) as part of our amalgamation? Otherwise we're kinda introducing different data processing methods for small watersheds, making a blurring of the distinction between our sources and methods.

COBrogan commented 3 months ago

First try:

\set band '1'
\set varkey 'nldas2_obs_hourly'
\set hydrocode  'usgs_ws_01668000'

copy (  
WITH refRast AS (
    select st_clip(met.rast,fgeo.dh_geofield_geom)
    FROM dh_feature as f
    LEFT JOIN field_data_dh_geofield as fgeo
        ON (fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature')
    LEFT JOIN dh_variabledefinition as v
        ON (v.varkey = 'daymet_mod_daily')
    LEFT JOIN dh_feature as mcov
        ON (mcov.hydrocode = 'cbp6_met_coverage')
    LEFT JOIN dh_timeseries_weather as met
        ON (mcov.hydroid = met.featureid and met.varid = v.hydroid)
    LIMIT 1
)
select met.featureid, to_timestamp(met.tsendtime) as obs_date, 
    extract(month from to_timestamp(met.tsendtime)) as mo,
    extract(year from to_timestamp(met.tsendtime)) as yr,
    (ST_summarystats(
        ST_Resample(
            ST_Clip(met.rast,st_buffer(fgeo.dh_geofield_geom, 4.0 * st_pixelwidth(met.rast))),
            (SELECT rast FROM refRast)), 
        :band, TRUE)).mean as precip_mm
  from dh_feature as f
  left outer join field_data_dh_geofield as fgeo
  on (
    fgeo.entity_id = f.hydroid
    and fgeo.entity_type = 'dh_feature'
  )
  left outer join dh_variabledefinition as v
  on (
    v.varkey = :'varkey'
  )
  left outer join dh_feature as mcov
  on (
    mcov.hydrocode = 'cbp6_met_coverage'
  )
  left outer join dh_timeseries_weather as met
  on (
    mcov.hydroid = met.featureid and met.varid = v.hydroid
  )
  where f.hydrocode = :'hydrocode'
--  AND extract(year from to_timestamp(met.tsendtime)) = 2022
  order by met.tsendtime
) to '/tmp/tryNewNLDAS.csv' WITH HEADER CSV;
rburghol commented 1 month ago

This issue has been addressed in various forms, as it is indeed a resolution issue. See #87