@Cobrogan the docs on st_clip, to my read (I could be wrong) suggest that intersection is implicit in st_clip: https://postgis.net/docs/RT_ST_Clip.html
Either way, we can control that behavior in how we call st_clip. Still, the st_clip vs st_intersects distinction could be an important one, but I am troubled by the image you showed.
Since the Culpepper gage contains the Ruckersville gage, I would expect there to be many more cells overlapping than what you're showing -- I might just be getting confused by the image blocks tho.
That said, at the very least I would expect the Culpepper gage to be less likely than the Ruckersville to have no data problems. I suspect maybe I goofed something up in the query during some recent tidy sessions.
Which makes me wonder: what do the `daymet` and `prism` containments look like?
From @COBrogan I'll draft up some figures showing `daymet` and `prism` containments. I've tried messing around with `ST_clip`, but setting `crop` to `TRUE` or `FALSE` yields the same results. My understanding is it just affects the extent of the raster that is returned, with the "NoData" value being used to populate cells outside of the intersection. `ST_clip` is the more efficient version of `ST_intersection`, but I still haven't found how to get the bordering cells to be included.
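For reference, the sort of comparison I've been running looks roughly like this (a sketch only, reusing the join pattern and the illustrative point from my queries later in this thread):

```
-- sketch: compare crop = TRUE vs crop = FALSE for the same raster/geometry pair at one point;
-- in my testing both return the same (NODATA) value when the point's cell is not picked up by the clip
select ST_Value(ST_Clip(met.rast, 1, fgeo.dh_geofield_geom, 9999, TRUE),
                ST_SetSRID(ST_Point(-77.5, 38.5), 4326)) as crop_true_val,
       ST_Value(ST_Clip(met.rast, 1, fgeo.dh_geofield_geom, 9999, FALSE),
                ST_SetSRID(ST_Point(-77.5, 38.5), 4326)) as crop_false_val
from dh_feature as f
left outer join field_data_dh_geofield as fgeo
  on (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature')
left outer join dh_variabledefinition as v
  on (v.varkey = 'nldas2_obs_hourly')
left outer join dh_feature as mcov
  on (mcov.hydrocode = 'cbp6_met_coverage')
left outer join dh_timeseries_weather as met
  on (mcov.hydroid = met.featureid and met.varid = v.hydroid)
where f.hydrocode = 'usgs_ws_01660400'
limit 1;
```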
Interestingly, resampling the raster to the PRISM resolution also yielded no results. So, these figures showing the containments should help identify if something is going wrong in ST_clip or if we are facing a resolution problem:
OK - resampling is a good idea, but I am still unclear why this is happening at Culpepper but not Ruckersville. Wonder if something is wrong with the geometry? Like self-intersection or something? The intersection algorithm is very inclusive, it literally just says there is some overlap.
Oh, one question: does "fullrasterpoint" actually just mean a single raster cell value intersecting the point you specified?
Plus @COBrogan I really think it is impossible that Culpepper contains no full pixels -- it's a drainage area of 468 sq mi, it has to contain some 15x15 km cells... maybe our projection is off. Below is the USGS map:
From @COBrogan Well, that's one mystery solved. I've been looking at 01660400, which is Aquia Creek Near Garrisonville and has a DA of only 35 sq mi (see here for its NWIS page). So it is NOT the Culpepper gage that is giving us trouble, but something much smaller. My bad on the wild goose chase there.
As for the potential conflicts in ST_clip for smaller watersheds, see below. Dashed blue is daymet, red is PRISM, and solid black is NLDAS. We can see that there are no NLDAS cells fully contained in the watershed so ST_clip() will not return values for NLDAS.
OK cool, thanks for sorting that out!! @COBrogan
The thing is though, fully contained is not the way it is supposed to be happening. Intersect literally just means their boundaries cross.
But, re-sampling to the smallest grid we have should do the trick.
Still troubling, however, is that @ilonah22 reported getting no data values for Fredericksburg…
I agree with your assertion, @rburghol. Maybe something else IS going on. First, while I've seen other users on SO with the same problem, I've yet to see official documentation of ST_clip() that implies this is the intended behavior. Second, the resampling failed to bring results. And, switching to daymet below, my query returns logical results, which indicates that my testing method works:
featureid | obs_date | mo | fullrastpoint | cliprastpoint |
---|---|---|---|---|
617850 | 2023-01-16 07:00:00-05 | 1 | 1.0297999382019043 | 1.0297999382019043 |
@rburghol ......I was using an incorrect set of coordinates. Resampling the NLDAS raster to PRISM results in me getting the same results at a given point. Without resampling, I get no data.
featureid | obs_date | mo | fullrastpoint | resamplecliprast | noresamplecliprast |
---|---|---|---|---|---|
617850 | 2023-01-16 23:00:00-05 | 1 | 0 | 0 | |
617850 | 2023-01-17 00:00:00-05 | 1 | 0 | 0 | |
617850 | 2023-01-17 01:00:00-05 | 1 | 0 | 0 | |
617850 | 2023-01-17 02:00:00-05 | 1 | 0 | 0 | |
617850 | 2023-01-17 03:00:00-05 | 1 | 0 | 0 | |
617850 | 2023-01-17 04:00:00-05 | 1 | 0.0222 | 0.0222 | |
617850 | 2023-01-17 05:00:00-05 | 1 | 0.0428 | 0.0428 | |
617850 | 2023-01-17 06:00:00-05 | 1 | 0.1352 | 0.1352 |
Taking this a step further, we can successfully get mean precipitation from the resampled raster:
featureid | obs_date | mo | precip_kgm3 |
---|---|---|---|
617850 | 2023-01-17 00:00:00-05 | 1 | 0 |
617850 | 2023-01-17 01:00:00-05 | 1 | 0 |
617850 | 2023-01-17 02:00:00-05 | 1 | 0 |
617850 | 2023-01-17 03:00:00-05 | 1 | 0 |
617850 | 2023-01-17 04:00:00-05 | 1 | 0.07275999999999999 |
617850 | 2023-01-17 05:00:00-05 | 1 | 0.09404 |
617850 | 2023-01-17 06:00:00-05 | 1 | 0.22464 |
617850 | 2023-01-17 07:00:00-05 | 1 | 0.4828799999999999 |
617850 | 2023-01-17 08:00:00-05 | 1 | 0.6930399999999999 |
Looking at a second watershed that has previously given us trouble brings us to usgs_ws_01669000 PISCATAWAY CREEK NEAR TAPPAHANNOCK, VA:
Once again, our typical NLDAS summary method using `ST_clip` returns no data. But it does as soon as we resample it to the PRISM resolution:
featureid | obs_date | mo | resamplemeanprecip | noresamplemeanprecip |
---|---|---|---|---|
617850 | 2023-01-17 05:00:00-05 | 1 | 0 | |
617850 | 2023-01-17 06:00:00-05 | 1 | 0.012 | |
617850 | 2023-01-17 07:00:00-05 | 1 | 0 | |
617850 | 2023-01-17 08:00:00-05 | 1 | 0.1956 | |
617850 | 2023-01-17 09:00:00-05 | 1 | 0.03 | |
617850 | 2023-01-17 10:00:00-05 | 1 | 0.884 | |
617850 | 2023-01-17 11:00:00-05 | 1 | 0.559 | |
617850 | 2023-01-17 12:00:00-05 | 1 | 0.0688 | |
617850 | 2023-01-17 13:00:00-05 | 1 | 0.1256 |
Awesome @COBrogan -- that looks really encouraging!! Also we have to resample, anyway, so this is doubly great.
I am curious, though: could you run your NLDAS query again and include a `SummaryStatsAgg(st_clip(...)).count` function? I want to see if it is returning 0 points, meaning the intersection doesn't work as I think it should, or if it is the combination of null and non-null values in an intersection that is resolving as null.
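Something along these lines, maybe (just a sketch reusing the join pattern from the queries above; the column aliases are mine):

```
-- sketch: report both cell count and mean from the clipped raster, so a count of 0
-- distinguishes "no cells selected" from "cells selected but all NODATA"
select met.featureid, to_timestamp(met.tsendtime) as obs_date,
  (ST_SummaryStats(ST_Clip(met.rast, 1, fgeo.dh_geofield_geom, 9999), 1, TRUE)).count as clip_count,
  (ST_SummaryStats(ST_Clip(met.rast, 1, fgeo.dh_geofield_geom, 9999), 1, TRUE)).mean  as clip_mean
from dh_feature as f
left outer join field_data_dh_geofield as fgeo
  on (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature')
left outer join dh_variabledefinition as v
  on (v.varkey = 'nldas2_obs_hourly')
left outer join dh_feature as mcov
  on (mcov.hydrocode = 'cbp6_met_coverage')
left outer join dh_timeseries_weather as met
  on (mcov.hydroid = met.featureid and met.varid = v.hydroid)
where f.hydrocode = 'usgs_ws_01669000'
order by met.tsendtime;
```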
Sure. Rerunning the above query for Piscataway Creek using count, I get the same warning: `NOTICE: All pixels of band have the NODATA value`. Looking at the results below, we see the count for NLDAS prior to resampling is 0. Looking at the above figure, the count of 4 for PRISM also makes it seem like ST_Clip is only getting cells that have a center point that intersects with the boundary.
featureid | obs_date | mo | resamplecount | noresamplecount |
---|---|---|---|---|
617850 | 2023-01-15 23:00:00-05 | 1 | 4 | 0 |
617850 | 2023-01-16 00:00:00-05 | 1 | 4 | 0 |
617850 | 2023-01-16 01:00:00-05 | 1 | 4 | 0 |
617850 | 2023-01-16 02:00:00-05 | 1 | 4 | 0 |
One potential fix could be to convert the rasters into polygons that represent the cell size and resolution. We would only need one copy of this polygon file for each data source. We then run all intersections/clips on the polygons and then use the extent of the polygon intersection as a mask with which to clip our rasters. I'm going to see if this gives us more logical results.
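A rough sketch of what I'm picturing (the `nldas_cell_polygons` table name is just a placeholder, and the first query would need to be filtered to an actual NLDAS raster rather than an arbitrary `limit 1`):

```
-- one-time step: dump one representative NLDAS raster into one polygon per cell
-- (nldas_cell_polygons is a placeholder working-table name)
create table nldas_cell_polygons as
select (pp).geom as cell_geom, (pp).x as cell_x, (pp).y as cell_y
from (
  select ST_PixelAsPolygons(met.rast, 1, FALSE) as pp  -- FALSE: keep NODATA cells so the full grid is represented
  from (
    select rast from dh_timeseries_weather limit 1  -- in practice, filter to a single NLDAS raster here
  ) as met
) as cells;

-- later: the union of cell polygons that intersect the watershed becomes the clip mask,
-- so bordering cells are kept even when their centers fall outside the boundary
select ST_Clip(met.rast, 1, mask.geom, 9999, TRUE) as clipped
from dh_timeseries_weather as met,
     (select ST_Union(cp.cell_geom) as geom
        from nldas_cell_polygons as cp
        join dh_feature as f
          on (f.hydrocode = 'usgs_ws_01660400')
        join field_data_dh_geofield as fgeo
          on (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature')
       where ST_Intersects(cp.cell_geom, fgeo.dh_geofield_geom)
     ) as mask
limit 1;
```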
ST_Clip is only getting cells that have a center point that intersects with the boundary
@COBrogan I think the behavior is "contained by" the boundary, not "intersects with" -- do you concur?
@COBrogan thanks for doing the `ST_summarystats()` version. Looks like a clear case of small coverage, as you had hypothesized -- too small to contain the centroid of any of the raster cells -- and evidence to suggest that it is indeed the centroid of the raster cell that counts in these summary queries. Let's figure out where in the workflow we need to blow these rasters out to the `daymet` resolution and proceed.
Sounds good. I think we just need to update the `calc_raster_ts` file for the geo download step. I'm thinking of maybe doing the following, which uses `WITH` to create a pre-clipped raster from which we can resample all subsequent NLDAS rasters:
```
\set band '1'
\set varkey 'nldas2_obs_hourly'
\set hydrocode 'usgs_ws_01668000'
copy (
-- refRast pulls a single PRISM raster (clipped to a feature geometry) to serve as the
-- reference grid for ST_Resample below
WITH refRast AS (
select st_clip(met.rast,fgeo.dh_geofield_geom) as rast
FROM dh_feature as f
LEFT JOIN field_data_dh_geofield as fgeo
ON (fgeo.entity_id = f.hydroid
and fgeo.entity_type = 'dh_feature')
LEFT JOIN dh_variabledefinition as v
ON (v.varkey = 'prism_mod_daily')
LEFT JOIN dh_feature as mcov
ON (mcov.hydrocode = 'cbp6_met_coverage')
LEFT JOIN dh_timeseries_weather as met
ON (mcov.hydroid = met.featureid and met.varid = v.hydroid)
LIMIT 1
)
-- resample each hourly NLDAS raster to the reference grid, then summarize over the coverage
select met.featureid, to_timestamp(met.tsendtime) as obs_date,
extract(month from to_timestamp(met.tsendtime)) as mo,
(ST_summarystats(ST_Resample(met.rast,(SELECT rast FROM refRast)), :band, TRUE)).mean as precip_mm
from dh_feature as f
left outer join field_data_dh_geofield as fgeo
on (
fgeo.entity_id = f.hydroid
and fgeo.entity_type = 'dh_feature'
)
left outer join dh_variabledefinition as v
on (
v.varkey = :'varkey'
)
left outer join dh_feature as mcov
on (
mcov.hydrocode = 'cbp6_met_coverage'
)
left outer join dh_timeseries_weather as met
on (
mcov.hydroid = met.featureid and met.varid = v.hydroid
)
where f.hydrocode = :'hydrocode'
order by met.tsendtime
) to '/tmp/tryNewNLDAS.csv' WITH HEADER CSV;
```
@COBrogan Thanks a bunch for keeping this on our radar. I just reviewed this in light of our bug discussion today and your remarks about the processing times. I realize I scooted before you could finish up, so I never found out how long the Aquia Creek subwatershed resample + intersection took.
Can you annotate how long that took, and also did you use the `USING` approach that you outlined above (it looked like a promising one to me)? Also, in case we need alternatives that are more time efficient, I am thinking this might be worth consideration:

- Create an entry in the `dh_timeseries_weather` table with the NLDAS2 (or other) raster clipped to some bounding box that is sufficiently large to insure we get at least 4 cells (there are buffer() commands like any spatial GIS to increase the size, or we can try to create a rectangle around the centroid of a given size).
- Add a `dh_property` to the weather data entry to indicate whether or not it has been clipped (and resampled), although that would only be for accounting, because if we create our workflow steps correctly, there will be no penalty for rerunning the clip step on an already clipped raster, other than loss of time.

@rburghol I actually had to cancel my query last night. It had taken over 20 minutes and had yet to produce values. I just reran it now using only 2022 and it took about 6 minutes to run. It's just a lot of raster manipulation, given the hourly nature of NLDAS, and it seems this resample function is pretty slow (or at least how I'm using it). I can try running the full dataset again and see if it's like 1 or 2 hours or even more, but just projecting out: if 1 year takes 6 minutes, then the full dataset is likely to take closer to 4 hours.
Create an entry in the dh_timeseries_weather table with the NLDAS2 (or other) raster clipped to some bounding box that is sufficiently large to insure we get at least 4 cells (there are buffer() commands like any spatial GIS to increase the size, or we can try to create a rectangle around the centroid of a given size).
The bounding box for these entries to `dh_timeseries_weather` would be specific coverages, right? Like, in the `geo` workflow, these would be cropped to a buffered USGS gage drainage area? I think this may be the better way to go given how slow my method appears to be (will update this later with the full run-time). I think we could just buffer the gage drainage area by 20 km in all directions. That should capture the relevant cells, right?
@COBrogan 20 km would cover it, but I'd prefer to use something determinate rather than arbitrarily large. There are 4 functions in postgis that get at the cell size of a given raster: `ST_PixelWidth()`, `ST_PixelHeight()`, `ST_ScaleX()`, and `ST_ScaleY()`. These are almost interchangeable; in fact, the documentation for `ST_ScaleY()` actually says "Returns the Y component of the pixel height in units of coordinate reference system". I say almost because the examples in the docs show that they are identical if the raster has no skew, but can have slightly different results if the skew is non-zero. I believe that if they are not already in the same projection as our coverage, we can reproject the raster before calling st_pixelwidth(), so we can say:
select st_buffer(st_centroid(st_transform(coverage.dh_geofield_geom,4326)), 4.0 * st_pixelwidth(rast)) ...
Alternatively, we can set the buffer distance as a variable in our configuration, and make sure that we choose wisely for each data set.
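Fleshed out a bit, that could look roughly like this (a sketch only, reusing the join pattern from the queries above and assuming we transform the geometry into the raster's SRS so the buffer distance and pixel width are in the same units):

```
-- sketch: clip the NLDAS raster to the watershed centroid buffered by a few pixel widths,
-- so the stored raster is guaranteed to carry at least the handful of cells around the gage
select ST_Clip(met.rast, 1,
               ST_Buffer(ST_Centroid(ST_Transform(fgeo.dh_geofield_geom, ST_SRID(met.rast))),
                         4.0 * ST_PixelWidth(met.rast)),
               9999, TRUE) as clipped
from dh_feature as f
left outer join field_data_dh_geofield as fgeo
  on (fgeo.entity_id = f.hydroid and fgeo.entity_type = 'dh_feature')
left outer join dh_variabledefinition as v
  on (v.varkey = 'nldas2_obs_hourly')
left outer join dh_feature as mcov
  on (mcov.hydrocode = 'cbp6_met_coverage')
left outer join dh_timeseries_weather as met
  on (mcov.hydroid = met.featureid and met.varid = v.hydroid)
where f.hydrocode = 'usgs_ws_01660400'
limit 1;
```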
@rburghol That makes sense to me. This methodology using the `pixel` or `scale` dimensions is certainly preferable, because we may want to do the same thing for PRISM, which would not need a 20 km buffer. Looking at the example you posted regarding the buffer, is there an actual need to use `st_centroid`? I think `st_buffer` can take a geometry directly based on the documentation:
select st_buffer(st_centroid(st_transform(coverage.dh_geofield_geom,4326)), 4.0 * st_pixelwidth(rast)) ...
If you want to keep focusing on fleshing out the `geo` workflow for `simple_lm`, I can try to get something working for this. But I'm not sure where this would go in our `met` workflow. If we are applying this to specific coverages, how would `met` know where to find those coverages? I'm assuming these will also be stored under the same variable keys, right? e.g. `prism_mod_daily`, `daymet_mod_daily`, `nldas_obs_hourly`? Using a different key may help distinguish them as resampled, but I think it would be confusing overall and would mess with our `geo` workflow.
Adding for documentation, as I was unfamiliar with raster skew: https://en.wikipedia.org/wiki/World_file

The difference between `st_PixelHeight` and `st_ScaleY` is that `st_PixelHeight` uses the skew/rotation of the raster to solve for the hypotenuse of the skew and the scale, leading to a slightly longer true pixel height.
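A quick way to see that relationship against one of our rasters (a sketch; I'm not certain which skew component pairs with which scale, so both combinations are shown):

```
-- sketch: compare ST_PixelHeight to the y-scale alone and to the hypotenuse of scale and skew;
-- with zero skew all of these collapse to abs(ST_ScaleY)
select ST_PixelHeight(rast)                        as pixel_height,
       abs(ST_ScaleY(rast))                        as scale_y_only,
       sqrt(ST_ScaleY(rast)^2 + ST_SkewY(rast)^2)  as hyp_scaley_skewy,
       sqrt(ST_ScaleY(rast)^2 + ST_SkewX(rast)^2)  as hyp_scaley_skewx
from dh_timeseries_weather
limit 1;
```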
@COBrogan I like your division of labor. Responses to your Qs:
- `calc_raster_ts`? Since we need this to do the actual precip summary.
- Either in the `geo` workflow, or REPEAT the process in the `amalgamate` workflow:
  - `geo/download` (or alternative `geo/import`) if we want to create a separate, small raster for each coverage and data-source
  - `amalgamate/process` if we are concerned about storage size, and only want to store the best-fit rasters that we will eventually sew together during the `amalgamate/import` step.

Hey @COBrogan I just had a thought: how about rather than altering `calc_raster_ts`, we make this a separate data source, like `nldas_1km`, and include the reprojection step there. For the base NLDAS2 data, if a small watershed like this has no data, then we can just include logic to disregard null values when analyzing, or set the rating to 0.0 for every timestep in the simulation?
, we make this a separate data source, like nldas_1km
, and include the reprojection step there. For the base NLDAS2 data, if a small watershed like this has no data, then we can just include logic to disregard null values when analyzing, or set the rating to 0.0 for every timestep in the simulation?
If we do it that way, then later, we can come back and use the nldas_1km
data source (or perhaps it is an analytical method) as part of our amalgamation? Otherwise we're kinda introducing different data processing methods for small watersheds, making a blurring of the distinction between our sources and methods.
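Roughly something like this (a sketch only -- `nldas_1km` would need its own row in `dh_variabledefinition`, and I'm assuming `featureid`, `varid`, `tstime`, `tsendtime`, and `rast` are the relevant columns of `dh_timeseries_weather`):

```
-- sketch: write resampled NLDAS rasters back to dh_timeseries_weather under a new varkey
insert into dh_timeseries_weather (featureid, varid, tstime, tsendtime, rast)
select met.featureid,
       (select hydroid from dh_variabledefinition where varkey = 'nldas_1km') as varid,
       met.tstime,
       met.tsendtime,
       ST_Resample(met.rast, ref.rast) as rast
from dh_timeseries_weather as met
join dh_variabledefinition as v
  on (v.varkey = 'nldas2_obs_hourly' and met.varid = v.hydroid)
cross join (
  -- any single finer-resolution raster (daymet here) serves as the target grid
  select d.rast
  from dh_timeseries_weather as d
  join dh_variabledefinition as dv
    on (dv.varkey = 'daymet_mod_daily' and d.varid = dv.hydroid)
  limit 1
) as ref;
```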
First try:
```
\set band '1'
\set varkey 'nldas2_obs_hourly'
\set hydrocode 'usgs_ws_01668000'
copy (
-- refRast pulls a single daymet raster (clipped to a feature geometry) to serve as the
-- reference grid for ST_Resample below
WITH refRast AS (
select st_clip(met.rast,fgeo.dh_geofield_geom) as rast
FROM dh_feature as f
LEFT JOIN field_data_dh_geofield as fgeo
ON (fgeo.entity_id = f.hydroid
and fgeo.entity_type = 'dh_feature')
LEFT JOIN dh_variabledefinition as v
ON (v.varkey = 'daymet_mod_daily')
LEFT JOIN dh_feature as mcov
ON (mcov.hydrocode = 'cbp6_met_coverage')
LEFT JOIN dh_timeseries_weather as met
ON (mcov.hydroid = met.featureid and met.varid = v.hydroid)
LIMIT 1
)
-- clip each NLDAS raster to the watershed buffered by a few pixel widths, resample it to
-- the reference grid, then summarize over the coverage
select met.featureid, to_timestamp(met.tsendtime) as obs_date,
extract(month from to_timestamp(met.tsendtime)) as mo,
extract(year from to_timestamp(met.tsendtime)) as yr,
(ST_summarystats(
  ST_Resample(
    ST_Clip(met.rast,st_buffer(fgeo.dh_geofield_geom, 4.0 * st_pixelwidth(met.rast))),
    (SELECT rast FROM refRast)),
  :band, TRUE)).mean as precip_mm
from dh_feature as f
left outer join field_data_dh_geofield as fgeo
on (
fgeo.entity_id = f.hydroid
and fgeo.entity_type = 'dh_feature'
)
left outer join dh_variabledefinition as v
on (
v.varkey = :'varkey'
)
left outer join dh_feature as mcov
on (
mcov.hydrocode = 'cbp6_met_coverage'
)
left outer join dh_timeseries_weather as met
on (
mcov.hydroid = met.featureid and met.varid = v.hydroid
)
where f.hydrocode = :'hydrocode'
-- AND extract(year from to_timestamp(met.tsendtime)) = 2022
order by met.tsendtime
) to '/tmp/tryNewNLDAS.csv' WITH HEADER CSV;
```
This issue has been addressed in various forms, as it is indeed a resolution issue. See #87
Trying to track down the issue with NLDAS not generating precip in the command: ( _Originally posted by @COBrogan in #57_)
The error keeps saying that all pixels in the area have the NoData value (`NOTICE: All pixels of band have the NODATA value`). But this does not appear to be the case. If we look at a specific point within the watershed on a day in which we know there is rain in Aquia Creek (usgs_ws_01660400), we can see that NLDAS should have data for this watershed based on the full NLDAS rasters for that day. However, when we clip the raster to just the gage, the clipped raster no longer has data:
SQL:
```
select met.featureid, to_timestamp(met.tstime) as obs_date,
  extract(month from to_timestamp(met.tstime)) as mo,
  ST_Value(met.rast, ST_SetSRID(ST_Point(-77.5, 38.5),4326)) as fullRastPoint,
  ST_Value(st_clip(met.rast, 1, fgeo.dh_geofield_geom,9999), ST_SetSRID(ST_Point(-77.5, 38.5),4326)) as clipRastPoint
from dh_feature as f
left outer join field_data_dh_geofield as fgeo
on (
  fgeo.entity_id = f.hydroid
  and fgeo.entity_type = 'dh_feature'
)
left outer join dh_variabledefinition as v
on (
  v.varkey = 'nldas2_obs_hourly'
)
left outer join dh_feature as mcov
on (
  mcov.hydrocode = 'cbp6_met_coverage'
)
left outer join dh_timeseries_weather as met
on (
  mcov.hydroid = met.featureid and met.varid = v.hydroid
)
where f.hydrocode = 'usgs_ws_01660400'
  AND extract(year from to_timestamp(met.tsendtime)) = 2023
  AND extract(month from to_timestamp(met.tsendtime)) = 1
  AND extract(day from to_timestamp(met.tsendtime)) IN (17)
order by met.tstime;
```
This is NOT the case for Strasburg. At hydrocode usgs_ws_01634000, we can see that a specific point in the watershed shows equal rainfall in both the full and clipped rasters. This makes me think something has gone wrong with our ST_clip command, but I haven't tracked it down yet.
SQL:
```
select met.featureid, to_timestamp(met.tstime) as obs_date,
  extract(month from to_timestamp(met.tstime)) as mo,
  ST_Value(met.rast, ST_SetSRID(ST_Point(-78.7, 38.8),4326)) as fullRastPoint,
  ST_Value(st_clip(met.rast, 1, fgeo.dh_geofield_geom,9999), ST_SetSRID(ST_Point(-78.7, 38.8),4326)) as clipRastPoint
from dh_feature as f
left outer join field_data_dh_geofield as fgeo
on (
  fgeo.entity_id = f.hydroid
  and fgeo.entity_type = 'dh_feature'
)
left outer join dh_variabledefinition as v
on (
  v.varkey = 'nldas2_obs_hourly'
)
left outer join dh_feature as mcov
on (
  mcov.hydrocode = 'cbp6_met_coverage'
)
left outer join dh_timeseries_weather as met
on (
  mcov.hydroid = met.featureid and met.varid = v.hydroid
)
where f.hydrocode = 'usgs_ws_01634000'
  AND extract(year from to_timestamp(met.tsendtime)) = 2023
  AND extract(month from to_timestamp(met.tsendtime)) = 1
  AND extract(day from to_timestamp(met.tsendtime)) IN (17)
order by met.tstime;
```
It's clear that our clip at the very least generates a raster of the shape that we'd generally expect:
SQL:
```
select ST_AsText(ST_Envelope(met.rast)),
  ST_AsText(ST_Envelope(st_clip(met.rast, 1, fgeo.dh_geofield_geom,9999,TRUE)))
from dh_feature as f
left outer join field_data_dh_geofield as fgeo
on (
  fgeo.entity_id = f.hydroid
  and fgeo.entity_type = 'dh_feature'
)
left outer join dh_variabledefinition as v
on (
  v.varkey = 'nldas2_obs_hourly'
)
left outer join dh_feature as mcov
on (
  mcov.hydrocode = 'cbp6_met_coverage'
)
left outer join dh_timeseries_weather as met
on (
  mcov.hydroid = met.featureid and met.varid = v.hydroid
)
where f.hydrocode = 'usgs_ws_01660400'
  AND extract(year from to_timestamp(met.tsendtime)) = 2023
  AND extract(month from to_timestamp(met.tsendtime)) = 1
  AND extract(day from to_timestamp(met.tsendtime)) IN (17)
order by met.tstime
LIMIT 1;
```
With some more research, my guess is that `ST_clip` doesn't grab partial cells from the raster - that is, it only grabs cells that are contained in the polygon, or at least whose centroid is inside it. See here. My gage is in between several cells. The below image shows only the extent of the clipped raster from above on January 17th and shows that our site doesn't contain any full pixels. If I use `ST_intersection`, I start to see values at our site. I imagine I would get similar results with `ST_DumpAsPolygons`, but I'm guessing that will be computationally slow. `ST_intersection` is already pretty slow based on my testing:
SQL:
```
select met.featureid, to_timestamp(met.tstime) as obs_date,
  extract(month from to_timestamp(met.tstime)) as mo,
  (ST_Intersection(met.rast, fgeo.dh_geofield_geom)).val AS clipVal
FROM dh_feature as f
LEFT JOIN field_data_dh_geofield as fgeo
ON (
  fgeo.entity_id = f.hydroid
  and fgeo.entity_type = 'dh_feature'
)
LEFT JOIN dh_variabledefinition as v
ON (
  v.varkey = 'nldas2_obs_hourly'
)
LEFT JOIN dh_feature as mcov
ON (
  mcov.hydrocode = 'cbp6_met_coverage'
)
LEFT JOIN dh_timeseries_weather as met
ON (
  mcov.hydroid = met.featureid and met.varid = v.hydroid
)
WHERE f.hydrocode = 'usgs_ws_01660400'
  AND extract(year from to_timestamp(met.tsendtime)) = 2023
  AND extract(month from to_timestamp(met.tsendtime)) = 1
  AND extract(day from to_timestamp(met.tsendtime)) IN (17)
ORDER BY met.tstime;
```