HARPgroup / model_meteorology


Workflow `amalgamate` #66

Open rburghol opened 3 months ago

rburghol commented 3 months ago

This will produce a full-domain coverage raster of amalgamated daily best-fit rasters and store it in the database for later use by coverage model time series input routines in the `wdm` workflow.

Steps

Info

COBrogan commented 1 month ago

First attempt at amalgamating a raster. Pulls in clipped, resampled rasters for daymet and PRISM. The "best" dataset below is picked randomly; this will need to be updated as ratings are added to REST. Unions NLDAS rasters prior to resampling and clipping, which is the longest part of the query, adding roughly 10 seconds. For one day, this query takes about 15 seconds. Scaling this to the full time period would take about 60 hours (15 sec/day × 365 days/yr × 39 yr ÷ 3600 sec/hr ≈ 59 hours). To-do:
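The ~60-hour figure is just arithmetic on the per-day timing, and can be checked directly in psql:

```sql
-- back-of-envelope scaling of the 15 sec/day query to 39 years
select 15 * 365 * 39 / 3600.0 as est_hours;
-- est_hours = 59.3125, i.e. roughly 60 hours
```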

--Given we have a file that indicates which dataset performed best, we should have
--three different datasets defined via WITH, grabbing and clipping coverages based on the file

--Arbitrarily pick a tsendtime to practice with, here February 18, 2020.
--Note that with NLDAS this will pick an arbitrary hour and we need the full 24-hour set
\set tsendin '1582027200'
\set resample_varkey 'daymet_mod_daily'
-- sets all integer feature and varid with query 
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset

--Grab the USGS full drainage geometries/coverages and assign ratings to indicate best
--performing precip dataset
WITH usgsCoverage as (
    SELECT f.*,
    --Join in ratings. Until ratings are put in db via REST, let's
    --use a random integer between 0 (NLDAS) and 2 (daymet)
    floor(random() * (2-0+1) + 0)::int as dataID,
    --Add area of watershed/coverage for reference and to order from downstream to upstream
    ST_AREA(fgeo.dh_geofield_geom) as covArea,
    fgeo.dh_geofield_geom as dh_geofield_geom
    FROM dh_feature as f
    LEFT JOIN field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature' 
    ) 
    WHERE f.bundle = 'watershed' AND f.ftype = 'usgs_full_drainage'
    ORDER BY covArea DESC
),
--Get the geometry and feature fields for the full coverage based on the covid variable gset above. 
--This will be used to create a resampled NLDAS for the day
fullCoverage as (
    SELECT f.*,fgeo.dh_geofield_geom
    FROM dh_feature as f
    LEFT JOIN field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature' 
    ) 
    WHERE f.hydroid = :'covid'
),
--Where PRISM is the best performing dataset, grab the appropriate
--daily raster from dh_timeseries_weather and resample to target resolution
--and then clip to watershed boundaries
prism as (
    SELECT cov.*,
    met.featureid,met.tsendtime,
    st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
    FROM usgsCoverage as cov
    JOIN(
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid) 
        where b.varkey='prism_mod_daily'
            and met.featureid = :covid
            and met.tsendtime = :'tsendin'
    ) AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
    LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
    WHERE cov.dataID = 1
),
--Where daymet is the best performing dataset, grab the appropriate
--daily raster from dh_weather_timeseries and resample to target resolution
--and then clip to watershed boundaries
daymet as (
    SELECT cov.*,
    met.featureid,met.tsendtime,
    st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
    FROM usgsCoverage as cov
    JOIN(
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid) 
        where b.varkey='daymet_mod_daily'
            and met.featureid = :covid
            and met.tsendtime = :'tsendin'
    ) AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
    LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
    WHERE cov.dataID = 2
),
--Union all NLDAS rasters for the day to get the sum of NLDAS for the day
nldasFullDay AS (
    SELECT st_union(met.rast,'sum') as rast
    FROM (
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid) 
        where b.varkey='nldas2_precip_hourly_tiled_16x16'
            and met.featureid = :covid
            and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
            and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
            and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
    ) AS met
),
nldasFullDayResamp AS (
    select st_resample(met.rast,rt.rast) as rast
    FROM fullCoverage as f
    JOIN nldasFullDay as met
    ON ST_ConvexHull(met.rast) && f.dh_geofield_geom
    LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
),
--Union all NLDAS rasters for the day, intersecting by the usgsCoverage geometries
--to leverage the tiled NLDAS rasters. The end result is a raster for each coverage 
--where NLDAS is the most highly rated that is of the full day's dataset, 
--but clipped to only intersecting tiles
nldasDay as (
    SELECT cov.hydroid, cov.hydrocode,
    cov.ftype, cov.bundle, cov.name,
    :'tsendin' as tsendtime,
    st_union(met.rast,'sum') as rast
    FROM usgsCoverage as cov
    JOIN(
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid) 
        where b.varkey='nldas2_precip_hourly_tiled_16x16'
            and met.featureid = :covid
            and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
            and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
            and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
    ) AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
    WHERE cov.dataID = 0
    GROUP BY cov.hydroid, cov.hydrocode, cov.ftype,
        cov.bundle, cov.name
),
--Now, using the union of NLDAS hourly data in nldasDay, resample to the template raster and clip to each 
--watershed where NLDAS is rated the best via an INNER JOIN and the WHERE in nldasDay
nldas as (
    SELECT cov.*,met.tsendtime,
    st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
    FROM usgsCoverage as cov
    INNER JOIN nldasDay as met
    on cov.hydroid = met.hydroid
    LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
),
--For each feature in usgsCoverage, find the non-NULL summary dataset (which will represent the best rated dataset)
amalgamate as (
    select cov.*, COALESCE(prismMet.rast,daymetMet.rast,nldasMet.rast) as rast
    FROM usgsCoverage as cov
    LEFT JOIN nldas as nldasMet
    on cov.hydroid = nldasMet.hydroid
    LEFT JOIN prism as prismMet
    on cov.hydroid = prismMet.hydroid
    LEFT JOIN daymet as daymetMet
    on cov.hydroid = daymetMet.hydroid
),
--Union the best rated datasets together. Since the data is sorted by drainage area, 
--upstream areas will populate after larger, downstream coverages
amalgamateUnion as (
    SELECT ST_union(amalgamate.rast) as rast
    FROM usgsCoverage as cov
    LEFT JOIN amalgamate as amalgamate
    on cov.hydroid = amalgamate.hydroid
)
--Use a full union to create a column with the amalgamateUnion raster and the nldasFullDayResamp raster
--Then, union the rasters to get a raster in which the "background" is the nldasFullDayResamp and everything else is 
--the best fit raster
SELECT ST_union(fullUnion.rast) as rast
FROM (
    SELECT rast FROM nldasFullDayResamp
    UNION ALL
    SELECT rast FROM amalgamateUnion 
) as fullUnion;
COBrogan commented 1 month ago
Testing the above raster. Using random numbers to assign the best-fit dataset (for now):

 hydroid | hydrocode        | ftype              | bundle    | name                           | dataid | covarea
---------+------------------+--------------------+-----------+--------------------------------+--------+---------
  289982 | usgs_ws_02029000 | usgs_full_drainage | watershed | JAMES RIVER AT SCOTTSVILLE, VA |      0 |  1.2134
  290060 | usgs_ws_03176500 | usgs_full_drainage | watershed | NEW RIVER AT GLEN LYN, VA      |      2 |  0.9902
  290065 | usgs_ws_02026000 | usgs_full_drainage | watershed | JAMES RIVER AT BENT CREEK, VA  |      1 |  0.9666

So, let's find the mean precipitation in this watershed using the resampled PRISM data from dh_timeseries_weather

Check `dh_timeseries_weather` PRISM data:

```
\set hydrocode 'usgs_ws_02026000'
\set band '1'
\set tsendin '1582027200'
--\set varkey 'nldas2_precip_hourly_tiled_16x16'
\set varkey 'prism_mod_daily'
--\set varkey 'daymet_mod_daily'
\set resample_varkey 'daymet_mod_daily'
-- sets all integer feature and varid with query
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset
WITH usgs_features AS (
    SELECT * FROM dh_feature WHERE hydrocode = :'hydrocode'
),
metUnion as (
    Select met.featureid, st_union(met.rast,'sum') as rast
    FROM usgs_features as f
    left outer join field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature'
    )
    JOIN(
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid)
        where b.varkey=:'varkey'
            and met.featureid = :covid
            and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
            and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
            and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
    ) AS met
    ON fgeo.dh_geofield_geom && met.bbox
    group by met.featureid
),
met as (
    Select met.featureid, to_timestamp(:'tsendin') as obs_date,
        extract(year from to_timestamp(:'tsendin')) as yr,
        extract(month from to_timestamp(:'tsendin')) as mo,
        extract(day from to_timestamp(:'tsendin')) as da,
        extract(hour from to_timestamp(:'tsendin')) as hr,
        (ST_summarystats(st_clip(st_resample(met.rast,rt.rast), fgeo.dh_geofield_geom), :'band', TRUE)).mean as stats
    FROM usgs_features as f
    left outer join field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature'
    )
    JOIN metUnion AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),fgeo.dh_geofield_geom)
    left join (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
)
select featureid, obs_date, yr, mo, da, hr,
    0.0393701 * stats precip_in
from met order by met.obs_date;
```

 featureid |        obs_date        |  yr  | mo | da | hr |     precip_in
-----------+------------------------+------+----+----+----+--------------------
    617850 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.6558419584256406

Now let's find the mean precipitation in this watershed using the amalgamated data from tmp_amalgamate

Check `amalgamate` data:

```
WITH usgs_features AS (
    SELECT * FROM dh_feature WHERE hydrocode = :'hydrocode'
),
met as (
    Select f.hydroid as featureid, to_timestamp(:'tsendin') as obs_date,
        extract(year from to_timestamp(:'tsendin')) as yr,
        extract(month from to_timestamp(:'tsendin')) as mo,
        extract(day from to_timestamp(:'tsendin')) as da,
        extract(hour from to_timestamp(:'tsendin')) as hr,
        (ST_summarystats(st_clip(met.rast, fgeo.dh_geofield_geom), :'band', TRUE)).mean as stats
    FROM usgs_features as f
    left outer join field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature'
    )
    JOIN tmp_amalgamate AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),fgeo.dh_geofield_geom)
)
select featureid, obs_date, yr, mo, da, hr,
    0.0393701 * stats precip_in
from met order by met.obs_date;
```

 featureid |        obs_date        |  yr  | mo | da | hr |     precip_in
-----------+------------------------+------+----+----+----+--------------------
    290065 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.5779727155453814
(1 row)

These different values actually make sense. This is a larger watershed, so its values change after amalgamation as smaller watersheds are unioned on top. We instead need to find a headwater watershed, which should remain unchanged. For instance, usgs_ws_01645784 may work, as it is our smallest watershed by area (it used daymet in this run), as may usgs_ws_01654500 (which used PRISM in this run). Using the above code to evaluate usgs_ws_01654500, we see an identical value between amalgamate and dh_timeseries_weather's PRISM data. Similarly, usgs_ws_01645784 produced identical data to dh_timeseries_weather's daymet data. usgs_ws_02055100 is larger, but used NLDAS data in this random allocation of ratings; using the code above, we again see agreement between dh_timeseries_weather and the amalgamate method above.

rburghol commented 1 month ago

This is great progress @COBrogan, thanks for pushing it along! I was today years old when I learned the SQL term COALESCE, so thanks!

I had a couple questions and speculations (and may reveal my misunderstanding):

COBrogan commented 1 month ago
  • could your headwater data not yield the same data since the selection of best fit is random (for this demo)?

The headwaters are returning the same data as the randomly selected dataset. I created some temp tables to test this out, but that code isn't here (I have it on OneDrive if I need to refer to it again).

  • I wonder if we can improve performance by using the daily sum raster for nldas2 that @mwdunlap2004 is working on, so it's just another prepopulated varkey?

Absolutely! Those rasters will reduce run time by at least 33%, I think.

  • I think you have squeezed out at least some of the performance since your WITH queries are only grabbing the data for days when they're the selected source.

That's what I'm hoping! I tried to put a comment block above each 'WITH' to make it a little more legible, because the WITHs are difficult to read.

  • are you actually pulling in all USGS coverages for the amalgamated layers? I got a little lost between usgsCoverage, which seems to grab all shapes, and fullCoverage which seems to limit to a larger watershed of interest.

'fullCoverage' is the union of NLDAS for the whole CB extent. It's unioned at the end to serve as the "background" data if no better data exists.

  • if you are actually grabbing all the gage covs, do we need to use the single coverage in fullCoverage or is that just to get the base raster? I think I answered my own question but humor me please 😝

To get the base raster. There may be a better way when we finish with temporal disaggregation.

  • All in all, a 60 hour amalgamation process for the entire domain does not seem outrageous to me -- of course I will soon want to do it in 45 minutes but that is the way of things!

Hopefully we speed it up!

I'll be working on confirming that the larger watersheds are being overwritten by smaller headwaters using ST_Point next week.

rburghol commented 1 month ago

Thinking about steps: process -> 01_acquire and process -> 02_rank.

rburghol commented 1 month ago

@COBrogan Putting a little thought into how ratings could be included as part of the WITH section. Here it is in SQL form:

WITH ratings as (
  select f.hydroid, scen.propname, 
    rts.tstime, rts.tsendtime, rts.tsvalue as best_varid
  from usgs_coverage as f
  left outer join dh_properties as scen 
  on (
    scen.featureid = f.hydroid
    and scen.propname = :'ama_scenario'
  )
  --assumed: the ratings timeseries rows (rts) hang off the scenario property
  left outer join dh_timeseries as rts
  on (rts.featureid = scen.pid)
)

Then, we could collapse the WITH statements for prism and daymet into a single generic query where we simply select the varid joined to ratings.best_varid, like so:

BESTmet as (
    SELECT cov.*,
    met.featureid,met.tsendtime,
    st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
    FROM usgsCoverage as cov
    JOIN(
        select met.*
        from dh_timeseries_weather as met
        --join to the ratings CTE so only the best-rated varid's rasters remain,
        --replacing the per-dataset WHERE cov.dataID filters
        inner join ratings as b
            on (
          met.varid = b.best_varid
          and met.tstime >= b.tstime
          and met.tsendtime <= b.tsendtime
        ) 
        where met.featureid = :covid
            and met.tsendtime = :'tsendin'
    ) AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
    LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
),

For more information on how I think we might go about creating those pre-amalgamated ratings for each coverage see #96

COBrogan commented 1 month ago

@mwdunlap2004 @rburghol Using st_value, I checked the values of the amalgamated raster above at a few locations. My goal was to ensure the amalgamation was working, so I looked at points in a large watershed (e.g. usgs_ws_02035000, James River at Cartersville).

In each case, I found that the amalgamated raster was working as intended!


See below for modified amalgamate code that creates temp tables for the ratings and the amalgamate raster for later exploration. I used this to test amalgamate in the three scenarios above, but please note that the ratings are still randomly assigned, so each time you run this you will need to check the data sources selected in scenarios 1 and 2:

Temp Table Amalgamate:

```sql
--Given we have a file that indicates which dataset performed best, we should have
--three different datasets defined via WITH, grabbing and clipping coverages based on the file
--Arbitrarily pick a tsendtime to practice with, here September 16, 1981.
--Note that with NLDAS this will pick an arbitrary hour and we need the full 24-hour set
\set tsendin '369489600'
\set resample_varkey 'daymet_mod_daily'
-- sets all integer feature and varid with query
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset
create temp table tmp_usgsCoverage as (
    SELECT f.*,
    --Join in ratings. Until ratings are put in db via REST, let's
    --use a random integer between 0 (NLDAS) and 2 (daymet)
    floor(random() * (2-0+1) + 0)::int as dataID,
    --Add area of watershed/coverage for reference and to order from downstream to upstream
    ST_AREA(ST_MakeValid(fgeo.dh_geofield_geom)) as covArea,
    fgeo.dh_geofield_geom as dh_geofield_geom
    FROM dh_feature as f
    LEFT JOIN field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature'
    )
    WHERE f.bundle = 'watershed' AND f.ftype = 'usgs_full_drainage'
    ORDER BY covArea DESC
);
create temp table tmp_amalgamate as (
    --Grab the USGS full drainage geometries/coverages and assign ratings to indicate best
    --performing precip dataset
    WITH usgsCoverage as (
        SELECT * FROM tmp_usgsCoverage
    ),
    --Get the geometry and feature fields for the full coverage based on the covid variable gset above.
    --This will be used to create a resampled NLDAS for the day
    fullCoverage as (
        SELECT f.*,fgeo.dh_geofield_geom
        FROM dh_feature as f
        LEFT JOIN field_data_dh_geofield as fgeo
        on (
            fgeo.entity_id = f.hydroid
            and fgeo.entity_type = 'dh_feature'
        )
        WHERE f.hydroid = :'covid'
    ),
    --Where PRISM is the best performing dataset, grab the appropriate
    --daily raster from dh_timeseries_weather and resample to target resolution
    --and then clip to watershed boundaries
    prism as (
        SELECT cov.*,
        met.featureid,met.tsendtime,
        st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
        FROM usgsCoverage as cov
        JOIN(
            select *
            from dh_timeseries_weather as met
            left outer join dh_variabledefinition as b
                on (met.varid = b.hydroid)
            where b.varkey='prism_mod_daily'
                and met.featureid = :covid
                and met.tsendtime = :'tsendin'
        ) AS met
        ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
        LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
        ON 1 = 1
        WHERE cov.dataID = 1
    ),
    --Where daymet is the best performing dataset, grab the appropriate
    --daily raster from dh_timeseries_weather and resample to target resolution
    --and then clip to watershed boundaries
    daymet as (
        SELECT cov.*,
        met.featureid,met.tsendtime,
        st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
        FROM usgsCoverage as cov
        JOIN(
            select *
            from dh_timeseries_weather as met
            left outer join dh_variabledefinition as b
                on (met.varid = b.hydroid)
            where b.varkey='daymet_mod_daily'
                and met.featureid = :covid
                and met.tsendtime = :'tsendin'
        ) AS met
        ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
        LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
        ON 1 = 1
        WHERE cov.dataID = 2
    ),
    --Union all NLDAS rasters for the day to get the sum of NLDAS for the day
    nldasFullDay AS (
        SELECT st_union(met.rast,'sum') as rast
        FROM (
            select *
            from dh_timeseries_weather as met
            left outer join dh_variabledefinition as b
                on (met.varid = b.hydroid)
            where b.varkey='nldas2_precip_hourly_tiled_16x16'
                and met.featureid = :covid
                and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
                and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
                and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
        ) AS met
    ),
    nldasFullDayResamp AS (
        select st_resample(met.rast,rt.rast) as rast
        FROM fullCoverage as f
        JOIN nldasFullDay as met
        ON ST_ConvexHull(met.rast) && f.dh_geofield_geom
        LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
        ON 1 = 1
    ),
    --Union all NLDAS rasters for the day, intersecting by the usgsCoverage geometries
    --to leverage the tiled NLDAS rasters. The end result is a raster for each coverage
    --where NLDAS is the most highly rated that is of the full day's dataset,
    --but clipped to only intersecting tiles
    nldasDay as (
        SELECT cov.hydroid, cov.hydrocode,
        cov.ftype, cov.bundle, cov.name,
        :'tsendin' as tsendtime,
        st_union(met.rast,'sum') as rast
        FROM usgsCoverage as cov
        JOIN(
            select *
            from dh_timeseries_weather as met
            left outer join dh_variabledefinition as b
                on (met.varid = b.hydroid)
            where b.varkey='nldas2_precip_hourly_tiled_16x16'
                and met.featureid = :covid
                and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
                and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
                and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
        ) AS met
        ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
        WHERE cov.dataID = 0
        GROUP BY cov.hydroid, cov.hydrocode, cov.ftype,
            cov.bundle, cov.name
    ),
    --Now, using the union of NLDAS hourly data in nldasDay, resample to the template raster and clip to each
    --watershed where NLDAS is rated the best via an INNER JOIN and the WHERE in nldasDay
    nldas as (
        SELECT cov.*,met.tsendtime,
        st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
        FROM usgsCoverage as cov
        INNER JOIN nldasDay as met
        on cov.hydroid = met.hydroid
        LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
        ON 1 = 1
    ),
    --For each feature in usgsCoverage, find the non-NULL summary dataset (which will represent the best rated dataset)
    amalgamate as (
        select cov.*, COALESCE(prismMet.rast,daymetMet.rast,nldasMet.rast) as rast
        FROM usgsCoverage as cov
        LEFT JOIN nldas as nldasMet
        on cov.hydroid = nldasMet.hydroid
        LEFT JOIN prism as prismMet
        on cov.hydroid = prismMet.hydroid
        LEFT JOIN daymet as daymetMet
        on cov.hydroid = daymetMet.hydroid
    ),
    --Union the best rated datasets together. Since the data is sorted by drainage area,
    --upstream areas will populate after larger, downstream coverages
    amalgamateUnion as (
        SELECT ST_union(amalgamate.rast) as rast
        FROM usgsCoverage as cov
        LEFT JOIN amalgamate as amalgamate
        on cov.hydroid = amalgamate.hydroid
    )
    --Use a full union to create a column with the amalgamateUnion raster and the nldasFullDayResamp raster
    --Then, union the rasters to get a raster in which the "background" is the nldasFullDayResamp and everything else is
    --the best fit raster
    SELECT ST_union(fullUnion.rast) as rast
    FROM (
        SELECT rast FROM nldasFullDayResamp
        UNION ALL
        SELECT rast FROM amalgamateUnion
    ) as fullUnion
);
```

In my case, I found that daymet was randomly selected to be the "rating" for usgs_ws_02035000:

select dataid from tmp_usgsCoverage where hydrocode = 'usgs_ws_02035000';
 dataid
--------
      2

PRISM was selected as the random "rating" for usgs_ws_02030000:

select dataid from tmp_usgsCoverage where hydrocode = 'usgs_ws_02030000';
 dataid
--------
      1

First, let's check what value the amalgamated raster has in scenario 1 above (at a point in usgs_ws_02035000 that does not overlap any other watershed). Here, we should see that the amalgamated raster is equal to the raw daymet data in dh_timeseries_weather. The value in the amalgamated raster (set some variables first for ease of checking):

\set hydrocode 'cbp6_met_coverage'
\set band '1'
\set tsendin '369489600'
\set varkey 'daymet_mod_daily'
\set resample_varkey 'daymet_mod_daily'
--Set point to extract
\set latitude 37.792314
\set longitude -78.157391
-- sets all integer feature and varid with query 
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset
WITH usgs_features AS (
  --SELECT * 
  --FROM  dh_feature as f
  --WHERE f.bundle = 'watershed' AND f.ftype = 'usgs_full_drainage'
  SELECT * 
  FROM  dh_feature as f
    left outer join field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature' 
    ) 
  WHERE f.hydrocode = :'hydrocode'
),
testPoint as (
    SELECT ST_SetSRID(ST_Point( :'longitude', :'latitude'),4326) as testPoint
),
met as (
    Select f.hydrocode as hydrocode, to_timestamp(:'tsendin') as obs_date,
        extract(year from to_timestamp(:'tsendin')) as yr,
        extract(month from to_timestamp(:'tsendin')) as mo,
        extract(day from to_timestamp(:'tsendin')) as da,
        extract(hour from to_timestamp(:'tsendin')) as hr,
        st_value(met.rast, :'band', testPoint.testPoint) as stats
    FROM usgs_features as f
    JOIN tmp_amalgamate AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),ST_SetSRID(f.dh_geofield_geom,4326))
    JOIN testPoint as testPoint
    ON ST_INTERSECTS(testPoint.testPoint,ST_SetSRID(f.dh_geofield_geom,4326))
)
select hydrocode, obs_date, yr, mo, da, hr, 
  0.0393701 * stats precip_in
from met
order by met.obs_date;
 hydrocode        |        obs_date        |  yr  | mo | da | hr |     precip_in
------------------+------------------------+------+----+----+----+--------------------
 usgs_ws_02035000 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.8047248650259018

We compare this to the value for daymet from dh_timeseries_weather. Notice we supply a hydrocode here to speed up the query:

WITH usgs_features AS (
  SELECT * 
  FROM  dh_feature as f
    left outer join field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature' 
    ) 
  WHERE f.hydrocode = :'hydrocode'
),
testPoint as (
    SELECT ST_SetSRID(ST_Point( :'longitude', :'latitude'),4326) as testPoint
),
metUnion as (
    Select 
        st_union(met.rast,'sum') as rast
    FROM usgs_features as f
    JOIN testPoint as testPoint
    ON 1 = 1
    JOIN(
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid) 
        where b.varkey=:'varkey'
            and met.featureid = :covid
            and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
            and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
            and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
    ) AS met
    ON ST_INTERSECTS(testPoint.testPoint, met.bbox)
),
met as (
    Select f.hydrocode, to_timestamp(:'tsendin') as obs_date,
        extract(year from to_timestamp(:'tsendin')) as yr,
        extract(month from to_timestamp(:'tsendin')) as mo,
        extract(day from to_timestamp(:'tsendin')) as da,
        extract(hour from to_timestamp(:'tsendin')) as hr,
        st_value(st_resample(met.rast,rt.rast), :'band', testPoint.testPoint) as stats
    FROM usgs_features as f
    JOIN metUnion AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),ST_SetSRID(f.dh_geofield_geom,4326))
    JOIN testPoint as testPoint
    ON ST_INTERSECTS(testPoint.testPoint,ST_SetSRID(f.dh_geofield_geom,4326))
    left join (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
)
select hydrocode, obs_date, yr, mo, da, hr, 
  0.0393701 * stats precip_in
from met
order by met.obs_date;
 featureid |        obs_date        |  yr  | mo | da | hr |     precip_in
-----------+------------------------+------+----+----+----+--------------------
    617850 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.8047248650259018

Success! We have matching values. Now, what about a point in hydrocode usgs_ws_02030000 that overlaps usgs_ws_02035000? Here, the amalgamated raster should be equal to the value in the raw PRISM data, as PRISM was "selected" for this upstream watershed and should have been unioned on top of the daymet data in the amalgamated raster. Rerun the above amalgamate code, now with:

\set latitude 37.923872
\set longitude -78.587890
\set varkey 'prism_mod_daily'
 hydrocode        |        obs_date        |  yr  | mo | da | hr |     precip_in
------------------+------------------------+------+----+----+----+--------------------
 usgs_ws_02030000 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.6486224125185013

We now rerun the query against dh_timeseries_weather using the PRISM data and the updated coordinates. We expect an identical value for precip:

 featureid |        obs_date        |  yr  | mo | da | hr |     precip_in
-----------+------------------------+------+----+----+----+--------------------
    617850 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.6486224125185013

Success! The smaller watershed was correctly amalgamated on top of the larger, downstream watershed. For our last case, we want to ensure that the "background" raster exists where there is no USGS coverage. In the amalgamated raster, this should be:

\set latitude 40.299638
\set longitude -76.909790
\set varkey 'nldas2_precip_hourly_tiled_16x16'
 hydrocode         |        obs_date        |  yr  | mo | da | hr |      precip_in
-------------------+------------------------+------+----+----+----+---------------------
 cbp6_met_coverage | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.28930330582999997

Now let's compare to the tiled nldas data in dh_timeseries_weather:

 hydrocode         |        obs_date        |  yr  | mo | da | hr |      precip_in
-------------------+------------------------+------+----+----+----+---------------------
 cbp6_met_coverage | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.28930330582999997

Once again, we see the same value. So NLDAS2 is successfully being amalgamated as the "background" precip values.

COBrogan commented 1 month ago

Putting a little thought into how ratings could be included as part of the WITH section. Here it is in SQL form:

@rburghol I like your idea of how to add in the ratings and the "best" varid. I think that will help clean up our query, because that combined PRISM/daymet statement will work well. This is a great way to structure how we deal with the ratings files, and it will make the query more applicable to future datasets (currently we'd have to keep adding to the WITH; we can get around that by storing and using the varid as you indicated). I'll get started on this tomorrow if I can finish a few VPDES tasks; otherwise I'll start early next week.

COBrogan commented 1 month ago

Amalgamate is unique in that the first steps, 01_acquire and 02_rank, can be considered coverage dependent but time agnostic, based solely on the files output by geo 03 process. But the amalgamation and knitting of a "best" fit raster are time dependent and coverage agnostic. To maintain only three inputs into our workflow meta model (scenario, coverage, directory), we need to put all of the relevant ranking data into our db. That way, we can call amalgamate with a date, pull the relevant dates for all coverages defined in the config, select the "best fit", store the "best fit" in the db, and amalgamate the raster.
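As a sketch of the "select the best fit" step once the ranking data is in the db: assuming the per-dataset ratings are stored in dh_timeseries under the rating_* varkeys defined later in this thread (the exact column layout is an assumption here), Postgres's DISTINCT ON can pick the top-rated dataset per coverage for a given day:

```sql
--Hypothetical sketch: highest-rated dataset per coverage feature for one day.
--Varkeys follow this thread; table layout is assumed, not confirmed.
SELECT DISTINCT ON (rts.featureid)
    rts.featureid, v.varkey, rts.tsvalue AS rating
FROM dh_timeseries AS rts
JOIN dh_variabledefinition AS v
    ON v.hydroid = rts.varid
WHERE v.varkey IN ('rating_prism', 'rating_daymet', 'rating_nldas2')
    AND rts.tsendtime = :'tsendin'
ORDER BY rts.featureid, rts.tsvalue DESC;
```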

Returning to our scheme at the top of this issue:

TODO: determine the data model (rasters OR csv) for storing unique, scenario-based ratings in dh_timeseries_weather/dh_timeseries:

  • Add a property for the model onto the coverage feature
  • Add a scenario sub-prop to the feature model
  • Save dh_timeseries_weather entries attached to entity_type='dh_properties', featureid=[scenario pid]
  • All amalgamates are the same varkey on the scenario sub-prop, where scenario is something like simple_lm

flowchart TD
    B(Coverage Feature) --> |hydrocode = usgs_ws_01615000| A[(dh_feature)]
    C(Model Property) -->|feature id = e.g. 290064, varkey = 'om_model_object', propname = coverage_met, propcode = met1.0| B
    D(Scenario Property) -->|entity type = dh_property, varkey = 'met_model_scenario', propname = e.g. simple_lm| C
    E("raw" ranking timeseries from _geo_) -->|varkey = e.g. ranking_prism| D
    G(CBP Extent Feature) --> |hydrocode = 'cbp6_met_coverage'| A[(dh_feature)]
    C(Model Property) -->|feature id = e.g. 617850, varkey = 'met_model', propcode = amalgamation| G
    H(Scenario Property) -->|entity type = dh_property, varkey = 'met_model_scenario', propname = e.g. simple_lm| C
    F(best-fit ranking timeseries e.g. amalgamate ranking) --> |varkey = 'best_fit_ranking'| H
COBrogan commented 1 month ago

New variable definitions for ratings files with varkeys rating_prism, rating_daymet, rating_nldas2, and met_best_rating:

--Variable definition for rating files
insert into dh_variabledefinition(varname,vardesc, vocabulary, varunits, varkey,datatype,varcode,varabbrev,multiplicity)
VALUES
('Best fit met ratings','Best fit ratings timeseries derived from met amalgamate',
 'met_best', '', 'met_best_rating','value','met_best_rating','best_rating',''),
('PRISM met model ratings','Ratings derived from met models using PRISM data',
 'met_prism', '', 'rating_prism','value','rating_prism','rating_prism',''),
('daymet met model ratings','Ratings derived from met models using daymet data',
 'met_daymet', '', 'rating_daymet','value','rating_daymet','rating_daymet',''),
('NLDAS2 met model ratings','Ratings derived from met models using NLDAS2 data',
 'met_nldas2', '', 'rating_nldas2','value','rating_nldas2','rating_nldas2','');
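A quick sanity check that the four definitions landed (a sketch):

```sql
select varkey, varname
from dh_variabledefinition
where varkey in ('met_best_rating','rating_prism','rating_daymet','rating_nldas2')
order by varkey;
```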
COBrogan commented 1 month ago

Added two steps to geo analyze:

  • 09_importModelPropsToDB = Finds or creates a property for the met1.0 model on the USGS coverage, then creates a scenario based on the config. Ranking timeseries will be stored under this scenario.
  • 10_importRatingstoDB = Imports the timeseries into dh_weather

Testing on hydrocode usgs_ws_02038000:

/opt/model/meta_model/run_model raster_met stormVol_prism usgs_ws_02038000 auto geo analyze 09

coverage feature id 290098
model pid 7583836
scenario pid 7583843

/opt/model/meta_model/run_model raster_met stormVol_prism usgs_ws_02038000 auto geo analyze 10

COPY 31

Test on hydrocode usgs_ws_02021500:

coverage feature id 289948
model pid 7592953
scenario pid 7592960
COPY 522

Currently, UPDATE is not working: it is overwriting all values in the range with just one row.
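A common cause of that symptom is an UPDATE ... FROM that is missing the per-row join predicate, so a single source row gets applied to every target row in the range. A hedged sketch of the shape that avoids it (table and column names here are illustrative, not the actual import code):

```sql
--Hypothetical: refresh existing rating rows from a staging table. Without the
--tstime/tsendtime match, one staging value would overwrite every row in range.
UPDATE dh_timeseries AS ts
SET tsvalue = stg.tsvalue
FROM tmp_ratings AS stg
WHERE ts.featureid = stg.featureid
  AND ts.varid = stg.varid
  AND ts.tstime = stg.tstime
  AND ts.tsendtime = stg.tsendtime;
```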