HARPgroup / model_meteorology


Workflow `amalgamate` #66

Open rburghol opened 3 months ago

rburghol commented 3 months ago

This will produce a full-domain coverage raster of amalgamated daily best-fit rasters and store it in the database for later use by coverage model time series input routines in the `wdm` workflow.

Steps

Info

COBrogan commented 1 month ago

First attempt at amalgamating a raster. Pulls in clipped, resampled rasters for daymet and PRISM. The "best" dataset below is picked randomly; this will need to be updated as ratings are added to REST. Unions NLDAS rasters prior to resampling and clipping, which is the longest part of the query, adding roughly 10 seconds. For one day, this query takes about 15 seconds. Scaling this to the full time period would take about 60 hours (15 sec/day × 365 days/yr × 39 yr ÷ 3600 sec/hr ≈ 59 hours). To-do:
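The ~60-hour figure is just arithmetic on the per-day timing, and can be checked directly in psql:

```sql
-- back-of-envelope scaling of the 15 sec/day query to 39 years
select 15 * 365 * 39 / 3600.0 as est_hours;
-- est_hours = 59.3125, i.e. roughly 60 hours
```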

--Given we have a file that indicates which dataset performed best, we should have
--three different datasets defined via WITH, grabbing and clipping coverages based on the file

--Arbitrarily pick a tsendtime to practice with, here February 18, 2020.
--Note that with NLDAS this will pick an arbitrary hour and we need the full 24-hour set
\set tsendin '1582027200'
\set resample_varkey 'daymet_mod_daily'
-- sets all integer feature and varid with query 
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset

--Grab the USGS full drainage geometries/coverages and assign ratings to indicate best
--performing precip dataset
WITH usgsCoverage as (
    SELECT f.*,
    --Join in ratings. Until ratings are put in db via REST, let's
    --use a random integer between 0 (NLDAS) and 2 (daymet)
    floor(random() * (2-0+1) + 0)::int as dataID,
    --Add area of watershed/coverage for reference and to order from downstream to upstream
    ST_AREA(fgeo.dh_geofield_geom) as covArea,
    fgeo.dh_geofield_geom as dh_geofield_geom
    FROM dh_feature as f
    LEFT JOIN field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature' 
    ) 
    WHERE f.bundle = 'watershed' AND f.ftype = 'usgs_full_drainage'
    ORDER BY covArea DESC
),
--Get the geometry and feature fields for the full coverage based on the covid variable gset above. 
--This will be used to create a resampled NLDAS for the day
fullCoverage as (
    SELECT f.*,fgeo.dh_geofield_geom
    FROM dh_feature as f
    LEFT JOIN field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature' 
    ) 
    WHERE f.hydroid = :'covid'
),
--Where PRISM is the best performing dataset, grab the appropriate
--daily raster from dh_timeseries_weather and resample to target resolution
--and then clip to watershed boundaries
prism as (
    SELECT cov.*,
    met.featureid,met.tsendtime,
    st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
    FROM usgsCoverage as cov
    JOIN(
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid) 
        where b.varkey='prism_mod_daily'
            and met.featureid = :covid
            and met.tsendtime = :'tsendin'
    ) AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
    LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
    WHERE cov.dataID = 1
),
--Where daymet is the best performing dataset, grab the appropriate
--daily raster from dh_weather_timeseries and resample to target resolution
--and then clip to watershed boundaries
daymet as (
    SELECT cov.*,
    met.featureid,met.tsendtime,
    st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
    FROM usgsCoverage as cov
    JOIN(
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid) 
        where b.varkey='daymet_mod_daily'
            and met.featureid = :covid
            and met.tsendtime = :'tsendin'
    ) AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
    LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
    WHERE cov.dataID = 2
),
--Union all NLDAS rasters for the day to get the sum of NLDAS for the day
nldasFullDay AS (
    SELECT st_union(met.rast,'sum') as rast
    FROM (
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid) 
        where b.varkey='nldas2_precip_hourly_tiled_16x16'
            and met.featureid = :covid
            and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
            and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
            and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
    ) AS met
),
nldasFullDayResamp AS (
    select st_resample(met.rast,rt.rast) as rast
    FROM fullCoverage as f
    JOIN nldasFullDay as met
    ON ST_ConvexHull(met.rast) && f.dh_geofield_geom
    LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
),
--Union all NLDAS rasters for the day, intersecting by the usgsCoverage geometries
--to leverage the tiled NLDAS rasters. The end result is a raster for each coverage 
--where NLDAS is the most highly rated that is of the full day's dataset, 
--but clipped to only intersecting tiles
nldasDay as (
    SELECT cov.hydroid, cov.hydrocode,
    cov.ftype, cov.bundle, cov.name,
    :'tsendin' as tsendtime,
    st_union(met.rast,'sum') as rast
    FROM usgsCoverage as cov
    JOIN(
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid) 
        where b.varkey='nldas2_precip_hourly_tiled_16x16'
            and met.featureid = :covid
            and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
            and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
            and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
    ) AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
    WHERE cov.dataID = 0
    GROUP BY cov.hydroid, cov.hydrocode, cov.ftype,
        cov.bundle, cov.name
),
--Now, using the union of NLDAS hourly data in nldasDay, resample to the template raster and clip to each 
--watershed where NLDAS is rated the best via an INNER JOIN and the WHERE in nldasDay
nldas as (
    SELECT cov.*,met.tsendtime,
    st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
    FROM usgsCoverage as cov
    INNER JOIN nldasDay as met
    on cov.hydroid = met.hydroid
    LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
),
--For each feature in usgsCoverage, find the non-NULL summary dataset (which will represent the best rated dataset)
amalgamate as (
    select cov.*, COALESCE(prismMet.rast,daymetMet.rast,nldasMet.rast) as rast
    FROM usgsCoverage as cov
    LEFT JOIN nldas as nldasMet
    on cov.hydroid = nldasMet.hydroid
    LEFT JOIN prism as prismMet
    on cov.hydroid = prismMet.hydroid
    LEFT JOIN daymet as daymetMet
    on cov.hydroid = daymetMet.hydroid
),
--Union the best rated datasets together. Since the data is sorted by drainage area, 
--upstream areas will populate after larger, downstream coverages
amalgamateUnion as (
    SELECT ST_union(amalgamate.rast) as rast
    FROM usgsCoverage as cov
    LEFT JOIN amalgamate as amalgamate
    on cov.hydroid = amalgamate.hydroid
)
--Use a full union to create a column with the amalgamateUnion raster and the nldasFullDayResamp raster
--Then, union the rasters to get a raster in which the "background" is the nldasFullDayResamp and everything else is 
--the best fit raster
SELECT ST_union(fullUnion.rast) as rast
FROM (
    SELECT rast FROM nldasFullDayResamp
    UNION ALL
    SELECT rast FROM amalgamateUnion 
) as fullUnion;
COBrogan commented 1 month ago
Testing the above raster. Using random numbers to assign the best-fit dataset (for now):

 hydroid | hydrocode        | ftype              | bundle    | name                           | dataid | covarea
---------+------------------+--------------------+-----------+--------------------------------+--------+---------
  289982 | usgs_ws_02029000 | usgs_full_drainage | watershed | JAMES RIVER AT SCOTTSVILLE, VA |      0 |  1.2134
  290060 | usgs_ws_03176500 | usgs_full_drainage | watershed | NEW RIVER AT GLEN LYN, VA      |      2 |  0.9902
  290065 | usgs_ws_02026000 | usgs_full_drainage | watershed | JAMES RIVER AT BENT CREEK, VA  |      1 |  0.9666

So, let's find the mean precipitation in this watershed using the resampled PRISM data from dh_timeseries_weather

Check `dh_timeseries_weather` PRISM data:

```
\set hydrocode 'usgs_ws_02026000'
\set band '1'
\set tsendin '1582027200'
--\set varkey 'nldas2_precip_hourly_tiled_16x16'
\set varkey 'prism_mod_daily'
--\set varkey 'daymet_mod_daily'
\set resample_varkey 'daymet_mod_daily'
-- sets all integer feature and varid with query
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset
WITH usgs_features AS (
    SELECT * FROM dh_feature WHERE hydrocode = :'hydrocode'
),
metUnion as (
    Select met.featureid, st_union(met.rast,'sum') as rast
    FROM usgs_features as f
    left outer join field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature'
    )
    JOIN(
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid)
        where b.varkey=:'varkey'
            and met.featureid = :covid
            and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
            and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
            and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
    ) AS met
    ON fgeo.dh_geofield_geom && met.bbox
    group by met.featureid
),
met as (
    Select met.featureid, to_timestamp(:'tsendin') as obs_date,
        extract(year from to_timestamp(:'tsendin')) as yr,
        extract(month from to_timestamp(:'tsendin')) as mo,
        extract(day from to_timestamp(:'tsendin')) as da,
        extract(hour from to_timestamp(:'tsendin')) as hr,
        (ST_summarystats(st_clip(st_resample(met.rast,rt.rast), fgeo.dh_geofield_geom), :'band', TRUE)).mean as stats
    FROM usgs_features as f
    left outer join field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature'
    )
    JOIN metUnion AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),fgeo.dh_geofield_geom)
    left join (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
)
select featureid, obs_date, yr, mo, da, hr,
    0.0393701 * stats precip_in
from met order by met.obs_date;
```

 featureid |        obs_date        |  yr  | mo | da | hr |     precip_in
-----------+------------------------+------+----+----+----+--------------------
    617850 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.6558419584256406

Now let's find the mean precipitation in this watershed using the amalgamated data from tmp_amalgamate

Check `amalgamate` data:

```
WITH usgs_features AS (
    SELECT * FROM dh_feature WHERE hydrocode = :'hydrocode'
),
met as (
    Select f.hydroid as featureid, to_timestamp(:'tsendin') as obs_date,
        extract(year from to_timestamp(:'tsendin')) as yr,
        extract(month from to_timestamp(:'tsendin')) as mo,
        extract(day from to_timestamp(:'tsendin')) as da,
        extract(hour from to_timestamp(:'tsendin')) as hr,
        (ST_summarystats(st_clip(met.rast, fgeo.dh_geofield_geom), :'band', TRUE)).mean as stats
    FROM usgs_features as f
    left outer join field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature'
    )
    JOIN tmp_amalgamate AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),fgeo.dh_geofield_geom)
)
select featureid, obs_date, yr, mo, da, hr,
    0.0393701 * stats precip_in
from met order by met.obs_date;
```

 featureid |        obs_date        |  yr  | mo | da | hr |     precip_in
-----------+------------------------+------+----+----+----+--------------------
    290065 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.5779727155453814
(1 row)

These different values actually make sense. This is a larger watershed, so its values change after amalgamation as smaller watersheds are unioned on top. We instead need to find a headwater watershed, which should remain unchanged. For instance, usgs_ws_01645784 may work, as it is our smallest watershed by area (it used daymet in this run), as may usgs_ws_01654500 (which used PRISM in this run). Using the above code to evaluate usgs_ws_01654500, we see an identical value between amalgamate and dh_timeseries_weather's PRISM data. Similarly, usgs_ws_01645784 produced identical data to dh_timeseries_weather's daymet data. usgs_ws_02055100 is larger, but used NLDAS data in this random allocation of ratings; using the code above, we again see agreement between dh_timeseries_weather and the amalgamate method above.

rburghol commented 1 month ago

This is great progress @COBrogan, thanks for pushing it along! I was today years old when I learned the SQL term COALESCE, so thanks!

I had a couple questions and speculations (and may reveal my misunderstanding):

COBrogan commented 1 month ago
  • could your headwater data not yield the same data since the selection of best fit is random (for this demo)?

The headwaters are returning the same data as the randomly selected dataset. I created some temp tables to test this out, but that code isn't here (I have it on OneDrive if I need to refer to it again).

  • I wonder if we can improve performance by using the daily sum raster for nldas2 that @mwdunlap2004 is working on, so it's just another prepopulated varkey?

Absolutely! Those rasters will reduce run time by at least 33%, I think.

  • I think you have squeezed out at least some of the performance since your WITH queries are only grabbing the data for days when they're the selected source.

That's what I'm hoping! I tried to put a comment block above each 'WITH' to make it a little more legible, because the WITHs are difficult to read.

  • are you actually pulling in all USGS coverages for the amalgamated layers? I got a little lost between usgsCoverage, which seems to grab all shapes, and fullCoverage which seems to limit to a larger watershed of interest.

'fullCoverage' is the union of NLDAS for the whole CB extent. It's unioned at the end to serve as the "background" data if no better data exists.

  • if you are actually grabbing all the gage covs, do we need to use the single coverage in fullCoverage or is that just to get the base raster? I think I answered my own question but humor me please 😝

To get the base raster. There may be a better way when we finish with temporal disaggregation.

  • All in all, a 60 hour amalgamation process for the entire domain does not seem outrageous to me -- of course I will soon want to do it in 45 minutes but that is the way of things!

Hopefully we speed it up!

I'll be working on confirming that the larger watersheds are being overwritten by smaller headwaters using ST_Point next week.

rburghol commented 1 month ago

Thinking about steps: process -> 01_acquire and process -> 02_rank.

rburghol commented 1 month ago

@COBrogan Putting a little thought into how ratings could be included as part of the WITH section. Here it is in SQL form:

WITH ratings as (
  select f.hydroid, scen.propname, 
    rts.tstime, rts.tsendtime, rts.tsvalue as best_varid
  from usgs_coverage as f
  left outer join dh_properties as scen 
  on (
    scen.featureid = f.hydroid
    and scen.propname = :'ama_scenario'
  )
  --assumed: the ratings timeseries rows (rts) hang off the scenario property
  left outer join dh_timeseries as rts
  on (rts.featureid = scen.pid)
)

Then, we could collapse the WITH statements for prism and daymet into a single generic query where we simply select the varid joined to ratings.best_varid, like so:

BESTmet as (
    SELECT cov.*,
    met.featureid,met.tsendtime,
    st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
    FROM usgsCoverage as cov
    JOIN(
        select met.*
        from dh_timeseries_weather as met
        --join to the ratings CTE so only the best-rated varid's rasters remain,
        --replacing the per-dataset WHERE cov.dataID filters
        inner join ratings as b
            on (
          met.varid = b.best_varid
          and met.tstime >= b.tstime
          and met.tsendtime <= b.tsendtime
        ) 
        where met.featureid = :covid
            and met.tsendtime = :'tsendin'
    ) AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
    LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
),

For more information on how I think we might go about creating those pre-amalgamated ratings for each coverage see #96

COBrogan commented 1 month ago

@mwdunlap2004 @rburghol Using st_value, I checked the values of the amalgamated raster above at a few locations. My goal was to ensure the amalgamation was working, so I looked at points in a large watershed (e.g. usgs_ws_02035000, James River at Cartersville).

In each case, I found that the amalgamated raster was working as intended!


See below for modified amalgamate code that creates temp tables for the ratings and the amalgamate raster for later exploration. I used this to test amalgamate in the three scenarios above, but please note that the ratings are still randomly assigned, so each time you run this you will need to check the data sources selected in scenarios 1 and 2:

Temp Table Amalgamate:

```sql
--Given we have a file that indicates which dataset performed best, we should have
--three different datasets defined via WITH, grabbing and clipping coverages based on the file
--Arbitrarily pick a tsendtime to practice with, here September 16, 1981.
--Note that with NLDAS this will pick an arbitrary hour and we need the full 24-hour set
\set tsendin '369489600'
\set resample_varkey 'daymet_mod_daily'
-- sets all integer feature and varid with query
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset
create temp table tmp_usgsCoverage as (
    SELECT f.*,
    --Join in ratings. Until ratings are put in db via REST, let's
    --use a random integer between 0 (NLDAS) and 2 (daymet)
    floor(random() * (2-0+1) + 0)::int as dataID,
    --Add area of watershed/coverage for reference and to order from downstream to upstream
    ST_AREA(ST_MakeValid(fgeo.dh_geofield_geom)) as covArea,
    fgeo.dh_geofield_geom as dh_geofield_geom
    FROM dh_feature as f
    LEFT JOIN field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature'
    )
    WHERE f.bundle = 'watershed' AND f.ftype = 'usgs_full_drainage'
    ORDER BY covArea DESC
);
create temp table tmp_amalgamate as (
    --Grab the USGS full drainage geometries/coverages and assign ratings to indicate best
    --performing precip dataset
    WITH usgsCoverage as (
        SELECT * FROM tmp_usgsCoverage
    ),
    --Get the geometry and feature fields for the full coverage based on the covid variable gset above.
    --This will be used to create a resampled NLDAS for the day
    fullCoverage as (
        SELECT f.*,fgeo.dh_geofield_geom
        FROM dh_feature as f
        LEFT JOIN field_data_dh_geofield as fgeo
        on (
            fgeo.entity_id = f.hydroid
            and fgeo.entity_type = 'dh_feature'
        )
        WHERE f.hydroid = :'covid'
    ),
    --Where PRISM is the best performing dataset, grab the appropriate
    --daily raster from dh_timeseries_weather and resample to target resolution
    --and then clip to watershed boundaries
    prism as (
        SELECT cov.*,
        met.featureid,met.tsendtime,
        st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
        FROM usgsCoverage as cov
        JOIN(
            select *
            from dh_timeseries_weather as met
            left outer join dh_variabledefinition as b
                on (met.varid = b.hydroid)
            where b.varkey='prism_mod_daily'
                and met.featureid = :covid
                and met.tsendtime = :'tsendin'
        ) AS met
        ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
        LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
        ON 1 = 1
        WHERE cov.dataID = 1
    ),
    --Where daymet is the best performing dataset, grab the appropriate
    --daily raster from dh_timeseries_weather and resample to target resolution
    --and then clip to watershed boundaries
    daymet as (
        SELECT cov.*,
        met.featureid,met.tsendtime,
        st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
        FROM usgsCoverage as cov
        JOIN(
            select *
            from dh_timeseries_weather as met
            left outer join dh_variabledefinition as b
                on (met.varid = b.hydroid)
            where b.varkey='daymet_mod_daily'
                and met.featureid = :covid
                and met.tsendtime = :'tsendin'
        ) AS met
        ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
        LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
        ON 1 = 1
        WHERE cov.dataID = 2
    ),
    --Union all NLDAS rasters for the day to get the sum of NLDAS for the day
    nldasFullDay AS (
        SELECT st_union(met.rast,'sum') as rast
        FROM (
            select *
            from dh_timeseries_weather as met
            left outer join dh_variabledefinition as b
                on (met.varid = b.hydroid)
            where b.varkey='nldas2_precip_hourly_tiled_16x16'
                and met.featureid = :covid
                and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
                and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
                and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
        ) AS met
    ),
    nldasFullDayResamp AS (
        select st_resample(met.rast,rt.rast) as rast
        FROM fullCoverage as f
        JOIN nldasFullDay as met
        ON ST_ConvexHull(met.rast) && f.dh_geofield_geom
        LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
        ON 1 = 1
    ),
    --Union all NLDAS rasters for the day, intersecting by the usgsCoverage geometries
    --to leverage the tiled NLDAS rasters. The end result is a raster for each coverage
    --where NLDAS is the most highly rated that is of the full day's dataset,
    --but clipped to only intersecting tiles
    nldasDay as (
        SELECT cov.hydroid, cov.hydrocode,
        cov.ftype, cov.bundle, cov.name,
        :'tsendin' as tsendtime,
        st_union(met.rast,'sum') as rast
        FROM usgsCoverage as cov
        JOIN(
            select *
            from dh_timeseries_weather as met
            left outer join dh_variabledefinition as b
                on (met.varid = b.hydroid)
            where b.varkey='nldas2_precip_hourly_tiled_16x16'
                and met.featureid = :covid
                and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
                and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
                and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
        ) AS met
        ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
        WHERE cov.dataID = 0
        GROUP BY cov.hydroid, cov.hydrocode, cov.ftype,
            cov.bundle, cov.name
    ),
    --Now, using the union of NLDAS hourly data in nldasDay, resample to the template raster and clip to each
    --watershed where NLDAS is rated the best via an INNER JOIN and the WHERE in nldasDay
    nldas as (
        SELECT cov.*,met.tsendtime,
        st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
        FROM usgsCoverage as cov
        INNER JOIN nldasDay as met
        on cov.hydroid = met.hydroid
        LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
        ON 1 = 1
    ),
    --For each feature in usgsCoverage, find the non-NULL summary dataset (which will represent the best rated dataset)
    amalgamate as (
        select cov.*, COALESCE(prismMet.rast,daymetMet.rast,nldasMet.rast) as rast
        FROM usgsCoverage as cov
        LEFT JOIN nldas as nldasMet
        on cov.hydroid = nldasMet.hydroid
        LEFT JOIN prism as prismMet
        on cov.hydroid = prismMet.hydroid
        LEFT JOIN daymet as daymetMet
        on cov.hydroid = daymetMet.hydroid
    ),
    --Union the best rated datasets together. Since the data is sorted by drainage area,
    --upstream areas will populate after larger, downstream coverages
    amalgamateUnion as (
        SELECT ST_union(amalgamate.rast) as rast
        FROM usgsCoverage as cov
        LEFT JOIN amalgamate as amalgamate
        on cov.hydroid = amalgamate.hydroid
    )
    --Use a full union to create a column with the amalgamateUnion raster and the nldasFullDayResamp raster
    --Then, union the rasters to get a raster in which the "background" is the nldasFullDayResamp and everything else is
    --the best fit raster
    SELECT ST_union(fullUnion.rast) as rast
    FROM (
        SELECT rast FROM nldasFullDayResamp
        UNION ALL
        SELECT rast FROM amalgamateUnion
    ) as fullUnion
);
```

In my case, I found that daymet was randomly selected to be the "rating" for usgs_ws_02035000:

select dataid from tmp_usgsCoverage where hydrocode = 'usgs_ws_02035000';
 dataid
--------
      2

PRISM was selected as the random "rating" for usgs_ws_02030000:

select dataid from tmp_usgsCoverage where hydrocode = 'usgs_ws_02030000';
 dataid
--------
      1

First, let's check what value the amalgamated raster has in scenario 1 above (at a point in usgs_ws_02035000 that does not overlap any other watershed). Here, we should see that the amalgamated raster is equal to the raw daymet data in dh_timeseries_weather. The value in the amalgamated raster (set some variables first for ease of checking):

\set hydrocode 'cbp6_met_coverage'
\set band '1'
\set tsendin '369489600'
\set varkey 'daymet_mod_daily'
\set resample_varkey 'daymet_mod_daily'
--Set point to extract
\set latitude 37.792314
\set longitude -78.157391
-- sets all integer feature and varid with query 
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset
WITH usgs_features AS (
  --SELECT * 
  --FROM  dh_feature as f
  --WHERE f.bundle = 'watershed' AND f.ftype = 'usgs_full_drainage'
  SELECT * 
  FROM  dh_feature as f
    left outer join field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature' 
    ) 
  WHERE f.hydrocode = :'hydrocode'
),
testPoint as (
    SELECT ST_SetSRID(ST_Point( :'longitude', :'latitude'),4326) as testPoint
),
met as (
    Select f.hydrocode as hydrocode, to_timestamp(:'tsendin') as obs_date,
        extract(year from to_timestamp(:'tsendin')) as yr,
        extract(month from to_timestamp(:'tsendin')) as mo,
        extract(day from to_timestamp(:'tsendin')) as da,
        extract(hour from to_timestamp(:'tsendin')) as hr,
        st_value(met.rast, :'band', testPoint.testPoint) as stats
    FROM usgs_features as f
    JOIN tmp_amalgamate AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),ST_SetSRID(f.dh_geofield_geom,4326))
    JOIN testPoint as testPoint
    ON ST_INTERSECTS(testPoint.testPoint,ST_SetSRID(f.dh_geofield_geom,4326))
)
select hydrocode, obs_date, yr, mo, da, hr, 
  0.0393701 * stats precip_in
from met
order by met.obs_date;
 hydrocode        |        obs_date        |  yr  | mo | da | hr |     precip_in
------------------+------------------------+------+----+----+----+--------------------
 usgs_ws_02035000 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.8047248650259018

We compare this to the value for daymet from dh_timeseries_weather. Notice we supply a hydrocode here to speed up the query:

WITH usgs_features AS (
  SELECT * 
  FROM  dh_feature as f
    left outer join field_data_dh_geofield as fgeo
    on (
        fgeo.entity_id = f.hydroid
        and fgeo.entity_type = 'dh_feature' 
    ) 
  WHERE f.hydrocode = :'hydrocode'
),
testPoint as (
    SELECT ST_SetSRID(ST_Point( :'longitude', :'latitude'),4326) as testPoint
),
metUnion as (
    Select 
        st_union(met.rast,'sum') as rast
    FROM usgs_features as f
    JOIN testPoint as testPoint
    ON 1 = 1
    JOIN(
        select *
        from dh_timeseries_weather as met
        left outer join dh_variabledefinition as b
            on (met.varid = b.hydroid) 
        where b.varkey=:'varkey'
            and met.featureid = :covid
            and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
            and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
            and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
    ) AS met
    ON ST_INTERSECTS(testPoint.testPoint, met.bbox)
),
met as (
    Select f.hydrocode, to_timestamp(:'tsendin') as obs_date,
        extract(year from to_timestamp(:'tsendin')) as yr,
        extract(month from to_timestamp(:'tsendin')) as mo,
        extract(day from to_timestamp(:'tsendin')) as da,
        extract(hour from to_timestamp(:'tsendin')) as hr,
        st_value(st_resample(met.rast,rt.rast), :'band', testPoint.testPoint) as stats
    FROM usgs_features as f
    JOIN metUnion AS met
    ON ST_Intersects(ST_ConvexHull(met.rast),ST_SetSRID(f.dh_geofield_geom,4326))
    JOIN testPoint as testPoint
    ON ST_INTERSECTS(testPoint.testPoint,ST_SetSRID(f.dh_geofield_geom,4326))
    left join (select rast from raster_templates where varkey = :'resample_varkey') as rt
    ON 1 = 1
)
select hydrocode, obs_date, yr, mo, da, hr, 
  0.0393701 * stats precip_in
from met
order by met.obs_date;
 featureid |        obs_date        |  yr  | mo | da | hr |     precip_in
-----------+------------------------+------+----+----+----+--------------------
    617850 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.8047248650259018

Success! We have matching values. Now, what about a point in hydrocode usgs_ws_02030000 that overlaps usgs_ws_02035000? Here, the amalgamated raster should be equal to the value in the raw PRISM data, as PRISM was "selected" for this upstream watershed and should have been unioned on top of the daymet data in the amalgamated raster. Rerun the above amalgamate code, now with:

\set latitude 37.923872
\set longitude -78.587890
\set varkey 'prism_mod_daily'
 hydrocode        |        obs_date        |  yr  | mo | da | hr |     precip_in
------------------+------------------------+------+----+----+----+--------------------
 usgs_ws_02030000 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.6486224125185013

We now rerun the query against dh_timeseries_weather using the PRISM data and the updated coordinates. We expect an identical value for precip:

 featureid |        obs_date        |  yr  | mo | da | hr |     precip_in
-----------+------------------------+------+----+----+----+--------------------
    617850 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.6486224125185013

Success! The smaller watershed was correctly amalgamated on top of the larger, downstream watershed. For our last case, we want to ensure that the "background" raster exists where there is no USGS coverage. In the amalgamated raster, this should be:

\set latitude 40.299638
\set longitude -76.909790
\set varkey 'nldas2_precip_hourly_tiled_16x16'
 hydrocode         |        obs_date        |  yr  | mo | da | hr |      precip_in
-------------------+------------------------+------+----+----+----+---------------------
 cbp6_met_coverage | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.28930330582999997

Now let's compare to the tiled nldas data in dh_timeseries_weather:

 hydrocode         |        obs_date        |  yr  | mo | da | hr |      precip_in
-------------------+------------------------+------+----+----+----+---------------------
 cbp6_met_coverage | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.28930330582999997

Once again, we see the same value. So NLDAS2 is successfully being amalgamated as the "background" precip values.

COBrogan commented 1 month ago

Putting a little thought into how ratings could be included as part of the WITH section. Here it is in SQL form:

@rburghol I like your idea of how to add in the ratings and the "best" varid. I think that will help clean up our query, because that combined PRISM/daymet statement will work well. This is a great way to structure how we deal with the ratings files, and it will make the query more applicable to future datasets (currently we'd have to keep adding to the WITH; we can get around that by storing and using the varid as you indicated). I'll get started on this tomorrow if I can finish a few VPDES tasks; otherwise I'll start early next week.

COBrogan commented 1 month ago

Amalgamate is unique in that the first steps, 01_acquire and 02_rank, can be considered coverage dependent but time agnostic, based solely on the files output by geo 03 process. But the amalgamation and knitting of a "best" fit raster are time dependent and coverage agnostic. To maintain only three inputs into our workflow meta model (scenario, coverage, directory), we need to put all of the relevant ranking data into our db. That way, we can call amalgamate with a date, pull the relevant dates for all coverages defined in the config, select the "best fit", store the "best fit" in the db, and amalgamate the raster.
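As a sketch of the "select the best fit" step once the ranking data is in the db: assuming the per-dataset ratings are stored in dh_timeseries under the rating_* varkeys defined later in this thread (the exact column layout is an assumption here), Postgres's DISTINCT ON can pick the top-rated dataset per coverage for a given day:

```sql
--Hypothetical sketch: highest-rated dataset per coverage feature for one day.
--Varkeys follow this thread; table layout is assumed, not confirmed.
SELECT DISTINCT ON (rts.featureid)
    rts.featureid, v.varkey, rts.tsvalue AS rating
FROM dh_timeseries AS rts
JOIN dh_variabledefinition AS v
    ON v.hydroid = rts.varid
WHERE v.varkey IN ('rating_prism', 'rating_daymet', 'rating_nldas2')
    AND rts.tsendtime = :'tsendin'
ORDER BY rts.featureid, rts.tsvalue DESC;
```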

Returning to our scheme at the top of this issue:

TODO: determine the data model (rasters OR csv) for storing unique, scenario-based ratings in dh_timeseries_weather/dh_timeseries:

  • Add a property for the model onto the coverage feature
  • Add a scenario sub-prop to the feature model
  • Save dh_timeseries_weather entries attached to entity_type='dh_properties', featureid=[scenario pid]
  • All amalgamates are the same varkey on the scenario sub-prop, where scenario is something like simple_lm

flowchart TD
    B(Coverage Feature) --> |hydrocode = usgs_ws_01615000| A[(dh_feature)]
    C(Model Property) -->|feature id = e.g. 290064, varkey = 'om_model_object', propname = coverage_met, propcode = met1.0| B
    D(Scenario Property) -->|entity type = dh_property, varkey = 'met_model_scenario', propname = e.g. simple_lm| C
    E("raw" ranking timeseries from _geo_) -->|varkey = e.g. ranking_prism| D
    G(CBP Extent Feature) --> |hydrocode = 'cbp6_met_coverage'| A[(dh_feature)]
    C(Model Property) -->|feature id = e.g. 617850, varkey = 'met_model', propcode = amalgamation| G
    H(Scenario Property) -->|entity type = dh_property, varkey = 'met_model_scenario', propname = e.g. simple_lm| C
    F(best-fit ranking timeseries e.g. amalgamate ranking) --> |varkey = 'best_fit_ranking'| H
COBrogan commented 1 month ago

New variable definitions for ratings files with varkeys rating_prism, rating_daymet, rating_nldas2, and met_best_rating:

--Variable definition for rating files
insert into dh_variabledefinition(varname,vardesc, vocabulary, varunits, varkey,datatype,varcode,varabbrev,multiplicity)
VALUES
('Best fit met ratings','Best fit ratings timeseries derived from met amalgamate',
 'met_best', '', 'met_best_rating','value','met_best_rating','best_rating',''),
('PRISM met model ratings','Ratings derived from met models using PRISM data',
 'met_prism', '', 'rating_prism','value','rating_prism','rating_prism',''),
('daymet met model ratings','Ratings derived from met models using daymet data',
 'met_daymet', '', 'rating_daymet','value','rating_daymet','rating_daymet',''),
('NLDAS2 met model ratings','Ratings derived from met models using NLDAS2 data',
 'met_nldas2', '', 'rating_nldas2','value','rating_nldas2','rating_nldas2','');
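A quick sanity check that the four definitions landed (a sketch):

```sql
select varkey, varname
from dh_variabledefinition
where varkey in ('met_best_rating','rating_prism','rating_daymet','rating_nldas2')
order by varkey;
```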
COBrogan commented 1 month ago

Added two steps to geo analyze:

  • 09_importModelPropsToDB = Finds or creates a property for the met1.0 model on the USGS coverage, then creates a scenario based on the config. Ranking timeseries will be stored under this scenario.
  • 10_importRatingstoDB = Imports the timeseries into dh_weather

Testing on hydrocode usgs_ws_02038000:

/opt/model/meta_model/run_model raster_met stormVol_prism usgs_ws_02038000 auto geo analyze 09

coverage feature id 290098
model pid 7583836
scenario pid 7583843

/opt/model/meta_model/run_model raster_met stormVol_prism usgs_ws_02038000 auto geo analyze 10

COPY 31

Test on hydrocode usgs_ws_02021500:

coverage feature id 289948
model pid 7592953
scenario pid 7592960
COPY 522

Currently, UPDATE is not working: it is overwriting all values in the range with just one row.
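A common cause of that symptom is an UPDATE ... FROM that is missing the per-row join predicate, so a single source row gets applied to every target row in the range. A hedged sketch of the shape that avoids it (table and column names here are illustrative, not the actual import code):

```sql
--Hypothetical: refresh existing rating rows from a staging table. Without the
--tstime/tsendtime match, one staging value would overwrite every row in range.
UPDATE dh_timeseries AS ts
SET tsvalue = stg.tsvalue
FROM tmp_ratings AS stg
WHERE ts.featureid = stg.featureid
  AND ts.varid = stg.varid
  AND ts.tstime = stg.tstime
  AND ts.tsendtime = stg.tsendtime;
```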