Workflow Step: `met` -> `04_hourly_fractions`

Check method to change this workflow into daily data, without temp tables, that inserts into dh_timerseries_weather using methods from calc_raster_ts and timestamp methods

We want to take the work from #86 and convert it into something more similar to raster_calc:

[ ] Change the code to work for a singular day, but with leaving the on and group_by statments
[ ] Use the epoch of tsendtime as the input, along with the varkey, setting the variables ahead of time
[ ] Do this without creating temp tables, but inserting along the way
[ ] Create a delete statement that deletes the specific day we are working on from dh_timeseries_weather
[ ] Questions:
- [ ] if we are doing multiple variables (daily and hourly), what does having the varkey as an input do? Unless we have two separate ones so that if its changed in the future, it doesn't effect the code.
  - Answer: Yes, sounds like we will need to set 2 varkey's, 1) the source (hourly) data nldas2_obs_hourly, 2) the destination (daily data) nldas2_precip_daily.
- [ ] @mwdunlap2004 : are you planning on doing both daily total and hourly fraction in this issue, or do you wish to split them into separate issues (and also maybe separate workflow steps as we discussed). Your choice. We can always smash them together or pull them apart.

Our Delete function for daily_precip

dailyvarkeyID=$1
start_date=-1
end_date=-1

if [ $# -gt 1 ]; then
  start_date=$2
fi

if [ $# -gt 2 ]; then
  end_date=$3
fi

if [ "$start_date" == "-1" ]; then
  start_epoch=-1
else
  start_epoch=$(date -d "$start_date" +%s)
fi

if [ "$end_date" == "-1" ]; then
  end_epoch=-1
else
  end_epoch=$(date -d "$end_date" +%s)
fi

deletedaily="
\\set dailyvarkeyID '$dailyvarkeyID'
\\set start_epoch $start_epoch
\\set end_epoch $end_epoch

delete from tmp_dh_timeseries_weather as data
using dh_variabledefinition as variables
where data.varid = variables.hydroid
  and variables.varkey = :'dailyvarkeyID'
  and ( (data.tstime >= :start_epoch) or (:start_epoch = -1) )
  and ( (data.tsendtime <= :end_epoch) or (:end_epoch = -1) )
;"

# turn off the expansion of the asterisk
set -f
echo -e $deletedaily > $delete_daily_file
cat $delete_daily_file | psql -h $db_host $db_name

Our Insert function for daily_precip

hourlyvarkeyID=$1
dailyvarkeyID=$2
start_date =-1
end_date =-1
if [ $# -gt 2 ]; then
  start_date=$3
fi
if [ $# -gt 3 ]; then
  end_date=$4
fi
if [ "$start_date" == "-1" ]; then
  start_epoch='-1'
else
  start_epoch=$(date -d "$start_date" +%s)
fi
if [ "$end_date" == "-1" ]; then
 end_epoch='-1' # Or leave it undefined
else
  end_epoch=$(date -d "$end_date" +%s)
fi

rastersql="
\\set hourlyvarkeyID '$hourlyvarkeyID' \n
\\set dailyvarkeyID '$dailyvarkeyID' \n
\\set start_epoch $start_epoch
\\set end_epoch $end_epoch

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(data.tstime),max(data.tsendtime), data.featureid, data.entity_type, ST_Union(data.rast, 'SUM') as rast, data.bbox, daily.hydroid as varid
from dh_timeseries_weather as data
left outer join tmp_dh_timeseries_weather as dupe
on (
 data.tstime = dupe.tstime
 and data.tsendtime = dupe.tsendtime
 and data.varid = dupe.varid
 and data.bbox = dupe.bbox
)
left outer join dh_variabledefinition as hour
on(
  data.varid = hour.hydroid
)
left outer join dh_variabledefinition as daily
on(
  daily.varkey = :'dailyvarkeyID'
)
where dupe.tid is null
AND hour.varkey = :'hourlyvarkeyID'
and ( (data.tstime >=:start_epoch) or (:start_epoch = -1) )
and ( (data.tsendtime <=:end_epoch) or (:end_epoch = -1) )
GROUP BY data.featureid, data.entity_type, data.bbox, daily.hydroid,
extract(day from to_timestamp(data.tsendtime)), 
extract(year from to_timestamp(data.tsendtime)), 
extract(month from to_timestamp(data.tsendtime));"

# turn off the expansion of the asterisk
set -f
echo -e $raster_sql > $raster_sql_file 
cat $raster_sql_file | psql -h $db_host $db_name

Rob I was thinking of doing both steps in the same workflow, but I think once we have the process completed it could easily be separated. This is my take on our step to create the daily precipitation data in dh_timeseriesweather, it isn't working right now, as it isn't inserting any data, I'm thinking this is related to the varid, in this step, we simultaneously need to make sure our varkey is the hourly data, and that we change it to the daily varid, how do we do that? I did check and there is a hydroid for nldas2_precip_daily, its 1464, I'm just unsure on how to handle us having two different values in this step. I don't think calc_raster has that step, they join on the hydroid, but they don't update it in that step.

varkey=$1 \set varkey '$varkey' \n end_epoch = $2

How to make our epoch

select extract(epoch from ('2021-05-01'::timestamp));

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(a.tstime),max(a.tsendtime), a.featureid, a.entity_type, ST_Union(a.rast, 'SUM') as rast, a.bbox, b.hydroid as varid
from dh_timeseries_weather as a
left outer join tmp_dh_timeseries_weather as dupe
on (
 a.tstime = dupe.tstime
 and a.tsendtime = dupe.tsendtime
 and a.varid = dupe.varid
 and a.bbox = dupe.bbox
)
left outer join dh_variabledefinition as b 
on(
  a.varid = b.hydroid AND
  b.varkey = 'varkey'
)
where dupe.tid is null
AND b.varkey = 'nldas2_obs_hourly'
AND extract(date from to_timestamp(a.tsendtime)) = extract(date from to_timestamp(end_epoch))
GROUP BY a.featureid, a.entity_type, a.bbox, b.hydroid,
extract(day from to_timestamp(a.tsendtime)), 
extract(year from to_timestamp(a.tsendtime)), 
extract(month from to_timestamp(a.tsendtime));

@mwdunlap2004 I think there are a few issues here with your psql variables. First, I'm a little confused by these lines:

varkey=$1
\set varkey '$varkey' \n
end_epoch = $2

Are arguments $1 and $2 coming in from the command line? If so, I'd suggest declaring these first and then using \set (basically, switch lines 2 and 3). For best practice, let's be more specific in our declaration of varkey too. We should try to use the brackets notation with string interpolation, though it's not really necessary here (and thus we are not always consistent):

varkey=$1
end_epoch = $2
\set varkey '${varkey}' \n

One reason your query is not working, is the following:

left outer join dh_variabledefinition as b 
on(
  a.varid = b.hydroid AND
  b.varkey = 'varkey'
)

This is literally joining on varkey = 'varkey', so it's looking for all entries where varkey is equal to the word 'varkey'. So not what we want. Instead, you need to call your SQL variable using the colon notation:

left outer join dh_variabledefinition as b 
on(
  a.varid = b.hydroid AND
  b.varkey = :'varkey'
)

As for setting the daily varkey, look at what you are sending to INSERT. Rather than inserting b.hydroid, send a string variable that you need instead e.g. 'dailyvarkeyID' as varid

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(a.tstime),max(a.tsendtime), a.featureid, a.entity_type, ST_Union(a.rast, 'SUM') as rast, a.bbox, b.hydroid as varid

@COBrogan great observations and points of style. I would add:

The \set command is part of a quoted string variable raster_sql that is assembled in the script then sent to the database via the line that starts echo $raster_sql | psql...
The lines like varkey=$2 are NOT part of that SQL string variable, and should all be placed at the top of the script before you begin to declare the raster_sql variable.

@mwdunlap2004 per your question about the other varkey/varid you will simply need to change the script to expect a second varkey for the daily variable that you are inserting.

I think you definitely need to separate the creation of the daily total and the hourly fraction into separate scripts. This will simplify debugging, increase readability and also allow us to utilize the hourly-to-daily separately. If you don't separate them you will now need to pass in 3 varkeys as arguments: the hourly data varkey, the hourly fraction varkey and the daily varkey -- another reason that splitting simplifies things.

Based on your suggestions here is my change to the code to insert our daily rasters into dh_timeseries_weather, however I do have a few questions about how this is working. First in this example we set dailyvarkeyID as varid, does this mean we'd want this to be a number '1464' in this case, as opposed to the character string I had used for our fraction raster step previously. What I mean is that in the past we had used a left join to update the records in our table using this new hydroid based on the dh_variabledefintion table and a new varkey.

Second, I am working under the assumption that we would want our epoch to be used as an argument into this function, but would we rather create it when we set the variables? This would mean the argument is the date and then I could do something like:

\set epoch '${extract(epoch from ('date'::timestamp))}' \n

I'm not sure if this would work, but that brings me to my third question, which is how do I declare these variables in my environment? I tried this to set our date, but I'm not sure how to check if it worked, or how to call my variable once I've done this.

Do $$ Declare data_date DATE; BEGIN data_date = '2021-05-01';
END $$;

hourlyvarkeyID=$1 dailyvarkeyID=$2 end_epoch = $3


rastersql="
\\set hourlyvarkeyID '${hourlyvarkeyID}' \n
\\set dailyvarkeyID '${dailyvarkeyID}' \n

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(a.tstime),max(a.tsendtime), a.featureid, a.entity_type, ST_Union(a.rast, 'SUM') as rast, a.bbox, :'dailyvarkeyID' as varid
from dh_timeseries_weather as a
left outer join tmp_dh_timeseries_weather as dupe
on (
 a.tstime = dupe.tstime
 and a.tsendtime = dupe.tsendtime
 and a.varid = dupe.varid
 and a.bbox = dupe.bbox
)
left outer join dh_variabledefinition as b 
on(
  a.varid = b.hydroid
)
where dupe.tid is null
AND b.varkey =: 'hourlyvarkeyID'
AND extract(day from to_timestamp(a.tsendtime)) = extract(day from to_timestamp(end_epoch))
AND extract(year from to_timestamp(a.tsendtime)) = extract(year from to_timestamp(end_epoch))
AND extract(month from to_timestamp(a.tsendtime)) = extract(month from to_timestamp(end_epoch))
GROUP BY a.featureid, a.entity_type, a.bbox, b.hydroid,
extract(day from to_timestamp(a.tsendtime)), 
extract(year from to_timestamp(a.tsendtime)), 
extract(month from to_timestamp(a.tsendtime));"

This is my updated code, it uses start and end_date and converts them into epoch, or sets them to -1 if we don't give them a date, it uses our defined hourly and varkeyID,uses two left outer joins it also does the time matching based on our epoch. However, how do I run this code to see if it works? Because I tried to set the variables in SQL, and it said that ' \ ' was an invalid command, is it because this is really a bash command not an SQL one?

hourlyvarkeyID=$1
dailyvarkeyID=$2
start_date =-1
end_date =-1
if [ $# -gt 2 ]; then
  start_date=$3
fi
if [ $# -gt 3 ]; then
  end_date=$4
fi
if [ "$start_date" == "-1" ]; then
  start_epoch='-1'
else
  start_epoch=$(date -d "$start_date" +%s)
fi
if [ "$end_date" == "-1" ]; then
 end_epoch='-1' # Or leave it undefined
else
  end_epoch=$(date -d "$end_date" +%s)
fi

rastersql="
\\set hourlyvarkeyID '$hourlyvarkeyID' \n
\\set dailyvarkeyID '$dailyvarkeyID' \n
\\set start_epoch $start_epoch
\\set end_epoch $end_epoch

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(data.tstime),max(data.tsendtime), data.featureid, data.entity_type, ST_Union(data.rast, 'SUM') as rast, data.bbox, daily.hydroid as varid
from dh_timeseries_weather as data
left outer join tmp_dh_timeseries_weather as dupe
on (
 data.tstime = dupe.tstime
 and data.tsendtime = dupe.tsendtime
 and data.varid = dupe.varid
 and data.bbox = dupe.bbox
)
left outer join dh_variabledefinition as hour
on(
  data.varid = hour.hydroid
)
left outer join dh_variabledefinition as daily
on(
  daily.varkey = :'dailyvarkeyID'
)
where dupe.tid is null
AND hour.varkey = :'hourlyvarkeyID'
and ( (data.tstime >= $start_epoch) or ($start_epoch = -1) )
and ( (data.tsendtime <= $end_epoch) or ($end_epoch = -1) )
GROUP BY data.featureid, data.entity_type, data.bbox, daily.hydroid,
extract(day from to_timestamp(data.tsendtime)), 
extract(year from to_timestamp(data.tsendtime)), 
extract(month from to_timestamp(data.tsendtime));"

Here is my attempt at making the delete statement: It takes in similar requests, using the varkeyID and the start and end dates to create our epochs and then deleting the rows in our data that match those values, it also has the default of -1 to delete all the rows, so that if you want all the data for the daily table, it will delete all the rows.

dailyvarkeyID=$1
start_date=-1
end_date=-1

if [ $# -gt 1 ]; then
  start_date=$2
fi

if [ $# -gt 2 ]; then
  end_date=$3
fi

if [ "$start_date" == "-1" ]; then
  start_epoch=-1
else
  start_epoch=$(date -d "$start_date" +%s)
fi

if [ "$end_date" == "-1" ]; then
  end_epoch=-1
else
  end_epoch=$(date -d "$end_date" +%s)
fi

deletedaily="
\\set dailyvarkeyID '$dailyvarkeyID'
\\set start_epoch $start_epoch
\\set end_epoch $end_epoch

delete from tmp_dh_timeseries_weather as data
using dh_variabledefinition as variables
where data.varid = variables.hydroid
  and variables.varkey = :'dailyvarkeyID'
  and ( (data.tstime >= :start_epoch) or (:start_epoch = -1) )
  and ( (data.tsendtime <= :end_epoch) or (:end_epoch = -1) )
;"

@mwdunlap2004 Not sure if this answers your question about set but you only need one slash when running the command manually e.g.

\set testVar
select :testVar

However, when you are writing the Bash command, the backslash has to be escaped so that you need \\set and each has to be on its own line (and thus needs to end with \n). Notice the \n at the end of your first set. In your INSERT query, your start_epoch and end_epoch need to be called using the colon notation in your WHERE statement e.g.

and ( (data.tstime >= :start_epoch) or (:start_epoch = -1) )
and ( (data.tsendtime <= :end_epoch) or (:end_epoch = -1) )

Are you running this line-by-line, or writing the query to file via echo/cat? e.g.

# turn off the expansion of the asterisk
set -f
echo -e $raster_sql > $raster_sql_file 
cat $raster_sql_file | psql -h $db_host $db_name

@COBrogan @rburghol So more along the lines of this? I also don't have quotes around the epoch's right now, is that something I need because I assigned them? I was trying to follow the formatting of calc_raster_ts, but is that the best way to test to see if it works? or is there a different method I can use to run the function? Because if I follow the calc_raster_ts method don't we have to make a config file for it?

hourlyvarkeyID=$1
dailyvarkeyID=$2
start_date =-1
end_date =-1
if [ $# -gt 2 ]; then
  start_date=$3
fi
if [ $# -gt 3 ]; then
  end_date=$4
fi
if [ "$start_date" == "-1" ]; then
  start_epoch='-1'
else
  start_epoch=$(date -d "$start_date" +%s)
fi
if [ "$end_date" == "-1" ]; then
 end_epoch='-1' # Or leave it undefined
else
  end_epoch=$(date -d "$end_date" +%s)
fi

rastersql="
\\set hourlyvarkeyID '$hourlyvarkeyID' \n
\\set dailyvarkeyID '$dailyvarkeyID' \n
\\set start_epoch $start_epoch
\\set end_epoch $end_epoch

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(data.tstime),max(data.tsendtime), data.featureid, data.entity_type, ST_Union(data.rast, 'SUM') as rast, data.bbox, daily.hydroid as varid
from dh_timeseries_weather as data
left outer join tmp_dh_timeseries_weather as dupe
on (
 data.tstime = dupe.tstime
 and data.tsendtime = dupe.tsendtime
 and data.varid = dupe.varid
 and data.bbox = dupe.bbox
)
left outer join dh_variabledefinition as hour
on(
  data.varid = hour.hydroid
)
left outer join dh_variabledefinition as daily
on(
  daily.varkey = :'dailyvarkeyID'
)
where dupe.tid is null
AND hour.varkey = :'hourlyvarkeyID'
and ( (data.tstime >=:start_epoch) or (:start_epoch = -1) )
and ( (data.tsendtime <=:end_epoch) or (:end_epoch = -1) )
GROUP BY data.featureid, data.entity_type, data.bbox, daily.hydroid,
extract(day from to_timestamp(data.tsendtime)), 
extract(year from to_timestamp(data.tsendtime)), 
extract(month from to_timestamp(data.tsendtime));"

# turn off the expansion of the asterisk
set -f
echo -e $raster_sql > $raster_sql_file 
cat $raster_sql_file | psql -h $db_host $db_name

Here is my updated sql with the hardset values, it doesn't work if it is a single day, I'm not sure why that would be, but when I choose two days it works, and creates 2 values. By saying less than or equal I would have thought it would return if it was a singular day. My thought is that when we do this we'll need hours included, and it will have to end with like '2021-05-02 01:00:00' or something along those lines. Here is what it's returning:

INSERT 0 2

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(data.tstime),max(data.tsendtime), data.featureid, data.entity_type, ST_Union(data.rast, 'SUM') as rast, data.bbox, daily.hydroid as varid
from dh_timeseries_weather as data
left outer join tmp_dh_timeseries_weather as dupe
on (
 data.tstime = dupe.tstime
 and data.tsendtime = dupe.tsendtime
 and data.varid = dupe.varid
 and data.bbox = dupe.bbox
)
left outer join dh_variabledefinition as hour
on(
  data.varid = hour.hydroid
)
left outer join dh_variabledefinition as daily
on(
  daily.varkey = 'nldas2_precip_daily'
)
where dupe.tid is null
AND hour.varkey = 'nldas2_obs_hourly'
and (data.tstime >= extract(epoch from ('2021-05-01'::timestamp)))
and (data.tsendtime <= extract(epoch from ('2021-05-02'::timestamp)))
GROUP BY data.featureid, data.entity_type, data.bbox, daily.hydroid,
extract(day from to_timestamp(data.tsendtime)), 
extract(year from to_timestamp(data.tsendtime)), 
extract(month from to_timestamp(data.tsendtime));

HARPgroup / model_meteorology

Workflow Step: `met` -> `04_hourly_fractions` #98