HARPgroup / model_meteorology

0 stars 0 forks source link

Workflow Step: `met` -> `04_hourly_fractions` #98

Open mwdunlap2004 opened 2 weeks ago

mwdunlap2004 commented 2 weeks ago

Check method to change this workflow into daily data, without temp tables, that inserts into dh_timerseries_weather using methods from calc_raster_ts and timestamp methods

We want to take the work from #86 and convert it into something more similar to raster_calc:

Our Delete function for daily_precip

dailyvarkeyID=$1
start_date=-1
end_date=-1

if [ $# -gt 1 ]; then
  start_date=$2
fi

if [ $# -gt 2 ]; then
  end_date=$3
fi

if [ "$start_date" == "-1" ]; then
  start_epoch=-1
else
  start_epoch=$(date -d "$start_date" +%s)
fi

if [ "$end_date" == "-1" ]; then
  end_epoch=-1
else
  end_epoch=$(date -d "$end_date" +%s)
fi
deletedaily="
\\set dailyvarkeyID '$dailyvarkeyID'
\\set start_epoch $start_epoch
\\set end_epoch $end_epoch

delete from tmp_dh_timeseries_weather as data
using dh_variabledefinition as variables
where data.varid = variables.hydroid
  and variables.varkey = :'dailyvarkeyID'
  and ( (data.tstime >= :start_epoch) or (:start_epoch = -1) )
  and ( (data.tsendtime <= :end_epoch) or (:end_epoch = -1) )
;"
# turn off the expansion of the asterisk
set -f
echo -e $deletedaily > $delete_daily_file
cat $delete_daily_file | psql -h $db_host $db_name

Our Insert function for daily_precip

hourlyvarkeyID=$1
dailyvarkeyID=$2
start_date =-1
end_date =-1
if [ $# -gt 2 ]; then
  start_date=$3
fi
if [ $# -gt 3 ]; then
  end_date=$4
fi
if [ "$start_date" == "-1" ]; then
  start_epoch='-1'
else
  start_epoch=$(date -d "$start_date" +%s)
fi
if [ "$end_date" == "-1" ]; then
 end_epoch='-1' # Or leave it undefined
else
  end_epoch=$(date -d "$end_date" +%s)
fi
rastersql="
\\set hourlyvarkeyID '$hourlyvarkeyID' \n
\\set dailyvarkeyID '$dailyvarkeyID' \n
\\set start_epoch $start_epoch
\\set end_epoch $end_epoch

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(data.tstime),max(data.tsendtime), data.featureid, data.entity_type, ST_Union(data.rast, 'SUM') as rast, data.bbox, daily.hydroid as varid
from dh_timeseries_weather as data
left outer join tmp_dh_timeseries_weather as dupe
on (
 data.tstime = dupe.tstime
 and data.tsendtime = dupe.tsendtime
 and data.varid = dupe.varid
 and data.bbox = dupe.bbox
)
left outer join dh_variabledefinition as hour
on(
  data.varid = hour.hydroid
)
left outer join dh_variabledefinition as daily
on(
  daily.varkey = :'dailyvarkeyID'
)
where dupe.tid is null
AND hour.varkey = :'hourlyvarkeyID'
and ( (data.tstime >=:start_epoch) or (:start_epoch = -1) )
and ( (data.tsendtime <=:end_epoch) or (:end_epoch = -1) )
GROUP BY data.featureid, data.entity_type, data.bbox, daily.hydroid,
extract(day from to_timestamp(data.tsendtime)), 
extract(year from to_timestamp(data.tsendtime)), 
extract(month from to_timestamp(data.tsendtime));"
# turn off the expansion of the asterisk
set -f
echo -e $raster_sql > $raster_sql_file 
cat $raster_sql_file | psql -h $db_host $db_name
mwdunlap2004 commented 6 days ago

Rob I was thinking of doing both steps in the same workflow, but I think once we have the process completed it could easily be separated. This is my take on our step to create the daily precipitation data in dh_timeseriesweather, it isn't working right now, as it isn't inserting any data, I'm thinking this is related to the varid, in this step, we simultaneously need to make sure our varkey is the hourly data, and that we change it to the daily varid, how do we do that? I did check and there is a hydroid for nldas2_precip_daily, its 1464, I'm just unsure on how to handle us having two different values in this step. I don't think calc_raster has that step, they join on the hydroid, but they don't update it in that step.

varkey=$1 \set varkey '$varkey' \n end_epoch = $2

How to make our epoch

select extract(epoch from ('2021-05-01'::timestamp));
insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(a.tstime),max(a.tsendtime), a.featureid, a.entity_type, ST_Union(a.rast, 'SUM') as rast, a.bbox, b.hydroid as varid
from dh_timeseries_weather as a
left outer join tmp_dh_timeseries_weather as dupe
on (
 a.tstime = dupe.tstime
 and a.tsendtime = dupe.tsendtime
 and a.varid = dupe.varid
 and a.bbox = dupe.bbox
)
left outer join dh_variabledefinition as b 
on(
  a.varid = b.hydroid AND
  b.varkey = 'varkey'
)
where dupe.tid is null
AND b.varkey = 'nldas2_obs_hourly'
AND extract(date from to_timestamp(a.tsendtime)) = extract(date from to_timestamp(end_epoch))
GROUP BY a.featureid, a.entity_type, a.bbox, b.hydroid,
extract(day from to_timestamp(a.tsendtime)), 
extract(year from to_timestamp(a.tsendtime)), 
extract(month from to_timestamp(a.tsendtime));
COBrogan commented 5 days ago

@mwdunlap2004 I think there are a few issues here with your psql variables. First, I'm a little confused by these lines:

varkey=$1
\set varkey '$varkey' \n
end_epoch = $2

Are arguments $1 and $2 coming in from the command line? If so, I'd suggest declaring these first and then using \set (basically, switch lines 2 and 3). For best practice, let's be more specific in our declaration of varkey too. We should try to use the brackets notation with string interpolation, though it's not really necessary here (and thus we are not always consistent):

varkey=$1
end_epoch = $2
\set varkey '${varkey}' \n

One reason your query is not working, is the following:

left outer join dh_variabledefinition as b 
on(
  a.varid = b.hydroid AND
  b.varkey = 'varkey'
)

This is literally joining on varkey = 'varkey', so it's looking for all entries where varkey is equal to the word 'varkey'. So not what we want. Instead, you need to call your SQL variable using the colon notation:

left outer join dh_variabledefinition as b 
on(
  a.varid = b.hydroid AND
  b.varkey = :'varkey'
)

As for setting the daily varkey, look at what you are sending to INSERT. Rather than inserting b.hydroid, send a string variable that you need instead e.g. 'dailyvarkeyID' as varid

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(a.tstime),max(a.tsendtime), a.featureid, a.entity_type, ST_Union(a.rast, 'SUM') as rast, a.bbox, b.hydroid as varid
rburghol commented 5 days ago

@COBrogan great observations and points of style. I would add:

I think you definitely need to separate the creation of the daily total and the hourly fraction into separate scripts. This will simplify debugging, increase readability and also allow us to utilize the hourly-to-daily separately. If you don't separate them you will now need to pass in 3 varkeys as arguments: the hourly data varkey, the hourly fraction varkey and the daily varkey -- another reason that splitting simplifies things.

mwdunlap2004 commented 5 days ago

Based on your suggestions here is my change to the code to insert our daily rasters into dh_timeseries_weather, however I do have a few questions about how this is working. First in this example we set dailyvarkeyID as varid, does this mean we'd want this to be a number '1464' in this case, as opposed to the character string I had used for our fraction raster step previously. What I mean is that in the past we had used a left join to update the records in our table using this new hydroid based on the dh_variabledefintion table and a new varkey.

Second, I am working under the assumption that we would want our epoch to be used as an argument into this function, but would we rather create it when we set the variables? This would mean the argument is the date and then I could do something like:

\set epoch '${extract(epoch from ('date'::timestamp))}' \n

I'm not sure if this would work, but that brings me to my third question, which is how do I declare these variables in my environment? I tried this to set our date, but I'm not sure how to check if it worked, or how to call my variable once I've done this.

Do $$ Declare data_date DATE; BEGIN data_date = '2021-05-01';
END $$;

hourlyvarkeyID=$1 dailyvarkeyID=$2 end_epoch = $3


rastersql="
\\set hourlyvarkeyID '${hourlyvarkeyID}' \n
\\set dailyvarkeyID '${dailyvarkeyID}' \n

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(a.tstime),max(a.tsendtime), a.featureid, a.entity_type, ST_Union(a.rast, 'SUM') as rast, a.bbox, :'dailyvarkeyID' as varid
from dh_timeseries_weather as a
left outer join tmp_dh_timeseries_weather as dupe
on (
 a.tstime = dupe.tstime
 and a.tsendtime = dupe.tsendtime
 and a.varid = dupe.varid
 and a.bbox = dupe.bbox
)
left outer join dh_variabledefinition as b 
on(
  a.varid = b.hydroid
)
where dupe.tid is null
AND b.varkey =: 'hourlyvarkeyID'
AND extract(day from to_timestamp(a.tsendtime)) = extract(day from to_timestamp(end_epoch))
AND extract(year from to_timestamp(a.tsendtime)) = extract(year from to_timestamp(end_epoch))
AND extract(month from to_timestamp(a.tsendtime)) = extract(month from to_timestamp(end_epoch))
GROUP BY a.featureid, a.entity_type, a.bbox, b.hydroid,
extract(day from to_timestamp(a.tsendtime)), 
extract(year from to_timestamp(a.tsendtime)), 
extract(month from to_timestamp(a.tsendtime));"
mwdunlap2004 commented 3 days ago

This is my updated code, it uses start and end_date and converts them into epoch, or sets them to -1 if we don't give them a date, it uses our defined hourly and varkeyID,uses two left outer joins it also does the time matching based on our epoch. However, how do I run this code to see if it works? Because I tried to set the variables in SQL, and it said that ' \ ' was an invalid command, is it because this is really a bash command not an SQL one?

hourlyvarkeyID=$1
dailyvarkeyID=$2
start_date =-1
end_date =-1
if [ $# -gt 2 ]; then
  start_date=$3
fi
if [ $# -gt 3 ]; then
  end_date=$4
fi
if [ "$start_date" == "-1" ]; then
  start_epoch='-1'
else
  start_epoch=$(date -d "$start_date" +%s)
fi
if [ "$end_date" == "-1" ]; then
 end_epoch='-1' # Or leave it undefined
else
  end_epoch=$(date -d "$end_date" +%s)
fi
rastersql="
\\set hourlyvarkeyID '$hourlyvarkeyID' \n
\\set dailyvarkeyID '$dailyvarkeyID' \n
\\set start_epoch $start_epoch
\\set end_epoch $end_epoch

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(data.tstime),max(data.tsendtime), data.featureid, data.entity_type, ST_Union(data.rast, 'SUM') as rast, data.bbox, daily.hydroid as varid
from dh_timeseries_weather as data
left outer join tmp_dh_timeseries_weather as dupe
on (
 data.tstime = dupe.tstime
 and data.tsendtime = dupe.tsendtime
 and data.varid = dupe.varid
 and data.bbox = dupe.bbox
)
left outer join dh_variabledefinition as hour
on(
  data.varid = hour.hydroid
)
left outer join dh_variabledefinition as daily
on(
  daily.varkey = :'dailyvarkeyID'
)
where dupe.tid is null
AND hour.varkey = :'hourlyvarkeyID'
and ( (data.tstime >= $start_epoch) or ($start_epoch = -1) )
and ( (data.tsendtime <= $end_epoch) or ($end_epoch = -1) )
GROUP BY data.featureid, data.entity_type, data.bbox, daily.hydroid,
extract(day from to_timestamp(data.tsendtime)), 
extract(year from to_timestamp(data.tsendtime)), 
extract(month from to_timestamp(data.tsendtime));"
mwdunlap2004 commented 3 days ago

Here is my attempt at making the delete statement: It takes in similar requests, using the varkeyID and the start and end dates to create our epochs and then deleting the rows in our data that match those values, it also has the default of -1 to delete all the rows, so that if you want all the data for the daily table, it will delete all the rows.

dailyvarkeyID=$1
start_date=-1
end_date=-1

if [ $# -gt 1 ]; then
  start_date=$2
fi

if [ $# -gt 2 ]; then
  end_date=$3
fi

if [ "$start_date" == "-1" ]; then
  start_epoch=-1
else
  start_epoch=$(date -d "$start_date" +%s)
fi

if [ "$end_date" == "-1" ]; then
  end_epoch=-1
else
  end_epoch=$(date -d "$end_date" +%s)
fi
deletedaily="
\\set dailyvarkeyID '$dailyvarkeyID'
\\set start_epoch $start_epoch
\\set end_epoch $end_epoch

delete from tmp_dh_timeseries_weather as data
using dh_variabledefinition as variables
where data.varid = variables.hydroid
  and variables.varkey = :'dailyvarkeyID'
  and ( (data.tstime >= :start_epoch) or (:start_epoch = -1) )
  and ( (data.tsendtime <= :end_epoch) or (:end_epoch = -1) )
;"
COBrogan commented 3 days ago

@mwdunlap2004 Not sure if this answers your question about set but you only need one slash when running the command manually e.g.

\set testVar
select :testVar

However, when you are writing the Bash command, the backslash has to be escaped so that you need \\set and each has to be on its own line (and thus needs to end with \n). Notice the \n at the end of your first set. In your INSERT query, your start_epoch and end_epoch need to be called using the colon notation in your WHERE statement e.g.

and ( (data.tstime >= :start_epoch) or (:start_epoch = -1) )
and ( (data.tsendtime <= :end_epoch) or (:end_epoch = -1) )

Are you running this line-by-line, or writing the query to file via echo/cat? e.g.

# turn off the expansion of the asterisk
set -f
echo -e $raster_sql > $raster_sql_file 
cat $raster_sql_file | psql -h $db_host $db_name
mwdunlap2004 commented 1 day ago

@COBrogan @rburghol So more along the lines of this? I also don't have quotes around the epoch's right now, is that something I need because I assigned them? I was trying to follow the formatting of calc_raster_ts, but is that the best way to test to see if it works? or is there a different method I can use to run the function? Because if I follow the calc_raster_ts method don't we have to make a config file for it?

hourlyvarkeyID=$1
dailyvarkeyID=$2
start_date =-1
end_date =-1
if [ $# -gt 2 ]; then
  start_date=$3
fi
if [ $# -gt 3 ]; then
  end_date=$4
fi
if [ "$start_date" == "-1" ]; then
  start_epoch='-1'
else
  start_epoch=$(date -d "$start_date" +%s)
fi
if [ "$end_date" == "-1" ]; then
 end_epoch='-1' # Or leave it undefined
else
  end_epoch=$(date -d "$end_date" +%s)
fi
rastersql="
\\set hourlyvarkeyID '$hourlyvarkeyID' \n
\\set dailyvarkeyID '$dailyvarkeyID' \n
\\set start_epoch $start_epoch
\\set end_epoch $end_epoch

insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(data.tstime),max(data.tsendtime), data.featureid, data.entity_type, ST_Union(data.rast, 'SUM') as rast, data.bbox, daily.hydroid as varid
from dh_timeseries_weather as data
left outer join tmp_dh_timeseries_weather as dupe
on (
 data.tstime = dupe.tstime
 and data.tsendtime = dupe.tsendtime
 and data.varid = dupe.varid
 and data.bbox = dupe.bbox
)
left outer join dh_variabledefinition as hour
on(
  data.varid = hour.hydroid
)
left outer join dh_variabledefinition as daily
on(
  daily.varkey = :'dailyvarkeyID'
)
where dupe.tid is null
AND hour.varkey = :'hourlyvarkeyID'
and ( (data.tstime >=:start_epoch) or (:start_epoch = -1) )
and ( (data.tsendtime <=:end_epoch) or (:end_epoch = -1) )
GROUP BY data.featureid, data.entity_type, data.bbox, daily.hydroid,
extract(day from to_timestamp(data.tsendtime)), 
extract(year from to_timestamp(data.tsendtime)), 
extract(month from to_timestamp(data.tsendtime));"
# turn off the expansion of the asterisk
set -f
echo -e $raster_sql > $raster_sql_file 
cat $raster_sql_file | psql -h $db_host $db_name
mwdunlap2004 commented 1 day ago

Here is my updated sql with the hardset values, it doesn't work if it is a single day, I'm not sure why that would be, but when I choose two days it works, and creates 2 values. By saying less than or equal I would have thought it would return if it was a singular day. My thought is that when we do this we'll need hours included, and it will have to end with like '2021-05-02 01:00:00' or something along those lines. Here is what it's returning:

INSERT 0 2
insert into tmp_dh_timeseries_weather ( tstime,tsendtime, featureid, entity_type, rast, bbox, varid)
select min(data.tstime),max(data.tsendtime), data.featureid, data.entity_type, ST_Union(data.rast, 'SUM') as rast, data.bbox, daily.hydroid as varid
from dh_timeseries_weather as data
left outer join tmp_dh_timeseries_weather as dupe
on (
 data.tstime = dupe.tstime
 and data.tsendtime = dupe.tsendtime
 and data.varid = dupe.varid
 and data.bbox = dupe.bbox
)
left outer join dh_variabledefinition as hour
on(
  data.varid = hour.hydroid
)
left outer join dh_variabledefinition as daily
on(
  daily.varkey = 'nldas2_precip_daily'
)
where dupe.tid is null
AND hour.varkey = 'nldas2_obs_hourly'
and (data.tstime >= extract(epoch from ('2021-05-01'::timestamp)))
and (data.tsendtime <= extract(epoch from ('2021-05-02'::timestamp)))
GROUP BY data.featureid, data.entity_type, data.bbox, daily.hydroid,
extract(day from to_timestamp(data.tsendtime)), 
extract(year from to_timestamp(data.tsendtime)), 
extract(month from to_timestamp(data.tsendtime));