HARPgroup / model_meteorology

Workflow step `geo -> process -> 04_weekly_data`: Create R script to produce weekly mean value file #61

Open rburghol opened 2 months ago

rburghol commented 2 months ago
ilonah22 commented 1 month ago

I am having some trouble running Rscript from the command line; I keep getting an error saying the Rscript command is not found.

$ Rscript hydroimport_daily.R
bash: Rscript: command not found

I tried this both inside and outside of the HARP archive directory. The R script I am trying to run is stored in the HARP 2024-2025 folder on my computer.

I also can't find any files or folders called Rscript on my computer. Does that mean I don't have it and will have to download something else?

COBrogan commented 1 month ago

Hi @ilonah22. This error indicates your computer can't find the bin folder associated with R. We will need to add it to your "PATH" Windows environment variable:

  1. Search for "Edit environment variables for your account" in the Windows search bar.
  2. Select "Path" and click "Edit".
  3. Click "New" and add the path to the "bin" directory under your R install folder. My R is installed under AppData, but yours may be installed under your "Documents/" folder.

Let me know if you have any trouble and we can hop on a call to debug this. It's an annoying issue. Basically, we are telling Windows where R is so it knows where to find commands like Rscript.
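If it helps, one optional way to confirm the exact folder to add is to run these two lines from an interactive R or RStudio session (a quick check, not a required step):

# Full path to the "bin" folder of the R installation you are currently running;
# this is the directory to paste into the new PATH entry described above.
R.home("bin")

# After updating PATH and restarting the terminal, this should return the full
# path to Rscript instead of an empty string.
Sys.which("Rscript")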

ilonah22 commented 1 month ago

@COBrogan Thank you. I had some trouble finding the right path, but it seems to be working now.

ilonah22 commented 1 month ago

I made some changes to the daily value file I have been working on that incorporate the commandArgs() we talked about. I didn't want to get too far on a weekly version until I got this one working from the command line.

# Inputs (args):
# 1 = File path of csv from VA Hydro
# 2 = End path of new csv
args <- commandArgs(trailingOnly = TRUE)

# Pull csv from input file path
hydro_daily <- read.csv(args[1])

# Add in more date information
hydro_daily[,c('yr', 'mo', 'da', 'wk')] <- cbind(year(as.Date(hydro_daily$obs_date)),
                                                 month(as.Date(hydro_daily$obs_date)),
                                                 day(as.Date(hydro_daily$obs_date)),
                                                 week(as.Date(hydro_daily$obs_date)))

# If data comes from nladas2 (hourly), it must be converted into daily data
if (data_source=="nldas2"){
  hydro_daily <- sqldf(
    "select featureid, min(obs_date) as obs_date, yr, mo, da, 
     sum(precip_mm) as precip_mm, sum(precip_in) as precip_in
   from hydro_daily 
   group by yr, mo, da
   order by yr, mo, da
  "
  )}

# Write csv in new file path
write.csv(hydro_daily,args[2])

But when I tried to run this from the command line, I got a segmentation fault error.

$ Rscript hydroimport_daily.R "C:/Users/ilona/OneDrive - Virginia Tech/HARP/R Tests/usgs_ws_03176500-daymet-all.csv" "C:/Users/ilona/OneDrive - Virginia Tech/HARP/R Tests/Glen-RscriptTest.csv"
Segmentation fault

I'm not sure if it's a problem with the script or how I entered the inputs into the command line.

* I don't know if this matters, but I downloaded the csv from http://deq1.bse.vt.edu:81/met/daymet/out/ onto my computer because I thought that would have the best chance of working.

COBrogan commented 1 month ago

@ilonah22 Sorry for the delay in response; I somehow missed this yesterday afternoon. I was able to get this script to work, but I had to make a few changes. First, a few things were missing from the script: the lubridate library needs to be loaded because of the calls to lubridate::year(), lubridate::month(), etc., and the if statement is looking for data_source, which is never defined. We should probably pass data_source in as an argument to the script, changing if (data_source == "nldas2") { ... } to if (args[3] == "nldas2") { ... }.
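For reference, here is a minimal sketch of how the top of the revised script could look, assuming the data source is passed as a third argument (the argument order here is an assumption, not a final design):

# Libraries the script depends on
library(lubridate)  # year(), month(), day(), week()
library(sqldf)      # SQL-style aggregation used for the nldas2 hourly-to-daily step

# Inputs (args):
# 1 = File path of csv from VA Hydro
# 2 = End path of new csv
# 3 = data source, e.g. "nldas2", "prism", "daymet"  (assumed new argument)
args <- commandArgs(trailingOnly = TRUE)
data_source <- args[3]

hydro_daily <- read.csv(args[1])

# ...date columns and the nldas2 hourly-to-daily aggregation as in the draft above...

if (data_source == "nldas2") {
  # hourly nldas2 records get summed to daily totals here
}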

Now, these errors did NOT reproduce the segmentation fault you're seeing. Are you passing in the correct path to hydroimport_daily.R? Maybe try using an absolute path, as I did in my call below:

Rscript c:/Users/gcw73279.COV/Desktop/testCommand.R "C:/Users/gcw73279.COV/Downloads/usgs_ws_01656000-daymet-all.csv" "C:/Users/gcw73279.COV/Desktop/testOut.csv" "prism"

It may help to add a print("Script started!") call at the beginning of hydroimport_daily.R and another right before the write.csv(). That would print messages to your console, helping you figure out whether the script is being called successfully and whether it reaches the write.csv() step. Segmentation fault errors are typically memory or access errors...
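For example (hypothetical placement, just to show the idea):

print("Script started!")                # prints as soon as Rscript starts executing the file

# ...read.csv(), date handling, and any aggregation steps...

print("Reached the write.csv() step")   # if this prints, the crash is not in the processing steps
write.csv(hydro_daily, args[2])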

ilonah22 commented 1 month ago

We finished a first draft of make_weekly_summary_ts.R, which should be in the main branch now, but there were a few bits we had trouble with:

  1. [weekly_column(default=weekly_mean_value)]: we were a little confused about this aspect of the inputs, so we have not included it yet.
  2. start_date and end_date: the way we went about the weekly summary, we assumed the input file would be the comp_data, which does not have columns with those names, so the warning messages about these columns are commented out for now.

Other than those two issues, I was able to run it with a comp_data file that @mwdunlap2004 had already made, and it looks like it worked.

COBrogan commented 1 month ago

I think that [weekly_column(default=weekly_mean_value)] is suggesting that the output dataset use this input as the name of the column. So, if we use Rob's example, Rscript.exe make_weekly_summary_ts.R source_filename output_daily_filename data_column weekly_average_value implies the output dataset should contain a column called "weekly_average_value" that holds the weekly means. If this argument is not provided, we should proceed with the default column name "weekly_mean_value", e.g. in R:

# Set column name of output file using the third argument, if provided.
# Otherwise, default the name to "weekly_mean_value".
colName <- args[3]
if (is.na(colName)) {   # args[3] is NA, not NULL, when only two arguments are passed
  colName <- "weekly_mean_value"
}

mwdunlap2004 commented 1 month ago

I'm still confused about what the requirements mean and how to implement the column name and data_column aspects. For example, the comp_data csv we've created has 7 data columns, including usgs_cfs, dataset_p_in, and dataset_cfs. Which one do we want the weekly average for, or is it all of the columns? Could we meet at some point today to discuss how to go about this?

rburghol commented 1 month ago

Hey @mwdunlap2004, I am thinking you probably know the answer to your question above, but yes, all data columns will be averaged.
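For what it's worth, here is a minimal sketch of averaging every data column by week (assuming the comp_data csv already carries the yr and wk columns produced by the daily script; the file names here are placeholders):

library(dplyr)

comp_data <- read.csv("comp_data.csv")  # placeholder input path

# Average every numeric data column within each year/week group;
# the grouping columns (yr, wk) are kept as-is.
weekly_summary <- comp_data |>
  group_by(yr, wk) |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)), .groups = "drop")

write.csv(weekly_summary, "weekly_summary.csv", row.names = FALSE)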