USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems
Creative Commons Zero v1.0 Universal

Shared project folder on Caldera #115

Closed SimonTopp closed 2 years ago

SimonTopp commented 3 years ago

Seems like we might want a shared project folder with the various input data associated with river-dl so that we're not hosting it across all our personal directories. This would also limit the number of modifications each individual user would need to make in the config.yaml to get the project running. @jzwart, does your /caldera/projects/usgs/water/iidd/datasci/water-prediction/run-pgdl-da already have it (flow, temp, distance matrix, sntemp output)? If so, would it make sense to use that project folder?

Alternatively, do we want to consider adding another rule to the Snakemake workflow that pulls the data directly from the DRB_data_prep repository? I can see that being beneficial for reproducibility, but it might also result in the same duplication of the munged data across local directories that we have now.
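To illustrate the shared-folder option, here is a minimal sketch of how a single shared directory setting could replace per-user absolute paths. The function name `resolve_inputs`, the config keys, and the file names are all hypothetical and are not taken from the actual river-dl config.yaml; the directory is the one mentioned above.

```python
from pathlib import PurePosixPath


def resolve_inputs(shared_dir, relative_inputs):
    """Join each relative input path onto one shared project folder.

    Caldera is a Linux system, so POSIX path rules are assumed here.
    """
    root = PurePosixPath(shared_dir)
    return {key: str(root / rel) for key, rel in relative_inputs.items()}


# Hypothetical keys and file names, for illustration only.
inputs = resolve_inputs(
    "/caldera/projects/usgs/water/iidd/datasci/water-prediction/run-pgdl-da",
    {
        "obs_temp": "in/obs_temp.csv",
        "obs_flow": "in/obs_flow.csv",
        "dist_matrix": "in/distance_matrix.npz",
        "sntemp": "in/sntemp_output.feather",
    },
)
```

With a layout like this, each user would only change the one shared directory entry (if anything) rather than every input path.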

jzwart commented 3 years ago

/caldera/projects/usgs/water/iidd/datasci/water-prediction/run-pgdl-da does have some of that data on there, but it will pull the data from ScienceBase items. I've also been using a different config.yml for that project since there is a lot of forecast-specific stuff in there, so it doesn't exactly match the river-dl config.yml. I think pulling data from the ScienceBase items (e.g. https://www.sciencebase.gov/catalog/item/5f6a287382ce38aaa2449131) within the Snakemake file would be the way to go, or this could be in a targets file since there is an R package (sbtools) for pulling from and pushing to ScienceBase.

I have started a river-dl repo on caldera at /caldera/projects/usgs/water/iidd/datasci/water-prediction/river-dl so data could go there as well.

SimonTopp commented 3 years ago

Pulling from ScienceBase within the Snakemake file makes sense to me. This seems like a good item to put on the agenda for our big-picture strategy meeting tomorrow (6/21).

janetrbarclay commented 3 years ago

In case it's useful for this, there is a Python package for pulling data from ScienceBase.
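As a rough sketch of what that could look like, assuming the package in question is `sciencebasepy` and that its `SbSession.get_item` and `SbSession.get_item_files` calls are used to fetch an item's attached files. The helper name `fetch_item_files` and the destination folder are hypothetical; the item ID is the one from the ScienceBase URL quoted earlier in this thread.

```python
from pathlib import Path

# Item ID taken from the ScienceBase URL quoted earlier in this thread.
SB_ITEM_ID = "5f6a287382ce38aaa2449131"


def fetch_item_files(item_id, dest_dir, session=None):
    """Download every file attached to a ScienceBase item into dest_dir.

    Returns the sorted names of the files present in dest_dir afterwards.
    """
    if session is None:
        # Assumption: sciencebasepy is installed. Imported lazily so this
        # module can still be loaded (and tested) without it.
        from sciencebasepy import SbSession

        session = SbSession()

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    item = session.get_item(item_id)           # item metadata as a dict
    session.get_item_files(item, str(dest))    # downloads attached files
    return sorted(p.name for p in dest.iterdir())


if __name__ == "__main__":
    # Hypothetical destination; on Caldera this would be the shared folder.
    print(fetch_item_files(SB_ITEM_ID, "river_dl_inputs"))
```

A function like this could be called from a Snakemake rule's `run` block or a small script so that the download lands once in a shared location instead of in each user's directory.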

SimonTopp commented 3 years ago

@janetrbarclay, @jzwart, @jsadler2 I'm about to merge Jeff's PR where we discuss having a shared folder (but don't resolve it). Just wanted to pick this thread up again and hopefully find a solution so we can close it. It seems like we all agree that in some repository (here or in DRB data prep) we should have a pipeline that pulls the latest data from ScienceBase and then saves it to a shared folder on Caldera. Jeff, I'm guessing you already have the code to pull from SB and munge it into the river-dl format?

SimonTopp commented 2 years ago

This issue was resolved with the river-dl data prep repo.