SimonTopp closed this issue 2 years ago
/caldera/projects/usgs/water/iidd/datasci/water-prediction/run-pgdl-da does have some of that data on there, but will pull the data from sciencebase items. I've also been using a different config.yml for that project, since there is a lot of forecast-specific stuff in there, so it doesn't exactly match the river-dl config.yml. I think pulling data from the sciencebase items (e.g. https://www.sciencebase.gov/catalog/item/5f6a287382ce38aaa2449131) within the snakemake file would be the way to go, or this could be in a targets file since there is an R package (sbtools) for pulling/pushing to sciencebase.
I have started a river-dl repo on caldera at /caldera/projects/usgs/water/iidd/datasci/water-prediction/river-dl so data could go there as well.
Pulling from sciencebase within the snakemake makes sense to me. This seems like a good item to put onto the agenda for our big picture strategy meeting tomorrow (6/21).
In case it's useful for this, there is a python package for pulling data from Sciencebase.
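For reference, here is a minimal sketch of what pulling one item's files with that package (sciencebasepy) could look like. It assumes the item is public, so no login is needed. The function name `fetch_item_files` and the optional `session` argument are made up for illustration and are not part of river-dl or of sciencebasepy itself.

```python
from pathlib import Path


def fetch_item_files(item_id, dest_dir, session=None):
    """Download every file attached to a ScienceBase item into dest_dir.

    Hypothetical helper: `session` can be any object exposing
    get_item() and get_item_files(); by default a sciencebasepy
    SbSession is created (anonymous, so the item must be public).
    """
    if session is None:
        # Imported lazily so the module loads without sciencebasepy installed
        from sciencebasepy import SbSession
        session = SbSession()

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Fetch the item's metadata, then download its attached files
    item = session.get_item(item_id)
    session.get_item_files(item, str(dest))

    # Report what is now in the destination folder
    return sorted(p.name for p in dest.iterdir())
```

A snakemake rule could then call something like `fetch_item_files("5f6a287382ce38aaa2449131", <shared data folder>)` in a `run:` block, so the download happens once into the shared location rather than into each personal directory.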
@janetrbarclay, @jzwart, @jsadler2 I'm about to merge Jeff's PR where we discuss having a shared folder (but don't resolve it). Just wanted to pick this thread up again and hopefully find a solution so we can close it. It seems like we all agree that in some repository (here or in DRB data prep) we should have a pipeline that pulls the latest data from Sciencebase and then saves it to a shared folder on Caldera. Jeff, I'm guessing you already have the code to pull from SB and munge it into the river-dl format?
This issue was resolved with the river-dl data prep repo.
Seems like we might want a shared project folder with the various input data associated with river-dl so that we're not hosting it across all our personal directories. This would also limit the number of modifications each individual user would need to make in the config.yaml to get the project running. @jzwart does your /caldera/projects/usgs/water/iidd/datasci/water-prediction/run-pgdl-da already have it (flow, temp, distance matrix, sntemp output)? If so, would it make sense to use that project folder?
Alternatively, do we want to consider adding another rule to the snakemake that pulls the data directly from the DRB_data_prep repository? I can see that being beneficial for reproducibility, but it might also result in the same duplication of the munged data across local directories that we have now.