Add support for parallel multi-watershed analysis (8)

huard commented 6 years ago

How do we use parallelism to do this ?

huard commented 5 years ago

@julemai Could you give us an example of configuration files that would run the same model on multiple watersheds (each with a different parameter set) but with common netCDF input files?

huard commented 5 years ago

Performance-wise, is there a large difference between running the model once with n watersheds or running it n times with one watershed each time?

richardarsenault commented 5 years ago

The difference is negligible. The total computing power required is the same. However, from an algorithmic point of view, you always want to run the optimization n times with one watershed at a time, and parallelize the watersheds.

julemai commented 5 years ago

It could be that reading in the data is more efficient when we set up Raven once for multiple watersheds. I’m assuming that this only makes a difference for large watersheds. If you get me an example, I could set it up and we would have some numbers.

richardarsenault commented 5 years ago

I agree for the case of large models, but in our case (for now) we only have lumped models so in theory the differences are pretty insignificant (maybe larger float values...?) :)

julemai commented 5 years ago

@richardarsenault Yep, that would be my bet too for lumped models. Let me know if you want to have it confirmed.

richardarsenault commented 5 years ago

We can test it, but I have performed enough calibrations to be pretty sure of this answer, and I am pretty sure you know it even more. I think for David's purposes this answer is sufficient. Thanks!

huard commented 5 years ago

Correct, thanks !

richardarsenault commented 5 years ago

In Matlab I use a parpool and loop over each watershed; I'm not sure how to do this in the PAVICS architecture. Essentially, I think this tool would be used further upstream in the workflow: for example, if you want to run a climate model on a bunch of catchments, you feed in a bunch of lat/long coordinates and/or basin contours and a climate run, along with parameter sets or Qobs for calibration, and then the code does the rest. I think this would have to be limited to simple tasks (i.e., a single climate model over the entire domain defined by the user).
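For discussion, a Python analogue of the parpool loop could look like the sketch below. It is not PAVICS code: `run_basin` and `run_all` are hypothetical names, and the body of `run_basin` is a placeholder for whatever actually launches a single-basin run. Threads are used on the assumption that each run mostly waits on an external model executable.

```python
from concurrent.futures import ThreadPoolExecutor


def run_basin(job):
    """Hypothetical stand-in for one single-basin model run."""
    # Placeholder: a real version would write the config files for this
    # basin and launch the model with its parameters and forcings.
    return {"name": job["name"], "n_params": len(job["params"])}


def run_all(jobs, max_workers=4):
    """Run every basin independently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_basin, jobs))
```

Each job here is a plain dict holding one basin's name and parameter set, so the basins never share state and can be scheduled in any order.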

huard commented 5 years ago

Let's recap. We want to be able to run one model over a number of different basins in a way that's simple for the user. Raven (C++) does not support this, so we need to loop within the python interface, as we do for multiple parameters. Here, however, it's not only the parameters that change. All of these inputs must be provided, one for each basin:

model parameters
basin information (name, area, elevation, latitude, longitude)
meteorological inputs

I fear this could get hairy pretty fast unless we restrict the scope of what is possible to do. My initial suggestion for discussion would be that to trigger multi-basin simulation, the user would pass a vector of parameters and, minimally, a matching vector of region_ind. We would assume that netCDF variables have two dimensions (region, time), and the region_ind value would indicate the index (starting at 0) of the time series for each region. For the region attributes (name, area, elevation, lat, lon), the user would either pass them explicitly as vectors in the call or leave them at their default (None). In the latter case, the code would look in the netCDF forcing files for variables with matching names (name, area, elevation, latitude/lat, longitude/lon) and set the attributes automatically.
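To make the lookup rule concrete, here is a minimal sketch. The function name `resolve_attributes` and the `ALIASES` table are hypothetical, and `variables` is a plain mapping standing in for the variables of an opened netCDF file. Raising an error when an attribute is found nowhere is an assumption, not something decided above.

```python
# Accepted variable names for each attribute, in order of preference.
ALIASES = {
    "name": ("name",),
    "area": ("area",),
    "elevation": ("elevation",),
    "latitude": ("latitude", "lat"),
    "longitude": ("longitude", "lon"),
}


def resolve_attributes(variables, region_ind, explicit=None):
    """Return one attribute dict per requested region.

    variables:  mapping of variable name -> values indexed by region.
    region_ind: 0-based indices of the regions to simulate.
    explicit:   optional mapping of attribute -> per-basin values;
                a missing or None entry means "look it up in the file".
    """
    explicit = explicit or {}
    basins = []
    for pos, idx in enumerate(region_ind):
        attrs = {"region_ind": idx}
        for attr, names in ALIASES.items():
            if explicit.get(attr) is not None:
                # Explicit vectors are aligned with region_ind, not the file.
                attrs[attr] = explicit[attr][pos]
                continue
            found = next((n for n in names if n in variables), None)
            if found is None:
                raise KeyError(f"no value for '{attr}' in call or file")
            attrs[attr] = variables[found][idx]
        basins.append(attrs)
    return basins
```

Explicit values take precedence over the file, which keeps the default (None) path fully automatic while still allowing overrides.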

richardarsenault commented 5 years ago

1- Yes, we need to loop within the python interface.

2- Correct, the model parameters, basin info and meteo inputs are required for each catchment. Eventually we might want to calibrate multiple basins in parallel, in which case the model parameters would not be required.

3- I think your suggestion is a great idea. Ideally the variables would be in the NetCDF file, stored as vectors of size [region_ind x 1] for each property (lat, long, area, elevation, etc.). Maybe there are some cases where we will want to pass those properties explicitly (not in NetCDF) but for now I fail to see them.

4- If we look at the bigger picture, imagine someone who has, for example, 5 hydro stations where they'd like to calibrate a hydrological model and have an estimate of simulated streamflow for forecasting purposes / climate change / whatever. They could simply pass the station coordinates to the GIS analysis tools to get the basin properties, then the code could extract a reference meteo dataset intersecting the catchment boundaries, send that to OSTRICH for calibration, then return the optimized parameter sets. Would we want the whole thing to be done in parallel? If so, we can force the code to write the properties, meteo and parameters to a NetCDF that gets updated incrementally as the process advances. If, instead, the user calls this function directly to run simulations in parallel, then we can probably simply require them to add the basin info to the NetCDF.

All this to say that I don't think we need to give the option to send the basin properties explicitly, but we can do a check to make sure they are present (and in the right format) in the NetCDF.

huard commented 5 years ago

Ok. @richardarsenault I suggest you open an issue for "Input dataset preparation", that would help with the task of going from gridded climate variables to basin-specific time series in a raven-compatible format.

julemai commented 5 years ago

RAVEN can technically be set up to run multiple basins in parallel. You can provide different sets of parameters for the different basins if the parameters are basin-specific and not process-specific. The process description would need to be the same for all simulated basins.

Hence, I think running only individual basins with RAVEN and implementing the loop over the basins in Python is far more elegant. :)

The NetCDF inputs (forcings) can be two-dimensional (stations x time or time x stations). One would need to specify the :StationIdx in the data block (one for each forcing). The block looks like this:

:Data [forcing type] [unit]
   :ReadFromNetCDF
      :FileNameNC      [path/filename of .nc file]
      :VarNameNC       [name of variable in .nc file]
      :DimNamesNC      [stations_name] [time_name] | [time_name]
      :StationIdx      [ID of station of interest (starts with 1)]
      :TimeShift       [fractional day to shift time stamp of data]
      :LinearTransform [slope] [intercept]
   :EndReadFromNetCDF
:EndData
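Since :StationIdx starts at 1 while the region_ind proposed earlier starts at 0, the Python loop has to shift the index when writing this block. A rough sketch of that step follows. `data_block` is a hypothetical helper, the default dimension names are placeholders, and :TimeShift and :LinearTransform are left out on the assumption that they are optional.

```python
def data_block(forcing, unit, filename, varname, region_ind,
               dims=("region", "time")):
    """Render a :Data block for one basin from a 0-based region index."""
    station_idx = region_ind + 1  # :StationIdx is 1-based
    return "\n".join([
        f":Data {forcing} {unit}",
        "   :ReadFromNetCDF",
        f"      :FileNameNC      {filename}",
        f"      :VarNameNC       {varname}",
        f"      :DimNamesNC      {' '.join(dims)}",
        f"      :StationIdx      {station_idx}",
        "   :EndReadFromNetCDF",
        ":EndData",
    ])
```

Calling this once per forcing and per basin keeps the off-by-one conversion in a single place.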

More information can be found in the current RAVEN manual for v2.9 here (see page 151ff).

huard commented 5 years ago

Hi @julemai,

I've created a 2D file for testing this and I'm having a weird core dump with Raven. The unit_t variable (time units) is causing an error on Line 1324 of TimeSeries.cpp. There is no colon in the time units, it's just days since 1954-01-01. When I print unit_t_str, I get days since 1954-01-01U, and I guess that U is causing problems. Ideas ?

huard commented 5 years ago

Is it possible that you add this U to identify unicode strings ? Since I've created this file using Python3, it's possible it converted the original ascii string to unicode.

huard commented 5 years ago

You can test it using branch fix-12 and running the test_parallel_basins in test_emulators.py

julemai commented 5 years ago

@huard I tried to run make test in this branch. This creates the following error for each test:

    import fiona
E   ModuleNotFoundError: No module named 'fiona'

I tried to run another make install. But that one gives me the error:

[j6mai@JulieUW:/Users/j6mai/Documents/GitHub/raven/]:
make install
Installing Anaconda ...
Updating conda environment raven ...
"/bin/conda" create --yes -n raven python=3.6
/bin/sh: /bin/conda: No such file or directory
make: *** [conda_env] Error 127

Any idea?

julemai commented 5 years ago

In Raven, I assumed that the time unit string follows the pattern:

days since YYYY-MM-DD HH:MM:SS

You could try to add HH:MM:SS in your unit and it should work. I will work on fixing this in RAVEN assuming that if time is not given it is midnight.
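On the Python side, the units attribute could also be normalized before the file is handed to Raven. The sketch below uses a hypothetical helper, `normalize_time_units`, which fills in midnight when the time of day is missing. It only covers the `<unit> since YYYY-MM-DD[ HH:MM:SS]` form discussed here and rejects anything else.

```python
import re

# "<unit> since YYYY-MM-DD" with an optional " HH:MM:SS" (or "THH:MM:SS").
_UNITS = re.compile(
    r"^(?P<unit>\w+) since (?P<date>\d{4}-\d{2}-\d{2})"
    r"(?:[ T](?P<time>\d{2}:\d{2}:\d{2}))?$"
)


def normalize_time_units(units):
    """Return the units string with an explicit HH:MM:SS part."""
    m = _UNITS.match(units.strip())
    if m is None:
        raise ValueError(f"unrecognized time units: {units!r}")
    # A missing time of day is taken to mean midnight.
    return f"{m['unit']} since {m['date']} {m['time'] or '00:00:00'}"
```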

julemai commented 5 years ago

New Raven version (rev183) is available here. It will throw an error if the time string is not YYYY-MM-DD HH:MM:SS.

huard commented 5 years ago

Thanks, you're right. I had not noticed that when I created the new file, the 00:00:00 disappeared.

Ouranosinc / raven

Add support for parallel multi-watershed analysis (8) #12