UW-Hydro / tonic

A pre/post processing toolbox for hydrologic models
MIT License

4-d Variable Output Using Standard Memory Mode #24

Open anewman89 opened 8 years ago

anewman89 commented 8 years ago

I used the "standard" memory mode and ran into an issue. If I tried to include soil moisture (a 4-d variable) in the configuration file, I got the following error when the code tried to write to the netCDF files after loading all the files in the current chunk:

Traceback (most recent call last):
  File "/glade/u/home/anewman/bin/vic_utils", line 5, in <module>
    pkg_resources.run_script('tonic==0.0.0.dev-2bf5167', 'vic_utils')
  File "/glade/apps/opt/python/2.7.7/gnu-westmere/4.8.2/lib/python2.7/site-packages/pkg_resources.py", line 534, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/glade/apps/opt/python/2.7.7/gnu-westmere/4.8.2/lib/python2.7/site-packages/pkg_resources.py", line 1441, in run_script
    exec(script_code, namespace, namespace)
  File "/glade/u/home/anewman/lib/python2.7/site-packages/tonic-0.0.0.dev_2bf5167-py2.7.egg/EGG-INFO/scripts/vic_utils", line 221, in <module>

  File "/glade/u/home/anewman/lib/python2.7/site-packages/tonic-0.0.0.dev_2bf5167-py2.7.egg/EGG-INFO/scripts/vic_utils", line 197, in main

  File "build/bdist.linux-x86_64/egg/tonic/models/vic/vic2netcdf.py", line 546, in _run
  File "build/bdist.linux-x86_64/egg/tonic/models/vic/vic2netcdf.py", line 896, in vic2nc
  File "build/bdist.linux-x86_64/egg/tonic/models/vic/vic2netcdf.py", line 459, in nc_add_data_standard
  File "netCDF4.pyx", line 3267, in netCDF4.Variable.__setitem__ (netCDF4.c:39658)
ValueError: total size of new array must be unchanged

I traced it back to line ~448:

self.f.variables[name][:, i, ys, xs] 

expects something 2-dimensional, while

p.df[sn].values[self.slice]

is only 1-dimensional, with a length equal to the number of time steps going into the current netCDF file. If I removed soil moisture from the configuration file, the output was written properly, so the issue is specific to 4-d variables.
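
To illustrate, here is a standalone sketch of the kind of shape mismatch I mean (the dimension names and sizes are made up and this is not tonic's actual code): a 4-d variable needs a (time, layer) block for each grid cell, while each soil-moisture column in a point's DataFrame is only a single (time,) series.

    import numpy as np
    from netCDF4 import Dataset

    # Hypothetical names and sizes -- not tonic's internals.
    ntimes, nlayers = 8, 3
    f = Dataset('sketch.nc', 'w')
    f.createDimension('time', ntimes)
    f.createDimension('depth', nlayers)
    f.createDimension('y', 10)
    f.createDimension('x', 10)
    soil = f.createVariable('Soil_moisture', 'f8', ('time', 'depth', 'y', 'x'))

    # Per-layer columns, as they would come out of a point's DataFrame.
    layer_cols = [np.random.rand(ntimes) for _ in range(nlayers)]
    y, x = 4, 7  # indices of one grid cell

    # soil[:, :, y, x] = layer_cols[0]  # fails: the target slice is
    #                                   # (ntimes, nlayers), the source is (ntimes,)
    soil[:, :, y, x] = np.column_stack(layer_cols)  # write the whole cell at once
    # (file intentionally left open; the follow-up sketch below closes it)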

I then made some modifications to the code and got it to work for 4-d variables. This is really my first halfway-serious go with Python, so my syntactical understanding is limited and there is plenty of potential for me to have messed up the fix in some fashion.

I ran the code a number of times and it worked fine. It seemed a little slow, but there is a lot of I/O both in and out, so I didn't think much of it. Then I got an email from our supercomputer system administrators stating that my code was performing an excessive number of disk writes to the same location. They reported that the read rates were fine, but the amount of data written was many times the amount read. That makes me think I fixed the code improperly, so that the netCDF writes are occurring an excessive number of times...
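
For example (a schematic continuation of the made-up sketch above, not my actual change), if the fix ended up issuing one netCDF write per value instead of one per grid cell, the number of write calls would be multiplied by roughly ntimes * nlayers:

    # Hypothetical write-amplifying pattern: many tiny writes to the same variable
    for lyr, col in enumerate(layer_cols):
        for t, value in enumerate(col):
            soil[t, lyr, y, x] = value      # one netCDF write call per value

    # versus a single call per grid cell:
    soil[:, :, y, x] = np.column_stack(layer_cols)
    f.close()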

The changes are in the function: nc_add_data_standard. What is the best way for me to post my "fixed" code?

Cheers, Andy

jhamman commented 8 years ago

Sounds like we have two issues here:

  1. Standard mode slice bug: I think we can fix this for the 4-d variables. It sounds like a pretty simple fix that you may have already applied. It would be worth opening a pull request against develop for this.
  2. Your sys admin is right: the standard mode makes a lot of writes. You could try the big_memory mode or the original mode and see if that helps (see the sketch below). big_memory will be the fastest but, as you may glean from its name, it also uses the most memory; it reads and writes each file only once.
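
Roughly speaking, the difference looks like this (a toy illustration with placeholder names and sizes, not the real vic2nc logic -- in vic2nc the chunking is over the VIC output files rather than over a toy grid):

    import numpy as np
    from netCDF4 import Dataset

    ntimes, ny, nx = 100, 4, 5
    f = Dataset('modes.nc', 'w')
    f.createDimension('time', ntimes)
    f.createDimension('y', ny)
    f.createDimension('x', nx)
    var = f.createVariable('prec', 'f8', ('time', 'y', 'x'))

    # standard-like: fill the grid a piece at a time, so the same output
    # file is written to many times
    for j in range(ny):
        for i in range(nx):
            var[:, j, i] = np.random.rand(ntimes)

    # big_memory-like: build the full array in memory and write it once
    var[:] = np.random.rand(ntimes, ny, nx)
    f.close()
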
anewman89 commented 8 years ago

Hi Joe, I've pushed to my fork. It looks like I edited both the 3-d and 4-d output for the nc_add_data_standard function. I can go ahead and issue the pull request.

anewman89 commented 8 years ago

On point 2: Right, the standard mode would write after each chunk is read in. Does it work like this:

  1. Define the netCDF file with the full grid dimensions
  2. Issue write commands to fill the portions of the grid as they are read in. Something like nc_put_vara_* would be used for each variable write.

I would think the total data written would still be roughly equal to the total data read... I was seeing on the order of 10x more data written than read.

jhamman commented 8 years ago

I would think the total data written would still be roughly equal to the total data read... I was seeing on the order of 10x more data written than read.

It probably depends on how you chunk your dataset up.

Issue write commands to fill the portions of the grid as they are read in. Something like nc_put_vara_* would be used for each variable write.

Yes, but the Python API doesn't use that syntax exactly.
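
In netCDF4-python you just assign into a slice of the Variable object, and the library makes the corresponding nc_put_vara-style call to the C library for you. A minimal sketch (placeholder file, variable, and sizes):

    import numpy as np
    from netCDF4 import Dataset

    # Placeholder names and sizes, just to show the write syntax.
    f = Dataset('example.nc', 'w')
    f.createDimension('time', 8)
    f.createDimension('y', 10)
    f.createDimension('x', 10)
    var = f.createVariable('prec', 'f8', ('time', 'y', 'x'))

    series = np.random.rand(8)

    # Slice assignment fills just this portion of the grid; under the hood
    # netCDF4-python translates it into an nc_put_vara-type call, so there
    # is no explicit nc_put_vara_* in the Python code.
    var[:, 4, 7] = series
    f.close()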