Option to delete original model outputs after converting to netcdf

rykelly commented 9 years ago

Per an offline discussion with @mdietze and @ankurdesai, it would be good to have a space-saving option to delete the raw model outputs after they've been converted to netcdf, for example when running thousands of iterations of data assimilation. It seems like each model2netcdf.MODEL function would need to take care of this in its own way, either by actually deleting output files when finished, or by returning a list of those files so that a generic external function could do it.

I can work on this but thought I'd solicit feedback first.
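
A minimal sketch of the second option above (a generic external function that deletes whatever files a converter reports). The name remove_raw_outputs and its arguments are hypothetical and not part of PEcAn; it assumes each model2netcdf.MODEL function could return a character vector of raw output paths.

```r
# Hypothetical helper, not an existing PEcAn function: delete the raw
# output files that a model2netcdf.MODEL call reported, but only on request.
remove_raw_outputs <- function(raw_files, delete = FALSE) {
  # Default is to keep everything, so nothing is lost by accident.
  if (!isTRUE(delete)) {
    return(invisible(character(0)))
  }
  # Only try to remove files that are actually present.
  existing <- raw_files[file.exists(raw_files)]
  # file.remove() returns one logical per file; keep the paths that succeeded.
  removed <- existing[file.remove(existing)]
  invisible(removed)
}

# Usage with a dummy file standing in for sipnet.out
outdir <- tempfile("run_")
dir.create(outdir)
raw <- file.path(outdir, "sipnet.out")
writeLines("dummy output", raw)

kept <- remove_raw_outputs(raw)                    # default: file stays
still_there <- file.exists(raw)
removed <- remove_raw_outputs(raw, delete = TRUE)  # explicit opt-in: file goes
```

Keeping the deletion outside the converter means a model package that wants to retain its raw output (ED2's hdf5 files, say) simply never opts in.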

dlebauer commented 9 years ago

My thought: not only do the files take up space, but writing them consumes both time and resources (bandwidth). Where possible, modelers and model2netcdf.* authors can consider configuring model runs to write only what will be needed, and even adding an option to write directly to netcdf.

Are there any plans for Brown Dog to play a role in the model2netcdf conversions?

mdietze commented 9 years ago

I agree that everything would be more efficient if models just wrote out their output in the PEcAn standard, but there's no way to enforce that and realistically most teams won't do it (case in point, despite using ED2 and SIPNET for years, we've never rewritten their outputs to be in netCDF).

What I told Ryan in discussing this with him is that the simplest way to do this was to just build deleting the original model output into the job.sh script. That said, I think that behavior should be up to the user who implements each model package, not a required behavior. For a model like SIPNET, where the output is simple and similar to the netCDF in content, there's really no information loss in deleting the original, but that's not true for ED2 where there's a ton of site, patch, and cohort level information that's lost if you delete the hdf5 files and only retain the PEcAn netCDF.

Right now there are no plans to use Brown Dog in model2netcdf since it's not a required function anymore. That said, there's nothing keeping individual modeling teams from using Brown Dog if they want to, though it would just exacerbate the bandwidth problem.

ankurdesai commented 9 years ago

Another option to consider is to allow model output to be written to a “ramdisk”, essentially a temporary file system that lives in memory. Ramdisks were all the rage when disk I/O was all floppy-drive based (MS-DOS had many ramdisk options). All I/O would go to RAM, and when the “disk” was unmounted or the system restarted, the contents would disappear. Linux actually makes this really easy: just mount a tmpfs filesystem. Ramdisks went away when intelligent disk caching, fast and large hard drives, and SSDs became more common. But for this kind of frequent, potentially temporary read/write, I wonder if the speed-up and space saving might be worth it?

Linux also does named pipes, which are an easy way to send data between two processes without disk I/O. Named pipes look like files on the file system (they show up in ls within directories, with a “p” as the file type in the permission string), but the data only exists until what process 1 writes is read by process 2, and process 1 is blocked (hangs) in the meantime.

Anyway, future stuff really. For the most part, when I run the VM on my SSD-based laptop, the disk I/O overhead seems minimal. R’s interpreter (or is it a just-in-time compiler?) itself may be the slowpoke. -ankur

Ankur R Desai, Associate Professor, University of Wisconsin - Madison, Atmospheric and Oceanic Sciences, http://flux.aos.wisc.edu, desai@aos.wisc.edu, O: +1-608-520-0305 / M: +1-608-218-4208


rykelly commented 9 years ago

I am going to implement this for SIPNET, for now by modifying model2netcdf.SIPNET(). Will default to keeping output, but have an option for deleting sipnet.out after conversion to netcdf is done.

To do this, I'm going to add an argument remove.raw.outputs to run.write.configs(), which will get copied into the job.sh as an argument to model2netcdf.SIPNET(). So in turn, I'm going to have to add the same argument to all models' write.config.*() functions, just so they don't throw an error when receiving it. At the moment it won't do anything for those other models though.

I just want to check that this sounds OK to everyone before moving forward. This seems convoluted to me, but I don't see a simpler solution.

robkooper commented 9 years ago

Can you add it to the model section of pecan.xml, so we can pass that as an argument to write.configs? Maybe call it <delete.raw>false</delete.raw>, then also modify read_settings.xml to make sure that the default is added to the pecan.xml.

dlebauer commented 9 years ago

Will this have an associated tag in settings, similar to database$bety$write?

Or maybe there should be something more general, to flag runs for archiving vs testing?

rykelly commented 9 years ago

Yeah, was thinking to make this a setting in the .xml—thanks, @robkooper for the reminder to modify the template.

@dlebauer I also like the idea of a more general flag for testing, if there are multiple settings that make sense for testing vs. production.

mdietze commented 9 years ago

I agree with passing this through settings, not a new argument.

Not sure that this choice is the same as testing vs archive -- we're still archiving the model runs here, just not the raw output. I think an argument could be made for making this the default for sipnet, especially if we could go through the sipnet.out to make sure no output variables are dropped in conversion.

rykelly commented 9 years ago

On second thought, @robkooper is there any reason to assign a default of FALSE, rather than just

if(!is.null(settings$model$delete.raw) && settings$model$delete.raw) {
  ...
}

?

robkooper commented 9 years ago

Just so it is clear when somebody reads the pecan.xml file, and they don't have to dig through the code to find out what the default is.

rykelly commented 9 years ago

OK, makes sense. So just have read.settings() assign the default value of FALSE?

robkooper commented 9 years ago

yup, just put it somewhere where the model is being parsed and checked.
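
A rough sketch of what assigning that default could look like. The function name set_delete_raw_default is hypothetical, and the sketch assumes that values read from pecan.xml may arrive as character strings such as "false"; it is not the actual read.settings() code.

```r
# Hypothetical sketch, not the real read.settings() logic: make sure
# settings$model$delete.raw always exists and is a plain TRUE/FALSE.
set_delete_raw_default <- function(settings) {
  if (is.null(settings$model$delete.raw)) {
    # Tag absent from the XML: write the default back so it is visible.
    settings$model$delete.raw <- FALSE
  } else {
    # Assumption: XML values may be strings like "true" or "false";
    # anything that does not parse as TRUE is treated as FALSE.
    settings$model$delete.raw <- isTRUE(as.logical(settings$model$delete.raw))
  }
  settings
}

# Usage: one settings list without the tag, one with a string value
s_missing <- set_delete_raw_default(list(model = list(type = "SIPNET")))
s_string  <- set_delete_raw_default(list(model = list(delete.raw = "true")))
s_false   <- set_delete_raw_default(list(model = list(delete.raw = "false")))
```

With the default normalized up front, downstream code could test `if (settings$model$delete.raw)` directly, without the is.null() guard.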

PecanProject / pecan

Option to delete original model outputs after converting to netcdf #536