Open jhenin opened 4 years ago
A semi-standard format for a 3D image is OpenDX itself (or at least the subset that is used by APBS, VMD, NAMD, Pymol, etc.). Like "multicol" it contains a header and is text-based, which makes it easy to inspect. I guess that the only constraint is not having the ability to encode a "periodic" flag?
Another option is the Matlab format, which is binary but at least is just as widely supported also by open-source code (Octave, SciPy, ...). Because it encodes multiple variables, there is the advantage of putting multiple fields in, e.g. mean gradients alongside histogram, periodic flag etc. It is probably possible to just reuse the open-source implementations.
https://docs.scipy.org/doc/scipy/reference/tutorial/io.html#matlab-files
NumPy itself doesn't seem to support 3D meshes easily. loadtxt/savetxt
are 1D or 2D only, and the native .npy
binary format is then specific to NumPy (good for many, but Matlab hasn't just disappeared yet).
Reading more about the format from the SciPy page and other open source libraries like this one: https://github.com/tbeu/matio It seems that since version 7.3 (released in 2006) the format is HDF5-based. I presume later versions are backward compatible with pre-7.3 files, but that alone makes it tricky to embed a writer or reader for code that should be used for several years :-1:
My original idea was to register the grids currently allocated in the various biases as elements in colvarparams
, so that they can be fetched in scripts using their name. The inspiration for that was the extract()
method of many LAMMPS classes, though I named it differently because the semantics are not the same. Then we'd have a bit more freedom to decide later how to use them in post-processing. Unless there is a tool ready that can easily read those files right after they're written (like GNUplot for 2D multicol), I wouldn't worry about changing the format. Does that sound like an option?
To reduce the space used by high-dimensional grids, I would also suggest in the doc to use outputFreq 0
for the respective bias, thus only write the state file, which is compact enough as it is for a text format. Do you see any scenario where disabling outputFreq
would have unintended consequences?
What about just using an internal format to record the boundaries and widths, and then providing an post-processing tool to convert it to the current grid format?
That's along the lines of what I meant in the previous comments: the state file has all that information encoded. The scripting interface could easily provide access to the grid array, its boundaries, widths and periodic flags associated with it. Pros: leverages existing code. Cons: requires running the module for post-processing, which currently means using an updated VMD and a molecular structure file for the system.
If these requirements are too heavy, a post-processing script could be used as well. But then, the same script would need to replicate some of the internal functionality (particularly, the multicolumn grid I/O code).
@giacomofiorin I think NAMD with an updated colvars should be enough to read the state file and write the grid. Users can run simulation of "0" step with dummy psf and pdb files, only the colvars input and state files are from an actual simulation. It's just like merging ABF windows with inputPrefix
.
Here is a plan:
@HanatoK @fabsugar Would this work for you?
@jhenin I pinged @fabsugar in person and he agrees with switching format for dimensions 3 and above.
One issue I didn't think about immediately when you wrote your last comment is how would you define the behavior for inputPrefix
. Currently multicolumn is expected, but if all output is switched to DX for higher dimensions, then you'd need a DX reader in the C++ code as well, which is undesirable. We'd just end up defining yet another dialect of the DX format.
Are you okay with limiting use of the DX format for output/visualization only and perhaps begin using the state file format for input?
Yes, I agree that we don't want to add a DX parser in there. The state file format seems ok, but we'll need the same flexibility offered by multicol: specifically the option of reading multiple grids, potentially with mismatched parameters, into the same bias.
I checked, in the colvargrid::read_restart()
function there is this condition:
https://github.com/Colvars/colvars/blob/7f850e071b2d34d5a54f872592af064a0841dbe8/src/colvargrid.h#L959-L966
that mimics a previous check first introduced in 2009 (before the start of the Git history), when the state file began containing the grid_parameters
block to support metadynamics grid rebinning.
It's safe to say that nowadays the if ()
condition never evaluates to false, i.e. every instance of a grid object in a state file comes prefixed with a grid_parameters
block.
So I think you can safely adapt the existing logic that read_multicol()
currently has:
https://github.com/Colvars/colvars/blob/7f850e071b2d34d5a54f872592af064a0841dbe8/src/colvargrid_def.h#L85
(To be precise, compared to the multicolumn format the state file lacks the periodicity flag, but that's not used to do the remapping anyway).
BTW @HanatoK I rememberd that there is also this case of inputting data through a grid. I hope it's fine to keep the format as multi-column, otherwise a DX parser would be needed :-(
Multicol files in their current states are great in 1D and 2D because they are read seamlessly by many plotting programs. In 3D they can get quite bulky, and plotting becomes less straightforward anyway. To reduce the size and the time spend reading and writing them, I see two options: using DX files, which have their own constraints, or a variant of multicol that doesn't have the x values (which can be completely generated based on the data in the header). Thoughts?