Smaller variant of multicol files for 3D grids (and beyond)

jhenin commented 4 years ago

Multicol files in their current states are great in 1D and 2D because they are read seamlessly by many plotting programs. In 3D they can get quite bulky, and plotting becomes less straightforward anyway. To reduce the size and the time spend reading and writing them, I see two options: using DX files, which have their own constraints, or a variant of multicol that doesn't have the x values (which can be completely generated based on the data in the header). Thoughts?

giacomofiorin commented 4 years ago

A semi-standard format for a 3D image is OpenDX itself (or at least the subset that is used by APBS, VMD, NAMD, Pymol, etc.). Like "multicol" it contains a header and is text-based, which makes it easy to inspect. I guess that the only constraint is not having the ability to encode a "periodic" flag?

Another option is the Matlab format, which is binary but at least is just as widely supported also by open-source code (Octave, SciPy, ...). Because it encodes multiple variables, there is the advantage of putting multiple fields in, e.g. mean gradients alongside histogram, periodic flag etc. It is probably possible to just reuse the open-source implementations. https://docs.scipy.org/doc/scipy/reference/tutorial/io.html#matlab-files NumPy itself doesn't seem to support 3D meshes easily. loadtxt/savetxt are 1D or 2D only, and the native .npy binary format is then specific to NumPy (good for many, but Matlab hasn't just disappeared yet).

giacomofiorin commented 4 years ago

Reading more about the format from the SciPy page and other open source libraries like this one: https://github.com/tbeu/matio It seems that since version 7.3 (released in 2006) the format is HDF5-based. I presume later versions are backward compatible with pre-7.3 files, but that alone makes it tricky to embed a writer or reader for code that should be used for several years :-1:

My original idea was to register the grids currently allocated in the various biases as elements in colvarparams, so that they can be fetched in scripts using their name. The inspiration for that was the extract() method of many LAMMPS classes, though I named it differently because the semantics are not the same. Then we'd have a bit more freedom to decide later how to use them in post-processing. Unless there is a tool ready that can easily read those files right after they're written (like GNUplot for 2D multicol), I wouldn't worry about changing the format. Does that sound like an option?

To reduce the space used by high-dimensional grids, I would also suggest in the doc to use outputFreq 0 for the respective bias, thus only write the state file, which is compact enough as it is for a text format. Do you see any scenario where disabling outputFreq would have unintended consequences?

HanatoK commented 4 years ago

What about just using an internal format to record the boundaries and widths, and then providing an post-processing tool to convert it to the current grid format?

giacomofiorin commented 4 years ago

That's along the lines of what I meant in the previous comments: the state file has all that information encoded. The scripting interface could easily provide access to the grid array, its boundaries, widths and periodic flags associated with it. Pros: leverages existing code. Cons: requires running the module for post-processing, which currently means using an updated VMD and a molecular structure file for the system.

If these requirements are too heavy, a post-processing script could be used as well. But then, the same script would need to replicate some of the internal functionality (particularly, the multicolumn grid I/O code).

HanatoK commented 4 years ago

@giacomofiorin I think NAMD with an updated colvars should be enough to read the state file and write the grid. Users can run simulation of "0" step with dummy psf and pdb files, only the colvars input and state files are from an actual simulation. It's just like merging ABF windows with inputPrefix.

jhenin commented 2 years ago

Here is a plan:

take the logic currently in the ABF bias and move it into: in dimension higher than 2, write a DX file in addition to multicol.
add a DX parser to colvars_grid.py (see the multimap analysis tools for an example)
remove redundant multicol output, use multicol if dimension <= 2 and DX otherwise.

@HanatoK @fabsugar Would this work for you?

giacomofiorin commented 2 years ago

@jhenin I pinged @fabsugar in person and he agrees with switching format for dimensions 3 and above.

One issue I didn't think about immediately when you wrote your last comment is how would you define the behavior for inputPrefix. Currently multicolumn is expected, but if all output is switched to DX for higher dimensions, then you'd need a DX reader in the C++ code as well, which is undesirable. We'd just end up defining yet another dialect of the DX format.

Are you okay with limiting use of the DX format for output/visualization only and perhaps begin using the state file format for input?

jhenin commented 2 years ago

Yes, I agree that we don't want to add a DX parser in there. The state file format seems ok, but we'll need the same flexibility offered by multicol: specifically the option of reading multiple grids, potentially with mismatched parameters, into the same bias.

giacomofiorin commented 2 years ago

I checked, in the colvargrid::read_restart() function there is this condition: https://github.com/Colvars/colvars/blob/7f850e071b2d34d5a54f872592af064a0841dbe8/src/colvargrid.h#L959-L966 that mimics a previous check first introduced in 2009 (before the start of the Git history), when the state file began containing the grid_parameters block to support metadynamics grid rebinning.

It's safe to say that nowadays the if () condition never evaluates to false, i.e. every instance of a grid object in a state file comes prefixed with a grid_parameters block.

So I think you can safely adapt the existing logic that read_multicol() currently has: https://github.com/Colvars/colvars/blob/7f850e071b2d34d5a54f872592af064a0841dbe8/src/colvargrid_def.h#L85 (To be precise, compared to the multicolumn format the state file lacks the periodicity flag, but that's not used to do the remapping anyway).

giacomofiorin commented 2 years ago

BTW @HanatoK I rememberd that there is also this case of inputting data through a grid. I hope it's fine to keep the format as multi-column, otherwise a DX parser would be needed :-(

Colvars / colvars

Smaller variant of multicol files for 3D grids (and beyond) #348