MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.31k stars 650 forks source link

Limitations in the support of systems with periodic boundary conditions #4204

Open gpetretto opened 1 year ago

gpetretto commented 1 year ago

Dear developers, we are considering to use MDAnalysis as a tool to perform analysis on MD trajectories produced mainly with DFT codes (e.g. VASP, abinit), which use periodic boundary conditions. After digging in the existing parsers of the coordinates we are wondering if it would be easy for us to successfully use MDAnalysis in our case. In fact, the choice of storing the coordinates in cartesian format and the lattice with the dimensions attribute (i.e. lengths and angles) could easily lead to some problems if all the options are not properly taken into account.

In fact, this issue also serves as a bug report for the parsers that extract the lattice from a 3x3 matrix, without highlighting that the lattice should adhere to the MDAnalysis internal convention (i.e. the lattice should be represented by a lower diagonal matrix). Admittedly, I may have missed such a warning, since the (detailed and well written) documentation is pretty large.

As an example, consider the following input for FHI-Aims:

# comment line
lattice_vector 0.00000 2.71500 2.71500
lattice_vector 2.71500 0.00000 2.71500
lattice_vector 2.71500 2.71500 0.00000
atom_frac 0.00000 0.00000 0.00000 Si
atom_frac 0.75000 0.75000 0.75000 Si

Running the following code, parsing with the FHIAIMSReader:

import MDAnalysis as mda
from MDAnalysis.analysis import distances

u = mda.Universe("geometry.in", "geometry.in")
ts = u.trajectory.ts
distances.distance_array(ts.positions, ts.positions, box=ts.dimensions)

the output is

array([[0.        , 1.03126487],
       [1.03126487, 0.        ]])

which is incorrect. This is due to the fact that the periodic distance is calculated using a lattice which is different from the one used to generate the coordinates.

Using instead the following input geometry, that is equivalent to the previous one, but with a different orientation of the lattice vectors:

# comment line
lattice_vector 3.8395895957946777 0.0000000000000000 0.0000000000000000
lattice_vector 1.9197947978973393 3.3251821300646154 0.0000000000000000
lattice_vector 1.9197947978973393 1.1083940433548720 3.1350117771320236
atom_frac 0.00000 0.00000 0.00000 Si
atom_frac 0.75000 0.75000 0.75000 Si

the code above gives the correct result:

array([[0.        , 2.35125862],
       [2.35125862, 0.        ]])

I am less familiar with other input formats supported by MDAnalysis that describe the lattice in the form of a 3x3 matrix, but I suspect that similar issues might happen if the box is transformed using the triclinic_box() function, without checking if the coordinates were defined in a consistent way.

I would also be available to contribute parsers for our outputs that will properly consider this point, but it would require additional manipulation of the data, which might slow down the parsing. In addition, I am afraid that some other issues, that I cannot foresee now, might arise at a later stage.

In general, I believe that a more robust way of handling the systems with PBC would be to allow to store the triclinic_dimensions instead of the dimensions in the Timestep, but I imagine that you had good reasons for choosing this representation. And in any case I expect that such a change would require a considerable effort.

I have also seen that some analysis tools require the cell to be unwrapped for the analysis to be performed correctly, but no such tool is directly available in MDAnalysis (e.g. MSD).

May I ask to comment on the previous considerations and let us know if there is any plan to improve the support for systems with PBC? Also, we can contribute to these improvements. If you are interested, can you mention which kind of changes you would accept in this direction? (e.g. a transformation to unwrap trajectories, a better handling of the lattice box, ...)

Thanks a lot for your reading, input and support!

richardjgowers commented 1 year ago

Hi Guido

Thanks for raising an interesting question. A few quick thoughts...

You're right that we assume a convention that the lower left of a 3x3 matrix is used to describe the box (when a 3x3 matrix is used). This is converted into the [lx, ly, lz, alpha, beta, gamma] representation of three lengths and three angles as the canonical format. The code in question is here: https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/lib/mdamath.py#L246

I think that we should have a) probably thrown an error when a 3x3 matrix not matching our assumptions was provided and b) been able to convert your first 3x3 matrix into the second one you defined (so that things "just worked").

Regarding your last point about fixing this, I think this is the line to start at:

https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/coordinates/FHIAIMS.py#L182

This is calling the function which assumes the lower left convention, if this was replaced by a function which didn't assume this then I think we might then support PBC given in that format.

Thanks Richard

On Tue, 18 Jul 2023 at 09:26, Guido Petretto @.***> wrote:

Dear developers, we are considering to use MDAnalysis as a tool to perform analysis on MD trajectories produced mainly with DFT codes (e.g. VASP, abinit), which use periodic boundary conditions. After digging in the existing parsers of the coordinates we are wondering if it would be easy for us to successfully use MDAnalysis in our case. In fact, the choice of storing the coordinates in cartesian format and the lattice with the dimensions attribute (i.e. lengths and angles) could easily lead to some problems if all the options are not properly taken into account.

In fact, this issue also serves as a bug report for the parsers that extract the lattice from a 3x3 matrix, without highlighting that the lattice should adhere to the MDAnalysis internal convention (i.e. the lattice should be represented by a lower diagonal matrix). Admittedly, I may have missed such a warning, since the (detailed and well written) documentation is pretty large.

As an example, consider the following input for FHI-Aims:

comment line

lattice_vector 0.00000 2.71500 2.71500 lattice_vector 2.71500 0.00000 2.71500 lattice_vector 2.71500 2.71500 0.00000 atom_frac 0.00000 0.00000 0.00000 Si atom_frac 0.75000 0.75000 0.75000 Si

Running the following code, parsing with the FHIAIMSReader:

import MDAnalysis as mdafrom MDAnalysis.analysis import distances u = mda.Universe("geometry.in", "geometry.in")ts = u.trajectory.tsdistances.distance_array(ts.positions, ts.positions, box=ts.dimensions)

the output is

array([[0. , 1.03126487], [1.03126487, 0. ]])

which is incorrect. This is due to the fact that the periodic distance is calculated using a lattice which is different from the one used to generate the coordinates.

Using instead the following input geometry, that is equivalent to the previous one, but with a different orientation of the lattice vectors:

comment line

lattice_vector 3.8395895957946777 0.0000000000000000 0.0000000000000000 lattice_vector 1.9197947978973393 3.3251821300646154 0.0000000000000000 lattice_vector 1.9197947978973393 1.1083940433548720 3.1350117771320236 atom_frac 0.00000 0.00000 0.00000 Si atom_frac 0.75000 0.75000 0.75000 Si

the code above gives the correct result:

array([[0. , 2.35125862], [2.35125862, 0. ]])

I am less familiar with other input formats supported by MDAnalysis that describe the lattice in the form of a 3x3 matrix, but I suspect that similar issues might happen if the box is transformed using the triclinic_box() function, without checking if the coordinates were defined in a consistent way.

I would also be available to contribute parsers for our outputs that will properly consider this point, but it would require additional manipulation of the data, which might slow down the parsing. In addition, I am afraid that some other issues, that I cannot foresee now, might arise at a later stage.

In general, I believe that a more robust way of handling the systems with PBC would be to allow to store the triclinic_dimensions instead of the dimensions in the Timestep, but I imagine that you had good reasons for choosing this representation. And in any case I expect that such a change would require a considerable effort.

I have also seen that some analysis tools require the cell to be unwrapped for the analysis to be performed correctly, but no such tool is directly available in MDAnalysis (e.g. MSD https://docs.mdanalysis.org/2.0.0/documentation_pages/analysis/msd.html ).

May I ask to comment on the previous considerations and let us know if there is any plan to improve the support for systems with PBC? Also, we can contribute to these improvements. If you are interested, can you mention which kind of changes you would accept in this direction? (e.g. a transformation to unwrap trajectories, a better handling of the lattice box, ...)

Thanks a lot for your reading, input and support!

— Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/4204, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB6BPWK6NX46GABV4MLXQZCCBANCNFSM6AAAAAA2OA4EQY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

orbeckst commented 1 year ago

In general, I believe that a more robust way of handling the systems with PBC would be to allow to store the triclinic_dimensions instead of the dimensions in the Timestep, but I imagine that you had good reasons for choosing this representation. And in any case I expect that such a change would require a considerable effort.

I wholeheartedly agree with you. The current [Lx, Ly, Lz, alpha, beta, gamma] representation is just a historical artifact that isn't easy to change without breaking a lot of code.

We proposed a project on General Unit Cell Representation for a GSoC but we didn't get any good candidates. But the project proposal shows you that we're generally open to the idea to work on this problem and would welcome a sustained effort in this area.

orbeckst commented 1 year ago

I am adding the defect label because the FHI-AIMS reader should at least be failing if an unsupported box is provided. Code that silently produces wrong results is bad.

gpetretto commented 1 year ago

Thanks for sharing your thoughts on the topic and linking the GSoC page. I don't think we will have the resources to undertake a large refactoring of the code. However, if there is interest from your side, I will propose a possible way forward. I only went through some parts of the code, so what I propose might be problematic or not be possible at all.

At the moment, if Timestep.triclinic_dimensions is set, this is converted to dimensions internally through a setter. An option would be to change the internal _unitcell attribute to be the 3x3 matrix, so that when triclinic_dimensions is set the whole matrix is properly stored and it is the dimensions attribute that is generated when accessed. In principle this should have limited impact on the code, unless there are functions that access the triclinic_dimensions property, but still assume that the cell has the standard orientation. If such a change is possible, code that rely on explicitly passing the box (e.g. distances.distance_array) could be upgraded, progressively and when needed, to allow the box argument to be a 6x1 vector and a 3x3 matrix. Over time one can consider whether to deprecate the use of the 6x1 vector as an input.

Do you think this would break some part of the code? Will the tests allow to easily identify issues?
If you think this might be a viable option, requiring a limited amount of refactoring to start the changes, I can open a PR implementing it.

As a side note, we have also realized that one more limitation of the current approach is that it could easily lead to problems in the case of simulations with variable cell.

gpetretto commented 1 year ago

Following up on this discussion, I have tried to implement what mentioned in the previous message to check its feasibility. Briefly, I reversed the role of Timestep.triclinic_dimensions and Timestep.dimensions. The former is stored internally and the latter is obtained converting the former on the fly.

Unfortunately, more tests than I had expected started failing. The main reason is the fact that the dimensions are often modified in place. I was not expecting this, since the dimensions are stored in an internal variable and only exposed through a property, but I imagine that this could be relevant to achieve better performances. I started to adapt some portions of the code to work properly with triclinic_dimensions, but it is not always trivial and I expect that some changes might impact other people's work. I also expect most of the existing code to be based on dimensions rather than triclinic_dimensions, so the additional conversion performed when accessing dimensions might have some impact on the performances.

I was hoping to get a minimal working implementation to be reviewed, but based on what I have seen and the expected amount of work required I would rather get some feedback on this option before proceeding any further. If it could be helpful I can open a WIP PR to show what I tried to change.

richardjgowers commented 1 year ago

@gpetretto I think we could store the dimensions as any 3x3 matrix internally. What we would need to preserve would be that .dimensions and .triclinic_dimensions are returning a 6x1 and 3x3 lower diagonal representation, since all the distance functions are expecting that variant. We could then add an additional property raw_triclinic_vectors(? better names welcome) which would return the non-lower diagonal variant.

In terms of setters, I think we can allow setting from all three representations without too much headache.

Feel free to put up a WIP PR

orbeckst commented 1 year ago

Just to reinforce what @richardjgowers said: put up a PR, it's a lot easier to talk about concrete code than abstract code ;-).