Adding support for bonds

ejmeitz commented 9 months ago

Just wanted to start a discussion about potential ways to store bond information in the current AtomsBase object as well as how trajectories could be standardized.

Bonds could just be added as an optional keyword with format up to the user or the format could be dictated. Either way there should be the option to include this data I think. ~- Trajectories I'm a little less certain how to include as they are dynamic and AtomsBase fundamentally is not. That said a lot of analysis functions act on atomic trajectories not systems and it could be nice to unify trajectory formats as well.~

cortner commented 9 months ago

maybe those are two different discussions? Start a separate thread for the trajectories? (btw, I only ever needed to treat trajectories as a list e.g. Vector or structures.)

cortner commented 9 months ago

Re bonds : that's usually information stored in a neighbourlist. Are you asking how to cache the neighbourlist? I see no barrier to doing this right now.

rkurchin commented 9 months ago

@cortner I think the question is about adding functions to the interface for specifying bonds, which could be useful in a number of interoperability scenarios including visualization, but also for things like constructing molecular graphs for ML applications, etc.

rkurchin commented 9 months ago

Also, a neighbor list is a bit different from bonds, since one has to do with spatial proximity and the other with specific chemical interactions (and, in the molecular simulation context, often geometric constraints).

ejmeitz commented 9 months ago

Yes I'm more after adding function to the interface to standardize bonds in AtomsBase.

cortner commented 9 months ago

To my mind this sort of interface necessarily involves neighbourlists. I thought it is the responsibility of the neighbourlist to provide lists of bonds - ideally lazily since anything beyond pair bonds would scale very badly. But if that's not how people commonly think about it, that's fine of course.

The op talks about "potential ways to store bond information" hence I asked about caching. If one wants to repeatedly access the list of bonds and if that list is provided via a neighbourlist, then I think this is related.

jgreener64 commented 9 months ago

In molecular mechanics terms the bonds are defined during setup and fixed throughout the whole simulation. The question is whether we should represent that information (a graph, effectively) in AtomsBase.

The neighbour list calculates close pairs from all possible atom pairs and is used to calculate non-bonded interactions (Lennard-Jones and Coulomb). There is the complexity that pre-defined bonds are important in this context, since atoms linked by one or two bonds typically have their non-bonded interactions excluded and atoms linked by three bonds typically have their non-bonded interactions down-weighted (huge hacks).

cortner commented 9 months ago

In molecular mechanics terms the bonds are defined during setup and fixed throughout the whole simulation.

Thank you for clarifying this for me. If that's the concrete context, I still think this is something to be calculated in a separate object and then stored somewhere. But why in a structure and not in the calculator? I can see arguments for both. Anyhow, any object can already be stored? The calculator then needs to know how to find it.

If there really is a need for an interface (which I don't see yet) then I think this is actually not unrelated to #84 and #86 --- i.e. providing interfaces for calculating things.

tjjarvinen commented 9 months ago

The "standard" way of doing molecular mechanics is that you give an initial structure. Then you setup topology for the system. This "topology" meaning you give bonding information for the system. These bonds are not broken during the simulation and (usually) harmonic forces are setup for the bonds.

So, the molecular mechanical force field is two parts. One that comes form "topology" and dispersion/Coulomb part that is added with the help of pairlist calculation.

The question here seems to be: should we have a some standard way of implementing topology for the system.

I would say that separate structure might be a way to go. The issue I see here is that most file formats don't support topology information and if we have it in System-structures it could make saving/loading difficult. But we could make an abstract structure AbstractSystemTopology that would hold additionally topology information, but would otherwise work like AbstractSystem.

rkurchin commented 8 months ago

Chatted with @ejmeitz a bit more about this today and we may put together a PR for a function to return bond information. A few questions for discussion and our preliminary answers:

Does this belong in AtomsBase or in another package? I think it's a minimal enough feature that has a trivial implementation for the case where no bonds are defined which we could make the default, requiring nothing additional from <:AbstractSystem objects that don't store this information. My vote would be to put it here, at least for now.
What should the format of the returned data be? We see basically two options here -- an adjacency matrix or a list of bonds (e.g. pairs of atom indices). We lean towards a bond list because it requires minimal memory, is trivial to get the adjacency matrix from, and (I understand) is the format that codes that actually use this information typically work with anyway.
What should the name of the function be? Presumably just bonds?

Ethan's primary interest right now is in using this for visualization purposes, but I could imagine it being useful for building graphs for ML things, defining potentials for Molly in other packages, etc. @jgreener64, do you have any opinions on this?

More as a note to self, things to add in PR once these decisions are made:

obviously the actual function signature and a default implementation (empty list if list or N x N square matrix of zeros if adjacency matrix, where N is length(sys))
a new example system type (BondedSystem or similar?) to demonstrate actual implementation
some tests for both cases that don't have bonds defined and the new one that does
(not for this specific PR necessarily, but a thing to be aware of) We should also add support for this to any file I/O packages as well e.g. AtomsIO

If nobody has strong opinions (or at least nobody is vehemently opposed to this existing 🤪), we'll probably just make a prototype in a PR sometime soon and move further discussion there.

ejmeitz commented 8 months ago

I was also wondering if it was important to add some way to differentiate between types of bonds (at a minimum double/triple bonds).

jgreener64 commented 8 months ago

No strong opinions on this, I guess it could be nice in some situations, it could go in AtomsBase.jl provided that sensible defaults are defined so existing systems don't need to change. bonds sounds okay.

List of bonds seems best since an adjacency matrix would want to be sparse to avoid N^2 scaling, and sparse matrices are basically lists of indices anyway. Maybe [(i1, j1, bond_order1), (i2, j2, bond_order2), ...] or something similar.

cortner commented 8 months ago

Despite resistance to my comments above - I still think a list of bonds and a neighbourlist are very closely related objects (connectivity). It would be great to give some thought to a general interface. This shouldn't prevent experimentation of course.

Regarding the questions above:

Does this belong in AtomsBase or in another package?

I think this is all about structure, so AtomsBase is fine for me.

What should the format of the returned data be?

an iterator I think. One iterator for pairs, one for triplets and so forth. I'm thinking of

compute_bonds!(structure, ...)
for (i, j, ...) in pairs(structure)
    # do something
end
for (i, j1, j2, ...) in nclusters{3}(structure, ...)
    # do something else
end

Of course they can always be collected and there could be convenience wrappers for that, which can also be overloaded.

But with iterators you don't presume how the list is stored.

What should the name of the function be?

This is not clear to me and highly application specific. I think some thought should go into designing this. see above.

bonds

the problem with a single call to all possible bond types is that it will necessarily be type-unstable.

cortner commented 8 months ago

new example system type

Why can't this be stored in the meta-data of the existing systems? If there are no bonds stored in the meta-data then the call to bonds or similar will just fail.

cortner commented 8 months ago

PLEASE PLEASE do not re-implement the neighbourlist for the 5th time? Can we rather talk about integrating one or all of the existing ones?

rkurchin commented 8 months ago

I agree that bonds and neighbor lists are related conceptually, but I think the way they are used in actual simulation (and also in visualization, which is the particular use case Ethan is working to implement here) is quite different. As a few examples (which largely summarize comments already made earlier in this thread):

in molecular simulation, neighbor lists are mainly used for computational efficiency (to know with which other particles interactions need to be calculated) and will typically be updated every so many timesteps during a simulation, while bond topology is typically (though not always) fixed
perhaps more importantly, neighbor lists are computed based on distances/coordinates (and in fact could be computed just given a structure that didn't happen to include them), while bond lists are a property of a structure that one wouldn't in general be able to accurately infer (they are also often used to impose constraints on separation distances, either rigidly or by special pairwise interactions that only apply to bonded particles)

We could certainly store either piece of information in metadata (and I think it's a good idea to include that as an example and/or in the tests, since the interface is of course agnostic to where/how the information is actually stored in the object), but I think it's an important enough piece of information in certain contexts that having its own function (and hence standardized format) to access it is justified. @ejmeitz, maybe you could point folks to your visualization stuff to make clear why this would be nice at least in this particular use case?

rkurchin commented 8 months ago

Why can't this be stored in the meta-data of the existing systems? If there are no bonds stored in the meta-data then the call to bonds or similar will just fail.

Not if we have a sensible fallback default implementation that just says "there are no bonds" (as described in my earlier comment).

rkurchin commented 8 months ago

the problem with a single call to all possible bond types is that it will necessarily be type-unstable.

That is a fair point. If we do want to support designating bond types (as opposed to just the presence of a bond via a pair of indices), we should talk about how to deal with that. One option is of course to set up an initial implementation that doesn't support this and punt the discussion to later...or just a limited specification like what Joe suggested above (i.e. just a number that indicates bond order).

cortner commented 8 months ago

I agree that bonds and neighbor lists are related conceptually, but I think the way they are used in actual simulation (and also in visualization, which is the particular use case Ethan is working to implement here) is quite different.

That doesn't change the fact that the fundamental "thing" that is a list of bonds remains the same in those use-cases. It's only how they are used that differs. I therefore disagree that they are different objects and should be treated differently.

cortner commented 8 months ago

The justification for a separate BondedSystem type is probably that you are then signalling that the bonds remain fixed. Is that the intention? This makes sense to me.

But to my mind, a BondedSytem is then just a pair BondedSystem(System, BondList).

Sorry if I missed this in the long thread above.

ejmeitz commented 8 months ago

The use case I personally need this for are for generic visualization (e.g. rendered image of balls/sticks) of an AtomsBase system where I need to know which atoms should have a connection between them in the image.
As for the type stability of a bond representation I think Vector{Tuple{Integer, Integer, Symbol}} is a type stable way to store this data
- Iterators would be nice, but that it is rather trivial to extract the set of double/triple/etc. bonds from this Vector with findall at the cost of doing this calculation once.
As for neighbor lists, this is something that is really only used in a molecular dynamics simulation and is not really a property of an atomic system whereas bonds are. They way I see it the functionality to calculate neighbor lists belongs in a separate library like Molly or CellListMap where they can define a function neighbors(as::AbstractSystem) for use internally (like in Molly) or to export as a functionality the end user can take advantage of. Furthermore, I don't really understands the argument that implementing bonds like this would be a 5th reimplementation of neighbor lists. At best the bond list is a subset of the information contained in a neighbor list.

cortner commented 8 months ago

stability of a bond representation

then you have only pairs. Sure, that's fine. Except don't use Symbol, that's type-stable not isbits.

cortner commented 8 months ago

As for neighbor lists, this is something that is really only used in a molecular dynamics simulation and is not really a property of an atomic system whereas bonds are. They way I see it the functionality to calculate neighbor lists belongs in a separate library

We are not talking about writing new functionality, but about how to store such a list.

A list of pairs is exactly the default output from a neighbourlist.

cortner commented 8 months ago

I'm not trying to be contrary for fun here and as soon as I see a convincing argument that a list of bonded pairs is fundamentally something different from a neighbourlist, I'll back off. So far I don't see it.

ejmeitz commented 8 months ago

We can use an enum for the type of the bond then to make it isbits.

What bonds aren't between pairs of atoms? Like if I have a methane molecule that information is almost always stored as 4 separate bonds in MD packages.

Yeah a list of pairs is the default neighbor list but not all neighbors in the neighbor list are literally bonded to each other. They're just near each other spatially. A bond list is a more specific piece of information than a neighbor list.

jgreener64 commented 8 months ago

They are the same (bar extra data) in that they represent connectivity, but different in that a bonds function and a neighbors function would return different things on the same system. To me the interface is defined by the documentation of what is returned, rather than how it is stored internally.

The key difference when implementing is that bonds are computed once at the start whereas neighbours are re-computed during simulation. This puts more restraints on the internal details for the neighbour list since the data structures have to be updated whilst considering performance.

It would be nice to have a unified interface for accessing pairs of connected entities in molecular objects, which would encompass both. It will be harder to get consensus for neighbour lists though, since a performant interface may put requirements on the internal storage.

cortner commented 8 months ago

To me the interface is defined by the documentation of what is returned, rather than how it is stored internally.

yes, exactly.

It would be nice to have a unified interface for accessing pairs of connected entities in molecular objects, which would encompass both

that's what I'm trying to achieve

It will be harder to get consensus for neighbour lists though, since a performant interface may put requirements on the internal storage

I do not want consistent use of internatl storage, everybody should do what they want. Hence the suggestion to use an iterator interface. But that's a weak suggestion and should be thought about more carefully.

cortner commented 8 months ago

Generally speaking, AtomsBase is about interfaces, not about concrete implementations. In order to not waste more time here, how about @ejmeitz implements whatever works best in your visualisation package and once this is usable we return to the discussion what the generic AtomsBase interface for this ought to look like?

cortner commented 8 months ago

One last comment from me:

.. would return different things on the same system.

First, I'm not sure it really matters. It's just two different neighbour lists for the same system. Secondly, there are also dynamic neighbourlists that are adaptive to the various properties of the system such as chemical species.

rkurchin commented 8 months ago

One question I haven't seen a compelling answer to as yet (and I think is important to think about to guard against premature optimization) is: what is the actual use case of a generic neighbor list interface? The only use case of neighbor lists I'm aware of is internal to an MD simulation, where it doesn't need to be passed around to other codes anyway. I suppose in constructing graphs for ML applications one might make use of neighbor lists but given that a neighbor list is not an entirely unambiguous quantity anyway (i.e. there are "hyperparameters" such as cutoff distance, whether to consider PBC's, etc.), they would likely be manually recomputed in such a situation anyway. Put another way, I think defining a set of bonds is both:

1) less ambiguous than definition of neighbors (not 100% unambiguous of course, but nothing really is), and 2) not in general possible from just knowing the set of coordinates (whereas one can (re)construct a neighbor list from this information)

These two points, to me, justify bonds being considered "special" in this sense.

As I said before, I do agree that these are similar objects in the sense of the shape of the data, which I think is @cortner's core point. In a certain sense, I think we've been talking past each other a bit from a terminology perspective, because I think Christoph is thinking about these things purely from a data structures perspective, whereas some of us are also incorporating our physical understanding of what these ideas actually mean (e.g. sharing electrons vs. just being near each other in space). Obviously, in the sense that these are both "lists of pairs of atoms," that distinction perhaps shouldn't matter, but my feeling is still that what they're actually used for in practice (i.e. what we do with that data) is sufficiently different as to consider them to be different things. Of course, if we do decide that we want the interface to support both, I do agree that it should function similarly for both kinds of information. I'm just not currently convinced that AtomsBase needs a function for neighbor lists at all.

Sorry, this one ended up a bit longer than I imagined. 😅

cortner commented 8 months ago

I'm just not currently convinced that AtomsBase needs a function for neighbor lists at all.

sure - there is no rush.

what is the actual use case of a generic neighbor list interface?

Swapping out different nlist implementations for different simulation scenarios such as large-scale, small-scale, MIC, no MIC, different architectures, performance, and so forth.

purely from a data structures perspective

no, in fact the point is there can be many different data structures to represent neighours / topology / bonds / whatever you want to call them. I am saying that conceptually they are the same objects.

I don't think we've been talking past each other. I am just challenging some of the assumptions made in this thread, which Teemu educated me is based on the fact that I'm the only non-Chemist in this room.

JuliaMolSim / AtomsBase.jl

Adding support for bonds #87