Luthaf / rascaline

Computing representations for atomistic machine learning
https://luthaf.fr/rascaline/
BSD 3-Clause "New" or "Revised" License
44 stars 13 forks source link

bond environments: bond-atom neighbouring #240

Open liam-o-marsh opened 11 months ago

liam-o-marsh commented 11 months ago

the changes I talked about, split into three commits that can be reviewed separately

There's just one point where I know feedback is needed, it's for the new API for the systems. (specifically, the huge incoherence between pairs_per_species and triplets_per_species, that can be seen in the comments of the two methods)


:books: Documentation preview :books:: https://rascaline--240.org.readthedocs.build/en/240/

Luthaf commented 11 months ago

Thank you for the PR!

These changes are quite big, so could you actually make one PR with only the first commit, so we can discuss it & improve on it, and then later add the other commits in separate PRs?

I see a lot of changes to the C API, which I would prefer to keep minimal, since any additional function there impose requirements to new system implementations.

Could you explain what the batriplet is and why it is necessary to have this information in the C API? From a very quick look, It seems that this information could be rebuilt inside the calculator that needs it using only the neighbors list.

liam-o-marsh commented 11 months ago

sure! "BAtriplet"s are short for Bond/Atom triplets. They contain three atoms, two grouped in a bond (that acts as a center), and a third atom which acts as a neighbour.

I'm not sure what alternatives there would be to using this object explicitely.

If I were to create a new system with "ghost atoms" to serve as bond centers, I could represent the old system's triplets with the new system's pairs? But a lot of the pairs would end up being bond/bond or atom/atom, and would go to waste. Unless I deliberately break compute_pairs to not compute all of them. Plus, I need the orientation of the bond, and I don't think I can fit that info in the ghost atoms.

I guess one way to bypass changing the system trait would be to have calculators::bondatom_neighbor_list::BANeighborList call directly systems::neighbors::TripletNeighborsList but I'm not sure how many things I would have to reroute and how rascaline-idiomatic it is to do so.

Luthaf commented 11 months ago

I'm not sure what alternatives there would be to using this object explicitely.

Inside SphericalExpansionForBonds or the other calculator, you could get two neighbors list, one with the bond cutoff and one with the neighbors cutoff:

system.compute_neighbors(bond_cutoff);
bonds = system.pairs();

// atoms_cutoff needs to be a bit bigger than the one in the current 
// implementation to be sure we get the same set of neighbors.
system.compute_neighbors(atoms_cutoff); 

let mut ba_triplets = vec![];
for bond in bonds {
    let mid_point = positions[bond.first] + bond.vector / 2;

    for pair in system.pairs_containing(bond.first) {
        if distance_from_mid_point_is_below_cutoff {
            ba_triplets.push(bond, other_atom_in_the_pair);
        }
    }

    for pair in system.pairs_containing(bond.second) {
        if distance_from_mid_point_is_below_cutoff {
            ba_triplets.push(bond, other_atom_in_the_pair);
        }
    }
}

This way, no changes to the system are required, and you can still get the data you need.


I guess one way to bypass changing the system trait would be to have calculators::bondatom_neighbor_list::BANeighborList call directly systems::neighbors::TripletNeighborsList but I'm not sure how many things I would have to reroute and how rascaline-idiomatic it is to do so.

This is not possible, because the System trait is here to abstract over system providers, meaning the system you get as input to your calculator are not necessarily systems::SimpleSystem. They could be something provided in rascaline-torch: https://github.com/Luthaf/rascaline/blob/55ad1f14363812639c865801a6d517d8c5f363f6/rascaline-torch/include/rascaline/torch/system.hpp#L19-L24

or even a Python implementation, where compute_neighbors will call pure Python code: https://github.com/Luthaf/rascaline/blob/55ad1f14363812639c865801a6d517d8c5f363f6/python/rascaline/rascaline/systems/ase.py#L18

This is also why I am reluctant to add new functions to the System API, since that mean implementing this function everywhere, and while there is a neighbor list that can be adapted relatively easily in ASE/LAMMPS/…, nothing similar exists for bond-atom triplets. Computing the bond-atom triplets from two pairs list would solve this.


EDIT: sorry, clicked the wrong button.

liam-o-marsh commented 11 months ago

oh, I see, thanks! Now that I managed to work on this a little, I found another problem: No matter how i slice it, the calculators I create need to have access to this list of triplets in their .keys(), .samples(), .positions_gradient_samples() and .compute(). And for now, the only possibilities I see are really not satisfactory:

If it is true that none of those ideas would cut it, how do you think I should proceed?

Thanks, and sorry for the late reply.

Luthaf commented 11 months ago

Another solution would be to allow the calculator to store data in the system, but I don't love the implications of this either.

I'll think about this over the weekend, and then maybe we can have a chat to find the best way forward!

liam-o-marsh commented 11 months ago

sure! have a nice weekend, and see you then!

liam-o-marsh commented 11 months ago

A thing I thought about for this would be the possibility to add other methods to BaseCalculator that would be called with a &mut [Box<dyn System>], but also caches: &mut [BTreeMap<String,Any>]. and the default version of these new methods (implemented in trait) would fall back to keys,samples,compute,etc...

not exactly great either (and it feels way jankier than just allowing systems to carry a cache around), but probably less disruptive than changing the System trait.

Luthaf commented 11 months ago

Yes, this kinda what I landed on over the weekend. Replace the input of compute from &mut [Box<dyn System>] to &mut [SystemWithData] (name to be improved) where

struct SystemWithData {
    system: Box<dyn System>,
    data: BTreeMap<String,Any>
}

I think this is the cleanest solution here

liam-o-marsh commented 11 months ago

that does sound way better than what I suggested. Thank you!

Luthaf commented 11 months ago

I've implemented this in the system-with-data branch (the struct is called System, the trait is now SystemBase). Could you try to use it and let me know if this works for you?

For now, I allowed the data to be Any, but I'm also thinking to restrict it to TensorBlock instead (since we will have a similar mechanism in metatensor for atomistic simulations). Here as well, let me know if a TensorBlock would be enough for you or if you need the extra flexibility of Any!

liam-o-marsh commented 11 months ago

ah, should have replied earlier. thanks, I'll try to rebase my changes on top of that, sometimes next week. (found a thorny math bug in some of my own code, it will take a while to actually pinpoint)

liam-o-marsh commented 10 months ago

and that should be it (…just note that it now has the changes from the systems_with_data branch, so I had to switch the target branch of the PR)

liam-o-marsh commented 10 months ago

I'll be using a PR within m-stack-org to make sure it passes CI tests. (which catches more errors than cargo test and cargo bench)

liam-o-marsh commented 10 months ago

and most of your remarks should be taken into account (notable exception: the ones that were about "TODO"s left in the code)

liam-o-marsh commented 9 months ago

question: are there specific things I need to do here before it gets merged?

EDIT: let me merge to take the clebsch-gordan restructure into account

Luthaf commented 9 months ago

Sorry for the delay, I was deep in another part of development which is now reaching its end; so I should have more time for rascaline work.

There is not much you need to do for now, I have to do another full round of review on this & finish + merge #260.


One point I wanted to discuss with you was how we could expose the fact that this is a very new, still experimental representation that might be changed if we realise it needs to do some things differently. My idea here was to have the code in the main library and make sure it continues working & passing tests, but force the user to opt-in and acknowledge this is experimental code if they want to use it.

An environment variable would be a good way to do this:

export RASCALINE_EXPERIMENTAL_BOND_ATOM_SPX=1
python script.py

or

import os
os.environ["RASCALINE_EXPERIMENTAL_BOND_ATOM_SPX"] = "1"

from rascaline import BondAtomSphericalExpansion

What do you think?

Another point about this is when will the code move out of this experimental state and always be available. I'm thinking having one published paper with it + the main author (you in this case, but we will use similar mechanism for other new representation later) agreeing that the code is ready for general use.

liam-o-marsh commented 9 months ago

Sorry for the delay, I was deep in another part of development which is now reaching its end; so I should have more time for rascaline work. There is not much you need to do for now, I have to do another full round of review on this & finish + merge #260.

OK! To be honest, I was also busy, with my candidacy exam.

export RASCALINE_EXPERIMENTAL_BOND_ATOM_SPX=1

Sure!

Another point about this is when will the code move out of this experimental state and always be available. I'm thinking having one published paper with it + the main author (you in this case, but we will use similar mechanism for other new representation later) agreeing that the code is ready for general use.

That seems like a good rule of thumb for this sort of thing. Thank you for thinking about this!

liam-o-marsh commented 5 months ago

oh no er.... I might have broken the PR... (I merged the changes from main) I forgot it targeted the system-with-data branch rather than the main branch... (speaking of, isn't system-with-data ready for prime time by now?)

Luthaf commented 5 months ago

speaking of, isn't system-with-data ready for prime time by now?

Yes, it should be. I mainly need to document and merge it!

EDIT: PR breakage is fine, we can fix it =)