crcollins / molml

A library to interface molecules and machine learning.

Missing Br (and potentially others) #6

Closed sooheon closed 4 years ago

sooheon commented 4 years ago

cm.fit_transform([[['Br'], [[0, 0, 0]]]])

=>

KeyError: 'Br'

I guess there's a lookup table for charges somewhere, and Br and potentially other atoms are missing from it?

crcollins commented 4 years ago

Yeah, this is expected behavior. The atomic constants are outside of the scope of this library (for better or worse). The ones that are included are there as seed data for simple tests.

See here for an example of adding them: https://github.com/crcollins/molml/blob/master/examples/missing_constants.py TL;DR: add the values needed to the dicts before calling fit_transform.

Ideally, there would be a simple (read pure Python) library that had all the needed constants (numbers, symbols, distances, etc) without the bulk of a full library.

Here are the few constants MolML includes by default. https://github.com/crcollins/molml/blob/master/molml/constants.py Depending on how you installed MolML, the Br constants may not be included. They were added after the most recent release.
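
For concreteness, a minimal sketch of the "add it yourself" workflow, assuming the element-to-atomic-number mapping is exposed as molml.constants.ELE_TO_NUM (double-check the exact dict names against the constants.py linked above for your installed version):

```python
# Register the missing Br constant before featurizing.
# ELE_TO_NUM is an assumption here; verify the name in molml/constants.py.
import molml.constants
from molml.features import CoulombMatrix

molml.constants.ELE_TO_NUM['Br'] = 35  # bromine, Z = 35

cm = CoulombMatrix()
cm.fit_transform([(['Br'], [[0.0, 0.0, 0.0]])])
```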

sooheon commented 4 years ago

Gotcha. Is it a matter of adding these values to the bidict, or is there some nuance I'm not aware of, such as the values not being universal constants? (Coming from the ML side, I have no domain knowledge.)

crcollins commented 4 years ago

Yeah, you just put the atomic numbers there. The atomic number is the number of the element in the periodic table (also listed as Z): hydrogen is 1, carbon is 6, oxygen is 8, etc. I am assuming that is what you were referring to from that table. If you meant Z effective, that is slightly different, and sort of its own can of worms with respect to your “universal constants” remark. They are universal in a strict sense, but the details, I guess, are debatable.

In principle, you could use these. It is just a matter of adding them, either by replacing the values in the bidict or by making new “elements”, so to speak (this is part of the reason the constants are not all included). For example, you could add “C_zeff” as an element and use the Z effective value for that. This has potential issues on the reverse side of the bidict due to float comparisons, but I think it is fine if you only key on the elements.
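
A rough sketch of that pseudo-element idea (the dict name, the “C_zeff” label, and the 3.25 value are all illustrative, not something MolML ships):

```python
# Illustrative only: register a pseudo-element keyed on a Z-effective value.
# Check molml.constants for the actual mapping name in your installed version.
import molml.constants

# e.g. the Slater-rule Z_eff for a carbon 2p electron is about 3.25
molml.constants.ELE_TO_NUM['C_zeff'] = 3.25

# Molecules would then list 'C_zeff' in place of 'C' in their element arrays.
# As noted above, float values can make the reverse lookup in the bidict
# fragile, so only key on the element symbols.
```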

In practice though, I don’t think Z values are meaningful outside of being a separate input label class. From my own work, I have noticed most things are dominated by distance interactions. Or in ML parlance, Z values produce poor embedding spaces when describing how atoms work in molecules.

sooheon commented 4 years ago

Sounds good. Since coulomb matrices are derived from distance + domain info, and the domain info is full of caveats, I'm also leaning toward just using the distance information to produce embeddings. Thanks!

crcollins commented 4 years ago

Well, it isn’t that it is solely distance, but it is a large fraction. You can test it by comparing all 1s for the atomic numbers vs using Zs vs using something like the Z effective you showed before. I don’t remember the numbers off hand, but for many tasks something like 95%+ comes from distance (all 1s is effectively only distances), another ~5% from Zs, and the rest is rounding error.

Though, you should strictly prefer BagOfBonds over CoulombMatrix. It resolves several issues with atom ordering, and it is based on the same thing (it is a better shuffling of the Coulomb matrix). Depending on your problem, though, you may want to do baseline comparisons with Connectivity features of different depths; that may work well enough. Or, to include distances rather than just connectivity, EncodedBond and EncodedAngle should resolve some scaling issues with the Coulomb matrix variants.
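
For reference, all of these go through the same fit_transform interface; a quick sketch on a toy molecule (constructor arguments here are illustrative, check the docs for the full options):

```python
# Sketch comparing the features mentioned above on a toy H2 molecule.
from molml.features import BagOfBonds, Connectivity, EncodedBond

mols = [(['H', 'H'], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]])]

bob = BagOfBonds().fit_transform(mols)            # ordering-robust Coulomb terms
conn = Connectivity(depth=2).fit_transform(mols)  # graph-only baseline
enc = EncodedBond().fit_transform(mols)           # smoothed distance histograms
```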

sooheon commented 4 years ago

BagOfBonds is just a reshaping of the CM to not repeat information across the diagonal (plus some considerations for padding to a total sequence length), right? I actually prefer the adjacency matrix shape (NxN), as I'm using it to augment attention between nodes a la MAT.

crcollins commented 4 years ago

I see. For stability, you may prefer using the sort=True option for CoulombMatrix. It may help (slightly) if you are not already using it.
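
That is just a constructor flag; a minimal sketch (my recollection is that the sorting is done by row norm, so double-check the docstring):

```python
# sort=True orders the Coulomb matrix (by row norm, if I recall correctly),
# which makes the output less sensitive to the input atom ordering.
from molml.features import CoulombMatrix

feat = CoulombMatrix(sort=True)
X = feat.fit_transform([(['H', 'H'], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]])])
```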

I don’t know the details of the model you reference, but in skimming the paper, I am not sure how much the Coulomb matrix will help. It seems like the main difference would be from the reciprocal distance rather than linear distance. Past work has shown that the reciprocal distances are better on their own, but with multimodal data I have not found the gains to be as significant. So, if it does perform better, I would suspect that would be the cause.

The function they are applying to the distance matrix seems odd to me, but I guess it works in their case. I am curious how their method handles the permutation problem.

sooheon commented 4 years ago

Permutation is not an issue because the transformer architecture is order agnostic. This is usually a downside for text modeling that they overcome with positional encoding, but it works for molecules. Sorting the CM would not help, as the permutation order does not matter, but it does matter that the indices for the Coulomb, adjacency, and distance matrices and the atom array match up.

The fn applied to the distance matrix is to normalize it into the (0, 1) range (and also to ensure the "weights" per row add up to 1 if softmax is used).

sooheon commented 4 years ago

Reciprocal distance is a good idea; in fact, taking the negative and then exponentiating does the same sort of thing (not to scale, but in terms of ordering).
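
A tiny numeric check of that ordering claim (the distances are made up):

```python
# softmax over negated distances ranks atom pairs the same way 1/d does
# (closer -> larger weight), even though the magnitudes differ.
import numpy as np

d = np.array([0.74, 1.5, 3.0, 6.0])       # example interatomic distances
w_recip = 1.0 / d                          # Coulomb-style weighting
w_soft = np.exp(-d) / np.exp(-d).sum()     # softmax of -d

print(np.argsort(-w_recip), np.argsort(-w_soft))  # same ranking: [0 1 2 3]
```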

crcollins commented 4 years ago

Hm. I may be wrong, but I was under the impression transformer models are not permutation invariant. Otherwise, when doing language modeling you would end up with a bag-of-words model with better embeddings.

For example, these two sentences are opposite due to a word permutation: “the series is promising and should not be missed” vs. “the series is not promising and should be missed”.

I do understand this resolves out with higher-order terms (and positional encoding), but I don’t think it is true when talking about a 3D structure instead of a 1D sentence. If I remember correctly, the bag of bonds paper mentioned different molecules with the same distance matrix.

Again, I haven’t read this paper (http://proceedings.mlr.press/v97/lee19d/lee19d.pdf), but it seems to address the permutation problem for general transformer models.

For the normalization via softmax, I will say that in some problems in chemistry, normalization severely hurts performance. If the normalization is done at the units level, it wouldn’t, though.

With the reciprocal distance, there may be some subtle things here. Sure the ordering may be right, but there are some strong physical bases for 1/x vs exp(-x). For example, Coulomb interactions are physically 1/x which causes problems for some real physics models that approximate them with exponentials due to poor characterization of long range interactions. Ideally, your network would be large enough for the universal function approximator property to kick in, but in that case I am not sure the Coulomb matrix would be adding any new information to the model.
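
To make the long-range point concrete:

```python
# 1/r vs exp(-r): the exponential decays far faster, so it underrepresents
# long-range Coulomb-like interactions.
import math

for r in (1.0, 2.0, 5.0, 10.0):   # distances in arbitrary units
    print(r, 1.0 / r, math.exp(-r))
# at r = 10: 1/r ~ 0.1, while exp(-r) ~ 4.5e-5
```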

But with all research things like this, who knows. All these things may be completely irrelevant for your particular models/datasets. Haha. Just try lots of things, and iterate, iterate, iterate.

sooheon commented 4 years ago

Otherwise, when doing language modeling you would end up with a bag of words model with better embeddings.

This is why positional encodings are added. By default, attention is solely "content addressed", and operates on unordered sets. Of course when you use its output elements, order may matter, but it also may not if you just mean/max pool. So you're right, transformer without positional encoding is fancy BoW :)

Set Transformer looks like multi-head attention without positional encoding, optionally a reduced-complexity version (ISAB), and a novel pooling method that uses attention again.

some strong physical bases for 1/x vs exp(-x)

Interesting. The only issue is that a self-distance of 0 is undefined, and the range is unbounded for tiny distances. I don't know how close two atoms can get in a molecule, but I wouldn't feel good about some nodes potentially getting inputs 1e6 times larger than others.

Just try lots of things, and iterate, iterate, iterate

100%

crcollins commented 4 years ago

Not to dive too deep into this, but there are a lot of interesting phenomena that pop up from these things that are fun to discuss. The details do not matter in aggregate, but they do have a habit of popping up in surprising places.

I was a bit careless with the position description, so you will have to forgive me. What I intended to say is that the positions of atoms and the permutation of them are two orthogonal axes (and not resolved with a standard transformer model). These are not issues with text because text is one dimensional. The positions can change, but permutation is different.

I guess the better analog is with image recognition and shift/scale/rotation invariance. There is a similar thing with molecules, but at an atom level that gets aggregated to the molecule. Atoms don’t “know” about other atoms so to speak, rather they feel the effects of a potential field. Traditionally, this would be handled with some kind of pooling like you mentioned. You can do data augmentation as well, but it doesn’t have the same guarantees.

At a very high level, any molecular model should have a few properties (there are many others one could consider, but I just highlight these):

  1. It should produce the same result independent of the order in which atoms/electrons/particles are read into the model.
  2. It should be independent of the chosen coordinate system (this also includes time)

These are nontrivial conditions that have been debated and fought with over the past hundred+ years. Superficially, by using floating point numbers these conditions are already broken due to the lack of associativity. On the ML side, not adhering to these leads to severe overfitting. Which, depending on your task/data, may or may not be a bad thing. I say “may” because these are all approximations and you can’t have it all (or at least there is not a grand unified theory, yet!).

The problem with the singularity at zero for 1/x is nuanced and doesn’t really matter for molecular property prediction. In some sense, the predictions only matter where the values are reasonable. As the distance decreases, the explosion of the numbers is indicative of the rate at which you are rocketing away from “chemistry” (chemistry -> nuclear chemistry/physics -> high energy physics -> ???). I don’t think many drug molecules or paints depend on nuclear fission/fusion, but maybe that is what is holding us back. :)

So, practically, if any atom pair distance is less than 0.5 Å (50 pm) then it is almost certainly bad input data. In fact, most (maybe all?) quantum chemistry software packages will reject molecules with interatomic distances shorter than that. In real terms, the maximum 1/x value would be around 2, though in reality, it would probably be around 1 as values greater than that correspond to special bonds/molecules (HF, H2, etc).

sooheon commented 4 years ago

Great stuff, thanks for diving in depth.

At a very high level, any molecular model should have a few properties (there are many others one could consider, but I just highlight these):

  1. It should produce the same result independent of the order in which atoms/electrons/particles are read into the model.
  2. It should be independent of the chosen coordinate system (this also includes time)

AIUI, transformer sans positional encoding has these properties. Transformer + inter-atomic distance encoding has shift/rotation invariance, but does not have scale invariance (unless you force it with preprocessing).

Superficially, by using floating point numbers these conditions are already broken due to the lack of associativity

What do you mean?

singularity at zero for 1/x is is nuanced and doesn’t really matter for molecular property prediction [...]

This is great insight, thanks.

Question: the coulomb matrix considers a pair of atoms, the distance between them, and measured nuclear charge values as input. Is this not interfered with by surrounding atoms? If we're considering two atoms at opposite ends of the molecule, all the intervening atoms kind of get in the way. Or even if there is nothing in between, wouldn't the inherent nuclear charge input value need to be adjusted depending on the kinds of bonds the atom happens to form?

crcollins commented 4 years ago

Superficially, by using floating point numbers these conditions are already broken due to the lack of associativity

What do you mean?

Floats lack associativity due to rounding when doing calculations. This will then yield different results based on the order in which the model does calculations. This is especially problematic with the large dot products done in ML models, as the errors accumulate.

So, for the conditions listed before: with the first, we can see that if the order of the additions matters, then the order of the atom input will matter. And with the second, these issues become more pronounced at very large and very small numbers (in magnitude), making things coordinate dependent. You may already be familiar with some of these issues coming up in ML with using the log likelihood instead of the likelihood, or from using special functions to compute sigmoids/softmaxes, or as the reason for preferring numbers with mean 0 and std 1. This page describes some of the issues that arise (particularly the Ambiguity section).
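
A concrete illustration of the non-associativity:

```python
# Float addition is not associative: the same terms grouped differently give
# different results, and the error grows with long sums/dot products.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False
print((a + b) + c, a + (b + c))     # 0.6000000000000001 0.6
```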

Question: the coulomb matrix considers a pair of atoms, the distance between them, and measured nuclear charge values as input. Is this not interfered with by surrounding atoms? If we're considering two atoms at opposite ends of the molecule, all the intervening atoms kind of get in the way. Or even if there is nothing in between, wouldn't the inherent nuclear charge input value need to be adjusted depending on the kinds of bonds the atom happens to form?

As a nitpick, it is not the measured charge used in the Coulomb matrix, just the atomic number. The measured charge would require calculations that are probably a few orders of magnitude more expensive than an ML model (in general).

To your actual question, this is sort of what I mean that atoms don’t really “know” about other atoms (this may be a bit philosophical and debatable). Essentially, each atom can be imagined as floating in some potential field that is induced by the other atoms. If you could remove the other “atoms”, but somehow retain that same field (say with carefully placed magnets), you should expect the same properties.

For the details of that field, that is complex, and the fact that we look at systems with more than 1 electron makes it such that it is not directly solvable. So, for the past hundred years, we have been making approximations that are solvable but still produce reasonable results. Now this lineage leads to your efforts with ML, so make ‘em proud.

If you want to see the gory details of that potential, this slide shows the Schrödinger equation (what is used in quantum mechanics to compute these properties), which includes the potential (the V term). You will notice the prevalence of the 1/x terms that we discussed before.
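
For reference, in atomic units the potential for a molecule is (electrons i, j; nuclei A, B with atomic numbers Z):

```latex
V = -\sum_{i,A} \frac{Z_A}{r_{iA}} + \sum_{i<j} \frac{1}{r_{ij}} + \sum_{A<B} \frac{Z_A Z_B}{R_{AB}}
```

All three pieces are 1/r Coulomb terms, which is where the 1/x forms above come from.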

To your point about refining the charges, past experience has shown that these values do not add much to predictions as they are highly distance dominated. That being said, there has been tons of work on different refinements to varying degrees of success. I will note though, many properties/molecule sets can be very dependent on the representation. So, it may matter in your case.

Lastly, with the kinds of bonds: bonds are not really physical things per se, more of a mental model we use to think about molecules. This is especially true when dealing with real molecule coordinates, as bond types are basically just coarse-grained buckets on the distances between atoms. This is the basis for representations like EncodedBond and other smoothed distance-histogram methods, because bond types set artificial constraints. Now, if you only have the molecular graph and not the coordinates, then they are a rough proxy for distance. So, as with anything in ML, they can be added as features/information. Past work has shown reasonable benefits from mixes of different kinds of information like this.