crcollins / molml

A library to interface molecules and machine learning.
MIT License
65 stars 17 forks source link

Error in BagOfBounds #2

Closed rcprati closed 5 years ago

rcprati commented 5 years ago

Hi,

I think there is also a problem with your BoB implementation. When you compute the mask, you use the outer product of first and second element Boolean vectors. However, this does not produce a symmetrical matrix, as you are supposing. Consider this example:

a = numpy.array([0,1,0,0]) b = numpy.array([1,0,0,1]) numpy.outer(a,b)

This mean that one atom represented in a is connected with two atoms represented in b. However, when you compute the outer product, you get the matrix:

array([[0, 0, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]])

And if you take the upper or lower triangular, you get only one connection and loose the other. In order to get a symmetrical matrix, you need to consider the both sides of the outer product

numpy.outer(a,b)+numpy.outer(b,a)

getting

array([[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0]])

So you can take the upper or lower triangular matrix to form the mask

crcollins commented 5 years ago

Yep, good catch. Adding a fix for this now!

Thanks!

rcprati commented 5 years ago

Never mind.

I am implementing these methods in R (so I can better understand them) and using yours to validade, and this one was not matching the output. Thanks for the good work by the way.

Regards,

Em sex, 24 de mai de 2019 às 20:37, Chris Collins notifications@github.com escreveu:

Yep, good catch. Adding a fix for this now!

Thanks!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/crcollins/molml/issues/2?email_source=notifications&email_token=AA3RSUZUHSTVENZZKW2M7O3PXB34JA5CNFSM4HPS74U2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWGYNSA#issuecomment-495814344, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3RSU2DHP72G6AGBVT74QTPXB34JANCNFSM4HPS74UQ .

crcollins commented 5 years ago

Actually, this was a bit more nuanced. The issue that you mentioned is actually does not matter for this setting. But it is only because the elements are being sorted. As in, there are no cases where the lower triangle was getting populated because the true values in the second vector will always have the same index or higher than the first vector (restricting it to the upper triangle).

Now that being said, in looking back at this code, it looks like I added a bug in fd06dfbd by not using the sorted values for the Coulomb Matrix computation. Which as you can probably guess, is a bad thing.

So again, thanks for bringing this up. Otherwise, I would have not found that issue for a long time! The "cosmetic" fix has been added in 6ac5cae2.

rcprati commented 5 years ago

I see your point.

I tried to fix my own version with my suggestion, but I realized that the bug is due to sorting. On line 1314, you need to pass a sorted version of data.numbers, otherwise you will compute the wrong coulomb matrix. This is my fix (which now matches my implementation):

    temp = sorted(zip(data.elements, data.coords,data.numbers), key=lambda x: x[0])
    elements, coords, numbers = zip(*temp)

    bags = {k: [0 for i in range(v)] for k, v in self._bag_sizes}
    coulomb_matrix = get_coulomb_matrix(numbers, coords)

Regards.

crcollins commented 5 years ago

Yeah, that is now fixed with 5adb5e7d. I got pretty turned around with this when I noticed that it was performing much worse with the added outer product change, which did not make sense. So I checked both, and they were the same... so I was really confused until I saw that typo.

Best of luck with learning these things. Let me know if you have any questions or if you find any more of my silly bugs. I am sure there are more. :)

rcprati commented 5 years ago

Thanks,

A last point in BagOfBounds. The paper

https://pubs.acs.org/doi/pdf/10.1021/acs.jpclett.5b00831

also seems to include bags for the main diagonal. Although it is not highlighted in gray in 3(b), in Figure 3(c) and 3(d) they are included.

Regards.

crcollins commented 5 years ago

Yeah, that is an option that I did not include in this implementation. I have it in other versions of BoB that I have written. The diagonal terms seem to only make predictions worse. Really, those terms end up being a really poor approximation of atom counts, and in practice those counts are probably better. There are other modifications that can be made (ex: depth limiting the interaction terms, tuning the decay function, etc) that can also be added but whether or not it is worthwhile to have is debatable.

When this code was originally written, I had the philosophy of minimizing these kinds of choices to give the best overall versions with hyperparamters, but over time I have allowed MolML to gain a bit of bloat in that regard. So, I guess I will add these things back in at some point in the near future.