datamol-io / graphium

Graphium: Scaling molecular GNNs to infinity.
https://graphium-docs.datamol.io/
Apache License 2.0

Global molecular features support #518

Closed VladVin closed 2 weeks ago

VladVin commented 3 months ago

Hey Graphium community,

I'm training a bunch of models with different featurizers, and I've come to understand that there seems to be no support for global molecular features (properties) within the models. The MolGPS paper, at least, mentions the use of global features in MPNN++, but I see neither featurizers of that type nor configurations for them. The closest thing I was able to find is this method, but it is marked as deprecated.

Are global molecular properties somehow represented by virtual nodes, or am I missing something?

DomInvivo commented 3 months ago

Hello @VladVin, global features are supported through positional encodings, such as eigenvalues. Right now, they are simply concatenated to all node features. Ideally, they would have a separate encoder feeding directly into the virtual node. See Issue #234.

What kind of global molecular features are you looking for?
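
To illustrate the "concatenated to all node features" behaviour described above, here is a minimal NumPy sketch. It is not Graphium's code: the function name `concat_global_to_nodes`, the array shapes, and the example values are all assumptions made for illustration.

```python
import numpy as np


def concat_global_to_nodes(node_feats: np.ndarray, global_feats: np.ndarray) -> np.ndarray:
    """Append the same graph-level vector to every node's feature row.

    node_feats:   shape (num_nodes, d_node)
    global_feats: shape (d_global,), e.g. eigenvalues used as a positional encoding
    (Hypothetical helper; names and shapes are illustrative assumptions.)
    """
    num_nodes = node_feats.shape[0]
    # Repeat the global vector once per node: (num_nodes, d_global)
    tiled = np.tile(global_feats, (num_nodes, 1))
    # Widen each node's feature row with the shared global features
    return np.concatenate([node_feats, tiled], axis=1)


# Hypothetical molecule: 3 atoms, 4 node features, 2 global features
node_feats = np.arange(12, dtype=float).reshape(3, 4)
global_feats = np.array([0.5, 1.5])

out = concat_global_to_nodes(node_feats, global_feats)
print(out.shape)  # (3, 6)
```

Every node ends up carrying an identical copy of the global vector, which is why a dedicated encoder feeding the virtual node would be a cleaner design.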

VladVin commented 2 months ago

Hey @DomInvivo,

Thank you for the reply. An update: I was exploring the MPNN++ and GPS++ models, and after reading the GPS++ paper I realized that the MPNN++ model has no support for global features, even though the MPNNPlusPyg module has a use_globals parameter that is never used within the class (see code). If you are saying that global features are instead supported through positional encodings, that may make sense, although it differs from what was proposed in the paper.

P.S. Originally, I was thinking that global features are something similar to what is used in the Chemprop architecture, i.e. 200+ RDKit molecular features. But I updated my understanding after reading a series of papers on GNNs: Graphormer, MPNN++, GPS++ and MolGPS.

DomInvivo commented 2 months ago

If you look at the virtual nodes, they give you something very similar to what is proposed in the GPS++ paper: they pool the nodes and edges, apply an MLP, and concatenate the result back into the message passing. It's almost identical to the paper, but the order of operations varies slightly.

https://github.com/datamol-io/graphium/blob/d12df7e06828fa7d7f8792141d058a60b2b2d258/graphium/nn/pyg_layers/pooling_pyg.py#L165
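
The pool, MLP, concatenate pattern described above can be sketched roughly as follows. This is a simplified single-graph illustration in NumPy, not the implementation in `pooling_pyg.py`: the function name `virtual_node_step`, the choice of sum pooling, the two-layer MLP, and all dimensions are assumptions.

```python
import numpy as np


def virtual_node_step(node_feats, edge_feats, w1, b1, w2, b2):
    """One simplified virtual-node update for a single graph (illustrative only).

    1. Pool node and edge features into one graph-level vector (sum pooling assumed).
    2. Pass the pooled vector through a small two-layer MLP.
    3. Concatenate the resulting virtual-node vector back onto every node.
    """
    pooled = np.concatenate([node_feats.sum(axis=0), edge_feats.sum(axis=0)])
    hidden = np.maximum(pooled @ w1 + b1, 0.0)  # linear layer followed by ReLU
    vn = hidden @ w2 + b2                       # virtual-node (global) state
    vn_per_node = np.tile(vn, (node_feats.shape[0], 1))
    return np.concatenate([node_feats, vn_per_node], axis=1), vn


# Hypothetical sizes and randomly initialised weights
rng = np.random.default_rng(0)
num_nodes, num_edges = 5, 4
d_node, d_edge, d_hidden, d_vn = 8, 3, 16, 6

node_feats = rng.normal(size=(num_nodes, d_node))
edge_feats = rng.normal(size=(num_edges, d_edge))
w1 = rng.normal(size=(d_node + d_edge, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(size=(d_hidden, d_vn))
b2 = np.zeros(d_vn)

new_nodes, vn = virtual_node_step(node_feats, edge_feats, w1, b1, w2, b2)
print(new_nodes.shape, vn.shape)  # (5, 14) (6,)
```

In a real model the widened node features would feed the next message-passing layer, and the exact ordering of pooling, MLP, and concatenation may differ from this sketch.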

DomInvivo commented 2 weeks ago

I am closing this issue as it was resolved in my last comment. If you believe it was not answered, feel free to re-open it.