Closed rohanvarm closed 10 months ago
Hello @rohanvarm. The model is a pure decoder-only transformer, trained without any representation-learning objective, so I would not recommend using its embeddings for molecular prediction tasks (especially if you are not finetuning it).
Nevertheless, using molfeat, it is easy to define a new MolecularTransformer to extract the representation. I have pushed a detailed notebook here as an example: https://github.com/datamol-io/safe/blob/529003341ea75a50c172a9592307b1cc2f46b595/docs/tutorials/extracting-representation-molfeat.ipynb
I would recommend using one of the many pretrained models in molfeat instead!
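For context, "extracting the representation" here usually means pooling the model's token-level hidden states into one fixed-size vector per molecule, which is what molfeat handles internally. A minimal sketch of that pooling step with synthetic data (the shapes, the 768 dimension, and the `mean_pool` helper are illustrative, not part of molfeat's API):

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings, ignoring padded positions.

    hidden_states: (n_tokens, dim) array of per-token embeddings.
    attention_mask: (n_tokens,) array of 1s (real tokens) and 0s (padding).
    Returns a single (dim,) molecule-level vector.
    """
    mask = attention_mask[:, None].astype(float)   # (n_tokens, 1)
    summed = (hidden_states * mask).sum(axis=0)    # sum over real tokens only
    count = mask.sum()                             # number of real tokens
    return summed / count

# Synthetic example: 5 tokens (last 2 are padding), 768-dim hidden states.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 768))
mask = np.array([1, 1, 1, 0, 0])

vec = mean_pool(hidden, mask)
print(vec.shape)  # one 768-dim vector per molecule
```

In practice you would not write this yourself: a molfeat transformer (or the notebook linked above) produces the pooled vector directly from a SMILES string.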
Best.
Hi, I see, that makes sense. What about some sort of vector DB containing, say, all of ZINC? Could you not use the model then as the discriminator? I'm just trying to think of potential use cases beyond molecular design.
If your goal is something like similarity search or some form of fragment-based virtual screening, this might actually be quite promising.
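As a sanity check of the similarity-search idea, the nearest-neighbour step over a precomputed embedding table can be sketched in plain NumPy (the embeddings below are synthetic stand-ins; a real setup would index actual ZINC embeddings in a vector database or an ANN library):

```python
import numpy as np

def top_k_similar(query, database, k=3):
    """Return indices of the k database rows most cosine-similar to query."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every row
    return np.argsort(-sims)[:k]       # highest similarity first

# Synthetic 768-dim embeddings standing in for a ZINC-scale table.
rng = np.random.default_rng(42)
database = rng.normal(size=(100, 768))
query = database[17] + 0.01 * rng.normal(size=768)  # near-duplicate of row 17

print(top_k_similar(query, database, k=3))  # row 17 should rank first
```

Brute-force cosine search like this is fine for a few million vectors; beyond that an approximate index is the usual choice.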
I suspect that finetuning on some predictive tasks could also enhance the representation quality, or yield fair predictive models with the added benefit of fragment-based explainability. These are still speculations that I would personally like to investigate when time allows.
Should you experiment with any of these ideas or try something different, I would be grateful if you could share your findings with me.
Hi, what is the easiest way to extract the embedding layer so we can get a 768-dim (or x-dim) vector for every molecule?