Closed rohanvarm closed 10 months ago
Hello @rohanvarm. The model is a pure decoder-only transformer, trained without any representation-learning objective, so I would not recommend using its embeddings for molecular prediction tasks (especially if you are not finetuning it).
Nevertheless, using molfeat, it is easy to define a new MolecularTransformer to extract the representation. I have pushed a detailed notebook here as an example: https://github.com/datamol-io/safe/blob/529003341ea75a50c172a9592307b1cc2f46b595/docs/tutorials/extracting-representation-molfeat.ipynb
I would recommend using one of the many pretrained models in molfeat instead!
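For context, "extracting the representation" here usually means pooling the model's token-level hidden states into one fixed-size vector per molecule, which is what molfeat handles internally. A minimal sketch of that pooling step with synthetic data (the shapes, the 768 dimension, and the `mean_pool` helper are illustrative, not part of molfeat's API):

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings, ignoring padded positions.

    hidden_states: (n_tokens, dim) array of per-token embeddings.
    attention_mask: (n_tokens,) array of 1s (real tokens) and 0s (padding).
    Returns a single (dim,) molecule-level vector.
    """
    mask = attention_mask[:, None].astype(float)   # (n_tokens, 1)
    summed = (hidden_states * mask).sum(axis=0)    # sum over real tokens only
    count = mask.sum()                             # number of real tokens
    return summed / count

# Synthetic example: 5 tokens (last 2 are padding), 768-dim hidden states.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 768))
mask = np.array([1, 1, 1, 0, 0])

vec = mean_pool(hidden, mask)
print(vec.shape)  # one 768-dim vector per molecule
```

In practice you would not write this yourself: a molfeat transformer (or the notebook linked above) produces the pooled vector directly from a SMILES string.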
Best.
Hi, I see, that makes sense. What about some sort of vector DB containing, say, all of ZINC? Could you not use the model then as the discriminator? I'm just trying to think of potential use cases beyond molecular design.
If your goal is something like similarity search or some form of fragment-based virtual screening, this might actually be quite promising.
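As a sanity check of the similarity-search idea, the nearest-neighbour step over a precomputed embedding table can be sketched in plain NumPy (the embeddings below are synthetic stand-ins; a real setup would index actual ZINC embeddings in a vector database or an ANN library):

```python
import numpy as np

def top_k_similar(query, database, k=3):
    """Return indices of the k database rows most cosine-similar to query."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every row
    return np.argsort(-sims)[:k]       # highest similarity first

# Synthetic 768-dim embeddings standing in for a ZINC-scale table.
rng = np.random.default_rng(42)
database = rng.normal(size=(100, 768))
query = database[17] + 0.01 * rng.normal(size=768)  # near-duplicate of row 17

print(top_k_similar(query, database, k=3))  # row 17 should rank first
```

Brute-force cosine search like this is fine for a few million vectors; beyond that an approximate index is the usual choice.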
I suspect that finetuning on some predictive tasks could also enhance the representation quality, or yield fair predictive models with the added benefit of fragment-based explainability. These are still speculations that I would personally like to investigate when time allows.
Should you experiment with any of these ideas or try something different, I would be grateful if you could share your findings with me.
Hi, what is the easiest way to extract the embedding layer so we can get a 768-dim (or x-dim) vector for every molecule?