dmamur / elembert


How to generate types of atoms from smiles? #3

Closed pnhuy closed 1 month ago

pnhuy commented 1 month ago

Dear team,

Thank you very much for your contribution.

While running your notebooks, I wondered how the types (v0, v1, v2) in your dataset are generated.

I saw the function getRawInputs, which uses the ase Atoms data structure, but I don't know how to generate the ase Atoms objects.

Some solutions I found suggest generating the conformation with rdkit's EmbedMolecule:

Smiles --[rdkit]--> Mol --[EmbedMolecule]--> Conformation --[ase]--> Atoms --> Types

But this failed in many cases because rdkit was unable to generate the conformation, e.g. for the SMILES CCOC(=O)CCC(C)=O in toxic_nr-aromatase_ds.csv.

Could you please explain in more detail how the types are generated?

Thank you very much!

2shakir commented 1 month ago

Hi,

Thank you for your question. V0 is simply the list of atomic symbols in the compound. For example, for CH4, V0 would be: C,H,H,H,H. You can find examples at this link. V1 represents subtypes obtained by unsupervised classification. Examples can be found here. V2 is presented as an example: the number of subtypes can be increased independently of the oxidation states, which increases the number of model parameters. This will be described in future work. If RDKit cannot generate the atomic symbols, it is better to use Open Babel or another tool.
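Since V0 is just the list of atomic symbols, it can in principle be read from the molecular graph without generating a 3D conformation. The following is a minimal sketch under that assumption, using RDKit; `smiles_to_v0` is a hypothetical helper name, and whether the dataset's V0 uses exactly this atom ordering and includes explicit hydrogens for every entry is an assumption based on the CH4 example above.

```python
from rdkit import Chem


def smiles_to_v0(smiles: str, add_hydrogens: bool = True) -> list[str]:
    """Hypothetical helper: list of atomic symbols for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    if add_hydrogens:
        # Assumption: hydrogens are listed explicitly, as in the CH4 example.
        mol = Chem.AddHs(mol)
    return [atom.GetSymbol() for atom in mol.GetAtoms()]


# Methane, written as "C" in SMILES
print(",".join(smiles_to_v0("C")))
```

No call to EmbedMolecule is involved here, so this step does not depend on conformer generation succeeding.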

pnhuy commented 1 month ago

Hi @2shakir, thank you very much, your answer makes sense to me.