aimat-lab / gcnn_keras

Graph convolutions in Keras with TensorFlow, PyTorch or Jax.
MIT License

how to read directly 3D from a sdf file #59

Closed thegodone closed 2 years ago

thegodone commented 2 years ago

Is there a way to directly read my 3D molecule structures (coordinates generated with CORINA or Balloon) as RDKit "mol" objects from a given SDF file, together with my targets from a given CSV file?

Maybe something like this could be used:

from rdkit.Chem import PandasTools

fn = 'file.sdf'
df = PandasTools.LoadSDF(fn, embedProps=True, molColName=None)

Indeed, it would be nice to be able to convert the mols in a pandas DataFrame into graphs.

PatReis commented 2 years ago

What should already be possible: if the two files share a base name, like file.csv and file.sdf, and you put them in the same folder Folder, you can create a MoleculeNetDataset(data_directory="/Folder", file_name="file.csv", dataset_name="Dataset") on top of them. Then just call read_in_memory directly, or call prepare_data(overwrite=False), to skip SDF file generation.
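A minimal sketch of the naming convention described above, using only the standard library: it checks that a CSV file has an SDF sibling with the same base name in the same folder. The helper name find_paired_sdf is hypothetical and is not part of kgcnn.

```python
from pathlib import Path
from typing import Optional


def find_paired_sdf(data_directory: str, file_name: str) -> Optional[Path]:
    """Return the SDF path that shares the CSV's base name, or None if absent.

    Hypothetical helper for illustration only; it is not part of kgcnn.
    """
    csv_path = Path(data_directory) / file_name
    sdf_path = csv_path.with_suffix(".sdf")  # file.csv -> file.sdf
    if csv_path.is_file() and sdf_path.is_file():
        return sdf_path
    return None
```

Running such a check before building the dataset makes it obvious whether an existing SDF file would be picked up or a new one would have to be generated.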

PatReis commented 2 years ago

Otherwise I can add a from_sdf_file() to MolecularGraphRDKit if you want, but you would have to set up the dataset manually. MolecularGraphRDKit is the interface that exposes graph properties from an RDKit mol object.

thegodone commented 2 years ago

To clarify: I receive the OCHEM molecules as a 3D file, because our system (multiple distributed worker servers forming a computing farm) also stores the 3D structures of molecules it has already seen. We use CORINA as the 3D generator (a highly reputed software in pharma). So my wish is to still have access to the custom RDKit featurization for graphs. Your proposal to add a from_sdf_file looks good to me, provided I can read "name".sdf and "name".csv together, so that my targets are linked to those 3D molecules. My only concern is that all molecules are stored in one single SDF file and all targets in one single CSV file. Maybe the first solution is better then, since both files share the same name with the two extensions .csv and .sdf?

PatReis commented 2 years ago

Okay, I think I understand better now. The current dataset classes are all MemoryGraphDatasets, so the dataset must fit into memory. We have plans and code fragments for Loaders and tf.Dataset readers that can access some sort of database and load training data on the fly, but that is still on the TODO list. For a MemoryGraphDataset it is simply easier to load one file once. But we have the file_directory property, so I can add functionality to also have (besides SMILES) a column "file_names" in file.csv. The single 3D files listed there would be collected (without generating structures) and stored in a single .sdf file. Note that you would have to run prepare_data(overwrite=True) if you add mol files to the file_directory.
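The idea of collecting single 3D files into one .sdf file can be sketched with the standard library alone. This is not the kgcnn implementation; the helper name collect_mol_files_to_sdf is hypothetical, and the sketch assumes each listed file is a plain mol file and that the CSV column is called "file_names" as mentioned above.

```python
import csv
from pathlib import Path


def collect_mol_files_to_sdf(csv_path, file_directory, sdf_path, column="file_names"):
    """Concatenate the mol files listed in a CSV column into one SDF file.

    Returns the number of records written. Hypothetical helper, not kgcnn code.
    A missing mol file raises FileNotFoundError instead of being skipped, so
    the SDF records stay in the same order as the CSV rows.
    """
    count = 0
    with open(csv_path, newline="") as csv_file, open(sdf_path, "w") as sdf_file:
        for row in csv.DictReader(csv_file):
            mol_block = (Path(file_directory) / row[column]).read_text()
            # An SDF record is a mol block followed by a "$$$$" delimiter line.
            sdf_file.write(mol_block.rstrip("\n") + "\n$$$$\n")
            count += 1
    return count
```

Because records are written in CSV row order, the n-th SDF entry corresponds to the n-th CSV row, which keeps structures and labels aligned.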

Regarding the custom RDKit featurization for graphs, MoleculeNetDataset.set_attributes should be fully customizable: you can pass functions in place of string arguments and use a custom_callback. I will also add a custom_transform parameter that can modify the molecule before graph extraction.

thegodone commented 2 years ago

Indeed, all molecules are preprocessed and delivered by OCHEM in one single SDF file containing all the molecules, along with a CSV file that contains the "targets" and "smiles" line by line. The only issue is that a few SMILES may not be recognized by RDKit, so my only concern is the alignment of the SDF entries with the valid SMILES.
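One way to guard against such misalignment is to pair SDF records and CSV rows by an identifier rather than by position. The sketch below uses only the standard library and rests on assumptions not stated in this thread: that each SDF record's title line (its first line) holds an ID, and that the CSV has a matching column, here hypothetically called "id". Both function names are hypothetical as well.

```python
import csv


def read_sdf_records(sdf_path):
    """Split an SDF file into raw record strings, one per "$$$$" delimiter."""
    records, current = [], []
    with open(sdf_path) as handle:
        for line in handle:
            if line.startswith("$$$$"):
                records.append("".join(current))
                current = []
            else:
                current.append(line)
    # Trailing text without a closing "$$$$" line is ignored.
    return records


def align_by_title(sdf_path, csv_path, id_column="id"):
    """Pair CSV rows with SDF records whose title line equals the row's ID.

    Returns (matched, missing): matched is a list of (row, record) tuples in
    CSV order, missing lists the IDs that have no SDF record.
    """
    by_title = {}
    for record in read_sdf_records(sdf_path):
        title = record.split("\n", 1)[0].strip()
        by_title[title] = record
    matched, missing = [], []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            record = by_title.get(row[id_column])
            if record is None:
                missing.append(row[id_column])
            else:
                matched.append((row, record))
    return matched, missing
```

Rows that end up in missing can then be dropped from the CSV before the dataset is built, so that labels and structures stay in step.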

thegodone commented 2 years ago

I tried method 1, with overwrite=False:

[Image: screenshot of the error log]

PatReis commented 2 years ago

Hey, I cannot really tell from the error log without having seen the SDF file. But assuming there was one molecule without a conformer, running map before clean could lead to this error. I changed the return behaviour of node_coordinates. Can you check again with the latest git version? Otherwise you may have to send me the SDF plus CSV file.

PatReis commented 2 years ago

Hey, I tested it with the code snippet below and it worked fine for me.

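from kgcnn.data.moleculenet import MoleculeNetDataset

# train.csv and train.sdf sit side by side in the "Archive" folder.
dataset = MoleculeNetDataset(data_directory="Archive", file_name="train.csv")
# overwrite=False skips SDF file generation and keeps the existing SDF file.
dataset.prepare_data(overwrite=False)
# Load the molecules together with the labels from the "Result0" CSV column.
dataset.read_in_memory(label_column_name="Result0")
# Extract graph attributes with the default settings.
dataset.set_attributes()
# Apply the "set_range" method to every graph with max_distance=4.
dataset.map_list(method="set_range", max_distance=4)
print(dataset[0])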