Closed thegodone closed 2 years ago
So what should be possible: if you name them identically, like file.csv and file.sdf, and put them in the same folder Folder, you can create a MoleculeNetDataset(data_directory="/Folder", file_name="file.csv", dataset_name="Dataset") on top. Just call read_in_memory directly, or call prepare_data(overwrite=False) to skip SDF file generation.
Otherwise I can add a from_sdf_file() to MolecularGraphRDKit if you want, but you would have to set up the dataset manually. MolecularGraphRDKit is the interface that exposes graph properties from an RDKit mol object.
To clarify, I receive the OCHEM molecules as a 3D file, because our system (multiple delocalised slave servers) works as a distributed computing farm, and a storage of 3D molecules is already in place. We also use Corina as the 3D generator (a pharma software with a high reputation). So my wish is to still get access to the custom RDKit featurization for graphs. Your proposal to add a from_sdf_file looks good to me, if I can read both "name".sdf and "name".csv at the same time, so that I also get the targets linked to those 3D molecules. My only concern is that all molecules are stored in one single SDF file and the targets in one single CSV file (so maybe the first solution is better, since both files are named identically with the two extensions .csv and .sdf?).
Okay, so I understand better now, I think. The current dataset classes are all MemoryGraphDatasets, so the dataset must fit into memory. We have plans and code fragments for Loaders and tf.Dataset readers that can access some sort of database and load training data on the fly, but that is still on the TODO list. For MemoryGraphDataset it is simply easier to load one file once. But we have the file_directory property, so I can add functionality to also have (besides smiles) a column "file_names" in file.csv that collects the single 3D files (without generating structures) and stores them into a single .sdf file. Note that you would have to run prepare_data(overwrite=True) if you add mol files to the file_directory.
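To make the "file_names" idea above concrete, here is a minimal sketch of how single mol files listed in a CSV column could be collected into one SDF file. It is not kgcnn code: the helper name merge_mol_files and its parameters are hypothetical, and it only assumes that each listed file holds one mol block and that SDF records are terminated by a "$$$$" line.

```python
import csv
from pathlib import Path


def merge_mol_files(csv_path, mol_dir, sdf_path, column="file_names"):
    """Concatenate the mol files listed in a CSV column into one SDF file.

    Hypothetical helper for illustration only, not part of kgcnn.
    Returns the number of records written.
    """
    mol_dir = Path(mol_dir)
    count = 0
    with open(csv_path, newline="") as f_csv, open(sdf_path, "w") as f_sdf:
        for row in csv.DictReader(f_csv):
            # Read the single 3D file as-is; no structure is generated here.
            block = (mol_dir / row[column]).read_text().rstrip("\n")
            # Drop a trailing record separator if the file already has one.
            if block.endswith("$$$$"):
                block = block[: -len("$$$$")].rstrip("\n")
            # Each SDF record ends with a "$$$$" line.
            f_sdf.write(block + "\n$$$$\n")
            count += 1
    return count
```

Because the records are written in CSV row order, entry i of the resulting SDF file corresponds to row i of the CSV file.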
Regarding the custom RDKit featurization for graphs, MoleculeNetDataset.set_attributes should be fully customizable: you can pass functions in place of string arguments as well as a custom_callback, and I will also add a custom_transform parameter that can modify the molecule before graph extraction.
Indeed, all molecules are preprocessed and delivered by OCHEM in one single SDF file containing all the molecules, together with a CSV file that contains the "targets" and "smiles" line by line. The only issue is that a few SMILES may not be recognized by RDKit, so my only concern is the alignment of SDF entries and valid SMILES.
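The alignment concern can be illustrated with a small standard-library sketch. It assumes the SDF records and the CSV rows come in the same order and in equal number, pairs them by position, and keeps the original index of every pair that passes a validity check. The names split_sdf, align_records and is_valid are hypothetical; the default check only looks for an "M  END" line, and in practice one could pass a function that tries to parse the block with RDKit instead.

```python
import csv


def split_sdf(text):
    """Split SDF text into one string per record (records end with '$$$$')."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def align_records(sdf_path, csv_path, is_valid=lambda block: "M  END" in block):
    """Pair SDF records with CSV rows by position and drop invalid pairs.

    Returns a list of (original_index, mol_block, csv_row) tuples.
    Assumption: record i of the SDF file belongs to row i of the CSV file.
    """
    with open(sdf_path) as f:
        records = split_sdf(f.read())
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if len(records) != len(rows):
        raise ValueError(
            f"{len(records)} SDF records but {len(rows)} CSV rows; cannot align"
        )
    # Filter pairs together so molecules and targets never get shifted.
    return [
        (i, block, row)
        for i, (block, row) in enumerate(zip(records, rows))
        if is_valid(block)
    ]
```

Dropping the molecule and its row together, rather than cleaning each file separately, is what keeps targets from shifting onto the wrong structure.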
I tried method 1: overwrite=False
Hey, I cannot really tell from the error log without having seen the SDF file, but assuming there was one molecule without a conformer, running map before clean could lead to this error. I changed the return behaviour of node_coordinates.
Can you check again with the latest git version? Otherwise you may have to send me the SDF plus CSV file.
Hey, I tested it with the code snippet below and it worked fine for me.
from kgcnn.data.moleculenet import MoleculeNetDataset
dataset = MoleculeNetDataset(data_directory="Archive", file_name="train.csv")
dataset.prepare_data(overwrite=False)
dataset.read_in_memory(label_column_name="Result0")
dataset.set_attributes()
dataset.map_list(method="set_range", max_distance=4)
print(dataset[0])
Is there a way to directly read my 3D molecule coordinates as an RDKit "mol" object (generated by Corina or Balloon) from a given SDF file, together with my targets from a given CSV file?
Maybe something like this can be used:
from rdkit.Chem import PandasTools
fn = 'file.sdf'
df = PandasTools.LoadSDF(fn, embedProps=True, molColName=None)
Indeed, it would be nice to be able to convert the mols in a pandas DataFrame into graphs.