DeepGraphLearning / torchdrug

A powerful and flexible machine learning platform for drug discovery
https://torchdrug.ai/
Apache License 2.0

ProteinLigandDataset.load_sequence optimization #138

Open peiyaoli opened 1 year ago

peiyaoli commented 1 year ago

Hi, thanks for your work on this library, especially the TorchProtein part.

One suggestion is to optimize load_sequence. In the current implementation:

        for i, (sequence, smile) in enumerate(zip(sequences, smiles)):
            if i >= cum_num_samples[_cur_split]:
                _cur_split += 1
            if not self.lazy or len(self.data) == 0:
                protein = data.Protein.from_sequence(sequence, **kwargs)
                mol = Chem.MolFromSmiles(smile)
                if not mol:
                    logger.debug("Can't construct molecule from SMILES `%s`. Ignore this sample." % smile)
                    num_samples[_cur_split] -= 1
                    continue
                mol = data.Molecule.from_molecule(mol)
            else:
                protein = None
                mol = None
            if attributes is not None:
                with protein.graph():
                    for field in attributes:
                        setattr(protein, field, attributes[field][i])
            self.data.append([protein, mol])
            self.sequences.append(sequence)
            self.smiles.append(smile)
            for field in targets:
                self.targets[field].append(targets[field][i])

we have to iterate over all samples and construct a protein and a molecule for each one.

If the data comes from biochemical datasets in which most compound-protein combinations are distinct, then the code makes sense.

However, in some special cases, such as kinase profiling datasets, only a few hundred unique protein sequences and compounds are included, while the number of samples is ~100K. In that case this approach is too slow, because the same sequences and SMILES strings are converted over and over.

One suggestion is to keep a dictionary of all proteins and compounds that have already been converted. If a sequence or SMILES string is already in the dictionary, no further computation is required.
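
A minimal sketch of the dictionary idea, under stated assumptions: `build_pairs`, `make_protein` and `make_mol` are hypothetical names used only for illustration and are not part of TorchDrug. In the loop above, `make_protein` would wrap `data.Protein.from_sequence` and `make_mol` would wrap `Chem.MolFromSmiles` followed by `data.Molecule.from_molecule`, returning `None` for an invalid SMILES string.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple


def build_pairs(
    sequences: Sequence[str],
    smiles: Sequence[str],
    make_protein: Callable[[str], Any],
    make_mol: Callable[[str], Optional[Any]],
) -> List[Tuple[Any, Any]]:
    """Build (protein, mol) pairs, converting each unique string only once.

    make_protein and make_mol are hypothetical stand-ins for the real
    constructors; make_mol is assumed to return None for invalid SMILES.
    """
    protein_cache: Dict[str, Any] = {}
    mol_cache: Dict[str, Optional[Any]] = {}
    pairs: List[Tuple[Any, Any]] = []

    for sequence, smile in zip(sequences, smiles):
        # Convert the protein only the first time this sequence is seen.
        if sequence not in protein_cache:
            protein_cache[sequence] = make_protein(sequence)
        # Cache failures (None) as well, so invalid SMILES are parsed once.
        if smile not in mol_cache:
            mol_cache[smile] = make_mol(smile)

        mol = mol_cache[smile]
        if mol is None:
            # Same behaviour as the original loop: skip the sample.
            continue
        pairs.append((protein_cache[sequence], mol))

    return pairs
```

One caveat: the cached objects are shared between samples, so the per-sample `setattr(protein, field, attributes[field][i])` step in the original loop would need a copy of the protein, or separate per-sample attribute storage, whenever attributes differ between samples that share a sequence.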

Oxer11 commented 1 year ago

Hi!

Thanks very much for the useful suggestion! You're right that memory can be saved by loading each protein and ligand only once and using dictionaries to index them. We do want to do this for some other, much larger datasets. For protein-ligand datasets, we think they are small enough to handle even with the current implementation, since we only construct protein sequences rather than structures.

We'll consider how to improve the implementation in future development. Thanks very much!