ProteinLigandDataset.load_sequence optimization

Hi, thanks for your work on this library, especially the TorchProtein part.

I'd like to suggest an optimization for load_sequence. In the current implementation:

    for i, (sequence, smile) in enumerate(zip(sequences, smiles)):
        if i >= cum_num_samples[_cur_split]:
            _cur_split += 1
        if not self.lazy or len(self.data) == 0:
            protein = data.Protein.from_sequence(sequence, **kwargs)
            mol = Chem.MolFromSmiles(smile)
            if not mol:
                logger.debug("Can't construct molecule from SMILES `%s`. Ignore this sample." % smile)
                num_samples[_cur_split] -= 1
                continue
            mol = data.Molecule.from_molecule(mol)
        else:
            protein = None
            mol = None
        if attributes is not None:
            with protein.graph():
                for field in attributes:
                    setattr(protein, field, attributes[field][i])
        self.data.append([protein, mol])
        self.sequences.append(sequence)
        self.smiles.append(smile)
        for field in targets:
            self.targets[field].append(targets[field][i])

This loop iterates over every sample and constructs a protein and a molecule for each one, even when the same sequence or SMILES string has already been processed.

For biochemical datasets in which most samples involve distinct compounds and proteins, this makes sense.

However, in some special cases, such as kinase profiling datasets, only a few hundred unique protein sequences and compounds are included, while the number of samples is ~100K. In those cases this approach is too slow, because the same conversions are repeated many times.

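One suggestion is to keep a dictionary of the proteins and compounds that have already been converted, keyed by sequence and by SMILES string. If an entry has already been converted, it can be reused and no further computation is required.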
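
To illustrate the idea, here is a minimal sketch of the caching pattern. It is not the library's code: convert_with_cache is a hypothetical helper, and fake_protein_from_sequence is a stand-in for an expensive conversion such as data.Protein.from_sequence, used here only so the example runs without TorchDrug or RDKit installed.

```python
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def convert_with_cache(keys: Iterable[K], convert: Callable[[K], V]) -> List[V]:
    """Convert every key, calling `convert` only once per distinct key.

    Hypothetical helper, not part of TorchDrug. Failed conversions that
    return None are cached as well, so they are not retried.
    """
    cache: Dict[K, V] = {}
    results: List[V] = []
    for key in keys:
        if key not in cache:
            cache[key] = convert(key)
        results.append(cache[key])
    return results


# Stand-in for an expensive conversion; it counts how often it really runs.
calls = {"count": 0}


def fake_protein_from_sequence(sequence: str) -> tuple:
    calls["count"] += 1
    return ("protein", sequence)


# Four samples, but only two distinct sequences.
sequences = ["MKV", "MKV", "GSH", "MKV"]
proteins = convert_with_cache(sequences, fake_protein_from_sequence)
```

Inside load_sequence, the same pattern could presumably be applied twice, once with a dictionary keyed by sequence and once with a dictionary keyed by SMILES string. One caveat: cached objects would be shared between samples, so the per-sample attributes that the current loop sets with setattr would need a copy of the cached protein or some other handling.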

DeepGraphLearning / torchdrug

ProteinLigandDataset.load_sequence optimization #138