Open peiyaoli opened 1 year ago
Hi!
Thanks very much for the useful suggestion! You're right that memory can be saved by loading proteins and ligands once and using dictionaries to index them. We do want to do this for some other huge datasets. For protein-ligand datasets, we think they are small enough to handle even with the current implementation, since we only construct protein sequences rather than structures.
We'll consider how to improve the implementation in future development. Thanks very much!
Hi, thanks for your work on this library, especially the TorchProtein part.
One suggestion would be to optimize `load_sequence`. In the current implementation, we have to iterate over all the samples and generate the `protein` and `mol` for each one. If the data comes from a biochemical dataset in which compounds and proteins appear in many different combinations, then the code makes sense.
However, in some special cases, such as kinase profiling datasets, only a few hundred protein sequences and drug compounds are included, while the number of samples is ~100K. In that case this approach is too slow, because the same sequences and compounds are converted over and over.
One suggestion is to create a dictionary for all proteins and compounds. If an entry has already been converted, no further computation is required.
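To illustrate the idea, here is a minimal sketch of dictionary-based caching. It does not use the library's actual conversion code: `build_cached`, `fake_protein`, and `fake_mol` are hypothetical names, and the two `fake_*` functions stand in for whatever the real per-sample protein and molecule construction does.

```python
def build_cached(raw_items, convert):
    """Convert each distinct raw item once and reuse the result.

    Returns the per-sample list of converted objects and the cache itself.
    """
    cache = {}
    converted = []
    for raw in raw_items:
        if raw not in cache:
            # First time this sequence / SMILES string is seen: convert it.
            cache[raw] = convert(raw)
        # Every later occurrence is a dictionary lookup only.
        converted.append(cache[raw])
    return converted, cache


# Hypothetical stand-ins for the expensive conversions (assumption:
# the real code builds protein and molecule objects from strings).
calls = {"protein": 0, "mol": 0}


def fake_protein(sequence):
    calls["protein"] += 1
    return ("protein", sequence)


def fake_mol(smiles):
    calls["mol"] += 1
    return ("mol", smiles)


# Four samples, but only two distinct proteins and two distinct compounds.
sequences = ["MKV", "MKV", "GSH", "MKV"]
smiles = ["CCO", "c1ccccc1", "CCO", "CCO"]

proteins, protein_cache = build_cached(sequences, fake_protein)
mols, mol_cache = build_cached(smiles, fake_mol)
```

With this approach the number of conversions scales with the number of distinct proteins and compounds rather than with the number of samples. One caveat: samples that share a key also share the same object, so this is only safe if the converted objects are not modified in place afterwards.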