Assuming the worst-case scenario (using all of the UniProt predictions, approx. 200M structures IIRC), mapping graph/node labels into Data/Protein objects would have to be done when we get a structure, so that they're not all held in memory (200M node label tensors seems prohibitive). I think the way to go is to connect this to an (optional) LMDB, which could also store additional pre-computed features. Then, when we get a structure, we pull in these additional data and store them in the returned Data/Protein.
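For what it's worth, here is a rough sketch of the lazy lookup I have in mind. It is only an illustration: it uses the standard-library `shelve` module as a stand-in for LMDB so that it runs without extra dependencies, and `Protein`, `LazyLabelledDataset`, `_load_structure` and the `extras` field are all hypothetical names, not the actual API.

```python
import os
import shelve
import tempfile
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Protein:
    """Hypothetical stand-in for the Data/Protein object returned per structure."""
    name: str
    coords: List[List[float]]
    extras: Dict[str, Any] = field(default_factory=dict)


class LazyLabelledDataset:
    """Attach labels/features when a structure is fetched, not up front."""

    def __init__(self, names: List[str], store_path: Optional[str] = None):
        self.names = list(names)
        self.store_path = store_path  # the on-disk store is optional
        self._store = None            # opened on first use, not at construction

    def _load_structure(self, name: str) -> Protein:
        # Placeholder: the real version would decompress one FoldComp entry.
        return Protein(name=name, coords=[[0.0, 0.0, 0.0]])

    def _lookup(self, name: str) -> Dict[str, Any]:
        # Read one record from disk; nothing is cached in memory.
        if self.store_path is None:
            return {}
        if self._store is None:
            self._store = shelve.open(self.store_path, flag="r")
        return self._store.get(name, {})

    def close(self) -> None:
        if self._store is not None:
            self._store.close()
            self._store = None

    def __len__(self) -> int:
        return len(self.names)

    def __getitem__(self, idx: int) -> Protein:
        name = self.names[idx]
        protein = self._load_structure(name)
        protein.extras.update(self._lookup(name))  # labels + pre-computed features
        return protein


# Usage: write a tiny store, then fetch structures with labels attached lazily.
with tempfile.TemporaryDirectory() as tmp:
    store_path = os.path.join(tmp, "labels")
    with shelve.open(store_path) as db:
        db["P12345"] = {"node_labels": [0, 1, 1]}

    dataset = LazyLabelledDataset(["P12345", "Q67890"], store_path)
    first = dataset[0]   # has node_labels from the store
    second = dataset[1]  # no entry in the store, so extras stays empty
    dataset.close()
```

Memory use then scales with the list of names rather than with the labels themselves, and structures without an entry in the store simply come back without extras.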
FWIW, I see this functionality as complementary to the other strand of dataset creation we've been doing in #272. Essentially, I think a model workflow looks like: make a dataset selection with a Manager -> instantiate a FoldCompDataset -> wrap it in a LightningModule (optional).
I also saw that, as of 0.0.3, FoldComp supports multi-chain structures. I'm not sure whether this extends support to "real" PDB files (i.e. from the PDB), but if it does, this is something to strongly consider in #272 as an export option.
Originally posted by @a-r-j in #284