Open kamurani opened 2 years ago
Hi @cimranm, great suggestion!
To address your immediate problem, I think you can try just passing the filenames (no extension) as the `pdb_code` arg in `ProteinGraphDataset`. The download is only triggered if the files are not found in the `DATA_DIR/raw` directory, so if you place your PDBs there it should behave how you want it to.
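For example, something along these lines (a minimal, untested sketch; it assumes the `graphein.ml` import path and the keyword names `pdb_codes` and `graphein_config` match your installed version):

```python
from pathlib import Path

from graphein.ml import ProteinGraphDataset
from graphein.protein.config import ProteinGraphConfig

# PDB files already on disk, e.g. data/raw/P12345.pdb, data/raw/Q67890.pdb, ...
data_dir = Path("data")
local_ids = [p.stem for p in (data_dir / "raw").glob("*.pdb")]

# Default construction settings; swap in your own config as needed.
config = ProteinGraphConfig()

# Because the files already exist under data/raw/, no download should be
# triggered for these identifiers.
ds = ProteinGraphDataset(
    root=str(data_dir),
    pdb_codes=local_ids,
    graphein_config=config,
)
```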
With respect to pipelining, I think this would be a great feature (and not too tricky to implement). It should be quite straightforward to write a parser for the transformation functions from a YAML file (see: https://stackoverflow.com/questions/67442071/passing-python-functions-from-yaml-file).
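A rough sketch of what that parser could look like (the YAML layout and helper names here are hypothetical, and the transform listed is just an illustration, not a tested call):

```python
# Hypothetical graph_processing.yml:
#
# transforms:
#   - function: graphein.protein.subgraphs.extract_surface_subgraph
#     kwargs:
#       rsa_threshold: 0.2
#   - function: my_module.my_custom_transform
#     kwargs: {}

import importlib
from functools import partial
from typing import Callable, List

import yaml


def resolve_function(dotted_path: str) -> Callable:
    """Resolve 'package.module.func' to the callable it names."""
    module_path, func_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, func_name)


def load_transforms(path: str) -> List[Callable]:
    """Build a list of graph -> graph callables from a YAML spec."""
    with open(path) as f:
        spec = yaml.safe_load(f)
    return [
        partial(resolve_function(entry["function"]), **(entry.get("kwargs") or {}))
        for entry in spec.get("transforms", [])
    ]
```

The resulting list could then be handed straight to something like `graph_transformation_funcs`.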
I can provide some support and help implement some of this if you're keen to build this feature. I don't have the bandwidth at the moment to pick this up on my own though.
Sure, I've already built something like this for my own use case so would be happy to figure out an elegant way to make it generalisable and add it to the graphein CLI. Will let you know if I'm stuck!
I'm aiming to generate protein graphs in bulk in order to then perform unsupervised clustering on them. I would also like to repeat this process on several different proteomes.
I would also like to apply several intermediate steps (e.g. select the subgraph of radius `r` for each graph; select the subgraph above an `rsa` threshold).

So far, I have seen that `ProteinGraphDataset` retrieves PDB files from a list of `id`s (either UniProt or PDB accession codes) and downloads them from PDB or AF2, and the 'intermediate steps' can be achieved by supplying functions to the `graph_transformation_funcs` parameter.

However, I would like to use a subset of a proteome (list of IDs) and an already existing set of `.pdb` files in a directory (as opposed to downloading them again). Would it be possible for a more elegant solution to exist in a similar fashion to the existing command line interface?

I was thinking that some sort of 'pipeline' could be written as a CLI command, perhaps by providing:

- a `config.yml` file for graph construction
- a `graph_processing.yml` file detailing a list of functions to apply (e.g. subgraph selection)

This is just my naive idea for now, I haven't fleshed out exactly how it would work; but maybe a way to describe 'transformations' in a `processing.yml` file in a similar way to the `ProteinGraphConfig` parser? I think a framework that allows people to script pipelines (like the one I am trying to make) from the command line would allow for ease of experimentation and simplicity, compared to making it all in Python using the low-level functions.
Would appreciate any thoughts on this!
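For context, the kind of pipeline I'm picturing looks roughly like this in Python today (a rough, untested sketch; the subgraph helpers and their keyword arguments are assumed from `graphein.protein.subgraphs` and may not match the actual signatures, and the identifiers are placeholders):

```python
from functools import partial

from graphein.ml import ProteinGraphDataset
from graphein.protein.config import ProteinGraphConfig
from graphein.protein.subgraphs import (
    extract_subgraph_from_point,
    extract_surface_subgraph,
)

transforms = [
    # keep residues within 10 Å of a point of interest (e.g. a ligand centroid)
    partial(extract_subgraph_from_point, centre_point=(0.0, 0.0, 0.0), radius=10.0),
    # keep solvent-exposed residues above an RSA threshold
    partial(extract_surface_subgraph, rsa_threshold=0.2),
]

ds = ProteinGraphDataset(
    root="data",
    pdb_codes=["1abc", "2xyz"],  # placeholder identifiers
    graphein_config=ProteinGraphConfig(),
    graph_transformation_funcs=transforms,
)
```

The RSA-based step would presumably also need RSA to be computed at construction time (e.g. via the DSSP node-metadata functions in the config), which is exactly the kind of coupling a `processing.yml` would have to express.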