CLI support for graph transformation 'pipelines'

kamurani commented 2 years ago

I'm aiming to generate protein graphs in bulk in order to then perform unsupervised clustering on them. I would also like to repeat this process on several different proteomes.

I would also like to apply several intermediate steps (e.g. select the subgraph of radius r for each graph, or select a subgraph by an RSA threshold).

So far, I have seen that ProteinGraphDataset retrieves PDB files from a list of IDs (either UniProt or PDB accession codes), downloading them from the PDB or AF2, and that the 'intermediate steps' can be achieved by supplying functions to the graph_transformation_funcs parameter.

However, I would like to use a subset of a proteome (a list of IDs) together with an existing set of .pdb files in a directory, rather than downloading them again. Is there a more elegant way to do this, along the lines of the existing command line interface?

I was thinking that some sort of 'pipeline' could be written as a CLI command, perhaps by providing:

- path to a file containing a list of protein IDs
- path to a directory containing structures (also where new ones will be downloaded if required)
- which database to use if UniProt IDs are used (e.g. swissprot or AF2)
- path to a config.yml file for graph construction
- path to a graph_processing.yml file detailing a list of functions to apply (e.g. subgraph selection)
- output path for graphs (can specify format, e.g. nx.Graph or pyg)
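To make the list above concrete, here is a minimal sketch of what such a command's argument parser might look like, using only the standard library. Every flag name, the program name `graph-pipeline`, and the `read_ids` helper are hypothetical illustrations of the proposal; none of them are part of the existing graphein CLI.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical 'pipeline' command (names are illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="graph-pipeline",
        description="Sketch of a bulk protein graph construction pipeline.",
    )
    parser.add_argument("--ids", required=True,
                        help="File containing one protein ID per line.")
    parser.add_argument("--structure-dir", required=True,
                        help="Directory with existing structures; new downloads go here.")
    parser.add_argument("--database", choices=["swissprot", "af2"], default="af2",
                        help="Database to use when UniProt IDs are supplied.")
    parser.add_argument("--config", default=None,
                        help="Path to config.yml for graph construction.")
    parser.add_argument("--processing", default=None,
                        help="Path to graph_processing.yml listing functions to apply.")
    parser.add_argument("--output", required=True,
                        help="Output path for the generated graphs.")
    parser.add_argument("--format", choices=["nx", "pyg"], default="nx",
                        help="Output graph format.")
    return parser


def read_ids(path: str) -> list[str]:
    """Read protein IDs from a text file, skipping blank lines and '#' comments."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().startswith("#")]
```

A call such as `build_parser().parse_args(["--ids", "ids.txt", "--structure-dir", "pdbs", "--output", "graphs"])` would then yield a namespace that the rest of the pipeline could consume.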

This is just my naive idea for now, and I haven't fleshed out exactly how it would work, but maybe there could be a way to describe 'transformations' in a processing.yml file, similar to how the ProteinGraphConfig parser works?

I think a framework that allows people to script pipelines (like the one I am trying to make) from the command line would make experimentation easier and simpler than building everything in Python from the low-level functions.

Would appreciate any thoughts on this!

a-r-j commented 2 years ago

Hi @cimranm, great suggestion!

To address your immediate problem, I think you can try just passing the filenames (no extension) as the pdb_code arg in ProteinGraphDataset. The download is only triggered if the files are not found in the DATA_DIR/raw directory, so if you place your PDBs there, it should behave how you want it to.
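As a rough sketch of that workaround, the helper below copies existing .pdb files into a `raw` subdirectory and returns their names without the extension. The function name `stage_local_structures` and its arguments are hypothetical; only the `raw` directory name and the "filename without extension" convention come from the comment above.

```python
import shutil
from pathlib import Path


def stage_local_structures(src_dir: str, data_dir: str) -> list[str]:
    """Copy *.pdb files from src_dir into data_dir/raw and return their stems.

    Hypothetical helper: the returned stems are what would be passed as the
    list of IDs, so that no download is needed for files already present.
    """
    raw_dir = Path(data_dir) / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)

    ids = []
    for pdb_file in sorted(Path(src_dir).glob("*.pdb")):
        target = raw_dir / pdb_file.name
        if not target.exists():  # leave files that are already staged untouched
            shutil.copy2(pdb_file, target)
        ids.append(pdb_file.stem)  # filename without the .pdb extension
    return ids
```

Copying rather than moving keeps the original directory intact. For large proteomes, symlinks might be preferable, though whether they are followed correctly would need checking.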

With respect to pipelining, I think this would be a great feature (and not too tricky to implement). It should be quite straightforward to write a parser for the transformation functions from a YAML file (see: https://stackoverflow.com/questions/67442071/passing-python-functions-from-yaml-file).
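One way such a parser could work is sketched below, assuming the YAML file has already been loaded (for example with PyYAML's `yaml.safe_load`) into a list of entries. The `function` and `kwargs` keys are a hypothetical schema, and standard-library builtins stand in for real graph transformation functions so that the example runs on its own.

```python
import importlib
from functools import partial
from typing import Any, Callable


def resolve_callable(dotted_path: str) -> Callable:
    """Import and return the object named by a 'package.module.attr' string."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def build_transforms(spec: list[dict]) -> list[Callable]:
    """Turn parsed config entries into callables with their kwargs bound."""
    transforms = []
    for entry in spec:
        func = resolve_callable(entry["function"])
        kwargs = entry.get("kwargs") or {}  # tolerate a missing or null mapping
        transforms.append(partial(func, **kwargs))
    return transforms


def apply_transforms(obj: Any, transforms: list[Callable]) -> Any:
    """Apply each transform in order, feeding each output into the next."""
    for transform in transforms:
        obj = transform(obj)
    return obj


# Stand-in for the parsed contents of a hypothetical processing.yml.
spec = [
    {"function": "builtins.sorted", "kwargs": {"reverse": True}},
    {"function": "builtins.list"},
]
```

Here `apply_transforms([3, 1, 2], build_transforms(spec))` sorts the list in descending order. In a real pipeline the dotted paths would point at graph transformation functions instead. Since this imports whatever the config names, it is only appropriate for trusted configuration files.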

I can provide some support and help implement some of this if you're keen to build this feature. I don't have the bandwidth at the moment to pick this up on my own though.

kamurani commented 2 years ago

Sure, I've already built something like this for my own use case so would be happy to figure out an elegant way to make it generalisable and add it to the graphein CLI. Will let you know if I'm stuck!

a-r-j / graphein

CLI support for graph transformation 'pipelines' #195