ericmjl opened 4 years ago
Hi Eric, thanks for writing this up. It's super comprehensive! You definitely took better notes than I did.
I'm going to start refactoring the protein graph component this week into the functional paradigm outlined here. I don't think it'll be too taxing. I think we can continue on the dev branch. I'll make a PR to add the basic plotting functionality that exists there to the extant version of graphein, so that it's available to people already using graphein until we're happy to share v0.1.0!
I've added some checkboxes to your comment so we can keep track of where we're at. I'll ping you back once it's in a state to start writing some tests.
I'm really excited to work with you on this! It's a great chance to learn and hopefully we can build something much more useful for the community.
@a-r-j I made a call to stick those three helper functions inside NetworkX - I think this might help simplify Graphein a bit more, so we can have laser-like focus on functions that take a node and return a pandas Series of amino acid scales. (Here's the PR to NetworkX, btw.)
One resource a colleague pointed me to is EXPASY's amino acid scales. There, we have a ton of amino acid descriptors (something like 50+), which can be used to generate an amino acid node feature set. I put the script up as a gist; it took us about an hour to parse the webpages, but eventually "webpage == API" became a reality :smile:. I think it should be fair game to bundle up the CSV file inside Graphein (kinda like, "just distribute it"), and then use that CSV file in conjunction with the following node-level function:
```python
import pandas as pd

# Scales table bundled with Graphein: rows are scale names,
# columns are three-letter amino acid codes.
aa_feats = pd.read_csv("amino_acid_properties.csv", index_col=0)

def featurize_amino_acid(n, d, aa_feats):
    """Return a numeric Series of scales for node `n`, given its data dict `d`."""
    aa = d["residue_name"]
    feats = pd.Series(aa_feats[aa], name=n)
    return feats
```
We can also break up the function a bit, making it configurable so end-users can decide what features should be used to describe an amino acid. With the generate_X functions in NetworkX, and with this baseline, I think it'll be a great starting point, and the interoperability with the rest of the PyData ecosystem can really strengthen the impact of the package.
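To make the feature choice configurable, one light-touch option (a sketch only; the `features` argument and the tiny sample scales table below are hypothetical, not Graphein API) is to add a `features` argument and partial it out:

```python
from functools import partial

import pandas as pd

# Hypothetical stand-in for the bundled amino_acid_properties.csv:
# rows are scale names, columns are three-letter amino acid codes.
aa_feats = pd.DataFrame(
    {"ALA": [1.8, 88.6], "MET": [1.9, 162.9]},
    index=["hydrophobicity", "volume"],
)

def featurize_amino_acid(n, d, aa_feats, features=None):
    """Return a numeric Series of scales for node `n`.

    `features` lets the end-user choose which scales describe
    an amino acid; None means "use them all".
    """
    feats = pd.Series(aa_feats[d["residue_name"]], name=n)
    if features is not None:
        feats = feats.loc[features]
    return feats

# Partial out the lookup table and the feature choice for reuse
# as a uniform (n, d) -> Series node function:
hydrophobicity_only = partial(
    featurize_amino_acid, aa_feats=aa_feats, features=["hydrophobicity"]
)
```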
@ericmjl I've added the datafile for the Meiler embeddings and a loading function. Something we need to think about is how we handle non-standard/modified amino acids. Those that I'm aware of:
MSE,PTR,TPO,SEP,KCX,LLP,PCA,CSO,CAS,CAF,CSD
Biological data is never as straightforward as we'd like 😅
What I've done previously is to use the values for the 'parent' amino acid (e.g. using the values for MET for MSE). This is obviously a bit dirty and not entirely valid, but I'm not sure what else we can do short of generating the values for the non-standard AAs. Perhaps we could provide some options for that workaround: initialising the features with 0s, or with the mean values.
In any case, we'll need a way to handle non-standard residues in the lookup functions to avoid throwing errors.
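One way those fallback options could look (a sketch; the function name is hypothetical, and the parent mapping below is deliberately incomplete, covering only a few of the modified residues listed above):

```python
import pandas as pd

# Hypothetical parent mapping for a few modified residues,
# following the "use the parent amino acid" workaround (MSE -> MET, etc.).
NON_STANDARD_PARENTS = {"MSE": "MET", "PTR": "TYR", "TPO": "THR", "SEP": "SER"}

# Tiny sample scales table: rows are scales, columns are residue codes.
aa_feats = pd.DataFrame(
    {"MET": [1.9, 162.9], "SER": [-0.8, 89.0]},
    index=["hydrophobicity", "volume"],
)

def lookup_residue(res, aa_feats, strategy="parent"):
    """Look up features for a residue, with a fallback for non-standard ones.

    strategy: "parent" reuses the parent amino acid's values,
    "zero" initialises with 0s, "mean" with the column-wise mean.
    """
    if res in aa_feats.columns:
        return aa_feats[res]
    if strategy == "parent" and res in NON_STANDARD_PARENTS:
        return aa_feats[NON_STANDARD_PARENTS[res]].rename(res)
    if strategy == "zero":
        return pd.Series(0.0, index=aa_feats.index, name=res)
    if strategy == "mean":
        return aa_feats.mean(axis=1).rename(res)
    raise KeyError(f"No feature lookup available for residue {res!r}")
```

The explicit `raise` at the end is the "avoid throwing errors" point in reverse: unknown residues fail loudly at lookup time rather than silently producing nulls downstream.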
> Biological data is never as straightforward as we'd like 😅
:joy: Biology breaks all rules indeed! 😅
I wonder what you think about this: for now, we stick to just the standard amino acids? Imposing this first constraint helps with shipping something useful, which is good for publication purposes too, before expanding out to more edge-y cases.
Are non-standard amino acids common in PDB structures? Just wondering; if they're only a small fraction, then leaving them temporarily out of scope while we give more thought to this special category of amino acids, and let time help crystallize out its relations to standard amino acids, could help avoid local minima traps with code structure/programming idioms.
If it helps, I've been thinking about how to handle non-standard amino acids in jax-unirep as well. Because non-standard a.a.s are a human-generated thing, like all human-generated things there's going to be a ton of ways other humans won't conform to "standard".
I think it's less of a problem for human proteins.
My understanding is that they're more common (but still relatively rare) in bacteria and archaea. I think there are two main sources for them:
I don't have a full set of numbers for how common they are, but as I said, I've run into this problem on certain standard datasets I've worked on like PDBBind. This page says there are >700 structures with D-amino acids and 5 with PYL. That's probably a good argument for ignoring them for the time being.
I guess the easiest solution is to flag these proteins and users can decide whether or not they are happy to exclude them from their datasets.
If we're happy to proceed on the basis of ignoring them, I'll add a `valid_protein()` function to do some sanity checks on the structures/sequence before the graph construction
Genetically encoded (CSE, PYL), sometimes called the 21st and 22nd amino acids, which are present in all domains of life IIRC. These would be more of a priority.
I learned something new today! That and D-amino acids being present in crystal structures too. This is cool. Is there a special representation in PDB files for a D-amino acid? If so, it's worth adding a featurization function called "amino_acid_chirality", perhaps?
> If we're happy to proceed on the basis of ignoring them, I'll add a `valid_protein()` function to do some sanity checks on the structures/sequence before the graph construction
I think that'll help, especially in the early stages when the package is a bit more constrained in scope.
I had a quick look at the PDB file for 5E5T. It seems they define the D-amino acids as HETATMs in the header:
```
HETNAM DSN D-SERINE
HETNAM DSG D-ASPARAGINE
HETNAM DPN D-PHENYLALANINE
HETNAM DCY D-CYSTEINE
HETNAM DAS D-ASPARTIC ACID
HETNAM DLY D-LYSINE
HETNAM DLE D-LEUCINE
HETNAM DAR D-ARGININE
HETNAM DAL D-ALANINE
HETNAM DTY D-TYROSINE
HETNAM DIL D-ISOLEUCINE
HETNAM DGL D-GLUTAMIC ACID
HETNAM DVA D-VALINE
HETNAM DPR D-PROLINE
HETNAM DTH D-THREONINE
HETNAM DHI D-HISTIDINE
HETNAM FMT FORMIC ACID
HETNAM EDO 1,2-ETHANEDIOL
HETSYN EDO ETHYLENE GLYCOL
```
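For concreteness, a minimal sketch of pulling these records out of a PDB header; it assumes single-line HETNAM names (real HETNAM records can wrap across numbered continuation lines, which this ignores) and flags D-amino acids purely by their chemical name:

```python
# Sample header lines, taken from the listing above.
HEADER = """\
HETNAM DSN D-SERINE
HETNAM DAL D-ALANINE
HETNAM EDO 1,2-ETHANEDIOL
"""

def parse_hetnams(pdb_text):
    """Return {het_code: chemical_name} from HETNAM records."""
    names = {}
    for line in pdb_text.splitlines():
        if line.startswith("HETNAM"):
            # Split into record name, het code, and the remaining name text.
            fields = line.split(None, 2)
            names[fields[1]] = fields[2]
    return names

def d_amino_acids(hetnams):
    """Het codes that look like D-amino acids, judged by their names."""
    return {code for code, name in hetnams.items() if name.startswith("D-")}
```

Filtering on the name prefix is a heuristic; cross-checking codes against the PDB's chemical component dictionary would be the robust version.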
The PDB maintains a list of chemical entities and their abbreviations, so it should be a fairly straightforward lookup. Hopefully, these conventions are standard in practice. We'd probably want to label non-standard residues differently from the other HETATMs in the dataframe to allow for more nuanced filtering, e.g. keeping non-standard residues but removing all the water molecules. In any case, I think we should probably keep one of those dictionaries in graphein for easy conversion between PDB ligand identifiers and SMILES/InChI, though it's not a priority at this point.
On that page there's also a list of the PDB structures that contain each ligand which might be useful, though I'm not sure how regularly updated it is.
Lemme raise an issue for that point you made above, so it's trackable in the pseudo-todo list that is the Issue tracker.
Hey, @ericmjl. Happy Holidays!
I'm back on Graphein (mostly) full-time now. I've added some functionality for computing sequence-level embeddings from some pre-trained language models (ESM, BioVec). I plan to add a bunch from other sources (e.g. DSSP, propy, etc.). @cch1999 contributed some edge computation functions, so we have an MVP of the protein graph construction now.
The outstanding minor tasks atm are:
I think this should be fairly quick - I should be able to wrap these things up this week. I think we're getting to the point where we need to start considering edge cases a bit more closely (e.g. non-standard amino acids). These will break a number of the feature computations if we don't handle them.
We've also had a great contribution from @rvinas to the main branch - we now have support for creating PPI graphs (and I added RNA graphs a while back). Next week I'll look at porting this over to the new API.
Beyond this, there are a number of accessory features I think would be useful:
Keen to hear what you think about some of these things!
Hello @a-r-j! Great to hear from you again. I remember you headed down to Spain/Portugal; are you enjoying your time there?
Looking at the list, this is really cool! I have a few suggestions to make the list even more awesome, hope you don't mind!
Dataframe type hinting: I'd suggest we use pandera. It's got all of the things you're looking for from DataEnforce, plus more.
Subsetting structures: Depending on what category of subsetting we're going for, there's a bit of complexity to untangle.
I once wrote a function that uses NetworkX's API and grabs out a radius of nodes from a central starter node. I can provide it here if that's helpful. The API looks something like this:
```python
import networkx as nx

def subset(G: nx.Graph, central_node, radius: int) -> nx.Graph:
    # Stuff happens resulting in a subset of nodes being identified
    # (one way: nx.ego_graph collects all nodes within `radius` hops
    # of the central node), and then...
    subset_nodes = nx.ego_graph(G, central_node, radius=radius).nodes()
    return G.subgraph(subset_nodes)
```
That is the cleanest "subsetting" function I can think of, but it also risks being not biologically relevant. If you're game for it, I'll submit a small PR to stick that in an appropriate place in the library.
There are others racing through my head right now, but they're not "clean" implementations. Finding homologous structures is one thread of thought, but I'm not quite sure how to accomplish this using just graph objects.
I'm back in (the other) Cambridge these days. Missing the sunshine but it's not much to complain about compared to a New England winter!
Happy to go with pandera - I'll check it out!
I've added the distance-based edges from your library now. I took the liberty of adding edge constructors based on k-nearest neighbours and Euclidean distance. I also added a `long_interaction_threshold` parameter. The idea is that we might sometimes want to ignore edges between residues that are close in sequence space.
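To make the idea concrete, here's a rough sketch of how a distance-based edge constructor with a `long_interaction_threshold` could behave. The function name, and using node-list order as a proxy for sequence position, are assumptions rather than Graphein's actual implementation:

```python
import networkx as nx
import numpy as np

def add_distance_edges(G, coords, threshold=8.0, long_interaction_threshold=0):
    """Connect residues whose coordinates fall within `threshold` Angstroms.

    Pairs that are `long_interaction_threshold` or fewer positions apart
    in sequence space are skipped, so only "long-range" contacts remain.
    Assumption: node-list order approximates sequence order.
    """
    nodes = list(G.nodes())
    X = np.asarray([coords[n] for n in nodes], dtype=float)
    # Pairwise Euclidean distance matrix via broadcasting.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if D[i, j] <= threshold and (j - i) > long_interaction_threshold:
                G.add_edge(
                    nodes[i], nodes[j], kind="distance", distance=float(D[i, j])
                )
    return G
```

With `long_interaction_threshold=0` everything within the distance cutoff is connected; raising it drops the near-in-sequence contacts.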
Hey @a-r-j, just pinging back here as promised yesterday with what probably could be considered a design doc for Graphein. I'm excited about this! I also know that my memory is gradually degrading as I get older, so please correct me if I'm wrong anywhere, and don't hesitate to directly edit my post.
Authors
Introduction
Graphein is an awesome idea: basically, a package that flexibly handles the construction of graph representations of biological data, with the goal of democratizing access to biological graph data for all (machine learners and biologists alike).
In its current state, the package might be too opinionated on what backing packages are used, which affects the portability of the data structures. We would like to propose an improved software design that leverages the ecosystem of the Python data science stack.
Graph Preliminaries
Graphs can be represented in both object form and array form, and this gives us different things.
In the object form, we have node lists and metadata, as well as edge lists and metadata. NetworkX graph objects are one example of this. The object form is rich and flexible - it can store information of arbitrary depth. It can store arbitrary key-value paired metadata. Any hashable object can be a node, as long as the hashable object shows up only once in the data structure. Metadata values can be any Python object. This level of flexibility allows us to construct very human-readable graphs in Python.
In the array form, we have node feature arrays and adjacency matrix-like matrices. This pair of objects is more restricted. For downstream deep learning purposes, the data types of these arrays have to be numeric in nature. Nodes aren't easily indexed by a string name; integer indexing is required. Human auditability of the internal data structure is hampered by the fact that the data are represented in numbers. The metadata are also not easily indexed by string names. That said, the array form is structured and efficient for computation.
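A tiny illustration of the two forms using plain NetworkX and NumPy:

```python
import networkx as nx

# Object form: string-named nodes carrying arbitrary key-value metadata.
G = nx.Graph()
G.add_node("A:GLY:1", residue_name="GLY")
G.add_node("A:ALA:2", residue_name="ALA")
G.add_edge("A:GLY:1", "A:ALA:2", kind="peptide_bond")

# Array form: integer-indexed and purely numeric, ready for DL frameworks.
# Fixing a node order pins down the string -> integer mapping.
node_order = list(G.nodes())
A = nx.to_numpy_array(G, nodelist=node_order)
```

The `node_order` list is the bridge between the two worlds: it is what lets you map row `i` of the adjacency matrix back to a human-readable node name.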
In terms of a principled data preparation flow:
Design goals and implications
Implication: Target NetworkX graphs, Pandas DataFrames, XArray DataArrays before dispatching/converting to specialized data structures like DGL graphs. This is in line with democratization, as these are the idiomatic and easily accessible data structures of the PyData community.
Implication: Cleanly separate data loading from calculation of derivative data (e.g. things that can be looked up in a lookup table, or other simple/complex calculations), and avoid mixing them up in the same step.
Implication: Simplify the installation instructions to follow PyData community idioms as much as possible (e.g. `conda install graphein`).
(`black`) to construct these graphs.

Implication: APIs that work at different levels are needed: a high-level API for ease-of-use that targets an opinionated subset of the range of possible edge types, adjacency-like matrices, and node features, leveraging a lower-level API that power users can drop down to for customizing how they construct a graph for their biological data. This will prevent Graphein from being a god-like package that makes things very easy for end-users but hamstrings power users from customizing things.
Lower Level API
The key categories of steps that we probably need to be concerned about here are as follows:
Together, these probably form the low-level API, on which the high-level convenience API can be built, which essentially provides a mapping from keyword arguments to function call dispatches underneath the hood.
In a bit more detail, I'd like to propose additional structure for the steps.
Construction of NetworkX graphs from PDB DataFrames
This should be opinionated, and only have two flags: whether to use an "atom" graph, or whether to use a "biological unit" graph (amino acids for proteins, nucleotides for RNA/DNA).
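A minimal sketch of what that two-way flag could look like; the function name and the DataFrame column names (mirroring BioPandas-style columns) are assumptions, not settled API:

```python
import networkx as nx
import pandas as pd

def construct_graph(pdb_df, granularity="residue"):
    """Build a NetworkX graph from a PDB-style DataFrame.

    granularity: "atom" -> one node per atom,
    "residue" -> one node per biological unit (amino acid / nucleotide).
    """
    G = nx.Graph()
    if granularity == "atom":
        for _, row in pdb_df.iterrows():
            G.add_node(row["atom_serial"], **row.to_dict())
    elif granularity == "residue":
        for (chain, resnum, resname), _ in pdb_df.groupby(
            ["chain_id", "residue_number", "residue_name"]
        ):
            G.add_node(f"{chain}:{resname}:{resnum}", residue_name=resname)
    else:
        raise ValueError(f"Unknown granularity: {granularity!r}")
    return G
```

Edge construction is deliberately absent here: under the proposed design it belongs to a separate, composable step.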
Annotating node metadata
The pattern would be a dispatching function, possibly named `annotate_node_metadata(G, funcs)`. `funcs` is a list of functions that gets looped over and called on each node. Each func modifies `G` in-place, and has a uniform signature of `(G, n)`.

Now, we can have a parent function that loops over a bunch of `add_some_edge_type(G)`-class functions. In a functional programming paradigm, these can be partialled with defaults for a bunch of default configuration settings. If a custom function relies on external data, that can be partialled out too.
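Sketched out, the dispatching pattern might look like this (the helper functions and the scale value are illustrative, not Graphein code):

```python
from functools import partial

import networkx as nx

def annotate_node_metadata(G, funcs):
    """Dispatcher: loop over funcs, calling each on every node in-place."""
    for func in funcs:
        for n in G.nodes():
            func(G, n)
    return G

# Each func has the uniform signature (G, n) and mutates G in-place.
def add_residue_flag(G, n):
    G.nodes[n]["is_standard"] = True

# External data gets partialled out so the signature stays (G, n):
def add_scale(G, n, scales):
    G.nodes[n]["hydrophobicity"] = scales[G.nodes[n]["residue_name"]]

kd = {"GLY": -0.4}  # tiny Kyte-Doolittle-style lookup, for illustration
funcs = [add_residue_flag, partial(add_scale, scales=kd)]
```

The same shape works for edge functions: a parent loop over `add_some_edge_type(G)`-style callables, each partialled with its configuration.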
In this design, we again ask that functions have a uniform signature and return a pandas Series. Then, construction of a DataFrame can happen. The values of the Series should be numeric (float or int) and non-null. This guarantees that the DataFrame contains only numeric values, and we can then freely convert between the dataframe form and the array form.
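A small sketch of enforcing that contract when assembling the DataFrame (the function name is hypothetical):

```python
import numpy as np
import pandas as pd

def build_feature_dataframe(series_list):
    """Assemble per-node Series into a node-feature DataFrame.

    Enforces the contract above: every value must be numeric and
    non-null, so the frame converts cleanly to the array form.
    """
    df = pd.DataFrame(series_list)  # each Series becomes one row, named by node
    if df.isnull().any().any():
        raise ValueError("Node features must not contain nulls.")
    if not all(np.issubdtype(dt, np.number) for dt in df.dtypes):
        raise ValueError("Node features must be numeric (float or int).")
    return df
```

Once validated, `df.to_numpy()` gives the array form, with `df.index` preserving the node-name-to-row mapping.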
Construction of adjacency-like matrices
Now, we explore how we can construct the adjacency-like matrices.
It's possible to construct a wide range of adjacency matrices on which message passing can be performed. For example, one might want to do message passing on the 1-degree adjacency matrix, or do it on a 2-degree adjacency matrix where anything within two degrees of separation is given a "1". One may want to leverage the graph Laplacian matrix too. As such, rather than just a matrix, we might want to generate an adjacency tensor. A proposed function we could include in the library, or propose to the NetworkX library, is as follows:
In doing this, we generate an XArray 3D tensor of adjacency-like matrices. A prerequisite is that each adjacency matrix generated from `func` should be an XArray DataArray with one tensor dimension called "name", which houses the name of the adjacency-like matrix. This conversion can be made easier by providing one more function.

Now, an end-user can calculate a NumPy array any way they like, then name it easily and stick it into a uniform data container, such as an XArray DataArray, that allows them to inspect it.
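A rough sketch of the stacking idea using plain NumPy (the function names are hypothetical; with xarray installed, the stacked result could be wrapped in a DataArray whose "name" dimension carries the matrix names, as proposed above):

```python
import networkx as nx
import numpy as np

def generate_adjacency_tensor(G, funcs):
    """Stack named adjacency-like matrices into one 3D tensor.

    Each func maps a graph to a (name, matrix) pair; the names list
    plays the role of the proposed "name" tensor dimension.
    """
    names, mats = zip(*(func(G) for func in funcs))
    return list(names), np.stack(mats)

def adjacency(G):
    return "adjacency", nx.to_numpy_array(G)

def laplacian(G):
    # laplacian_matrix returns a SciPy sparse matrix; densify for stacking.
    return "laplacian", nx.laplacian_matrix(G).toarray()
```

Because every matrix must share the graph's node ordering and shape, `np.stack` enforces the "adjacency-like" contract for free: anything mis-shaped fails immediately.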
A brief pause...
We have spent a lot of time not on the fancy deep learning piece, for a very good reason - because graphs are such a flexible data structure, we have to have patterns in place to help build compatibility between different graph types. By targeting common data structures, like NetworkX graph objects, NumPy arrays, Pandas DataFrames and XArray DataArrays, we ensure native compatibility with a very broad swathe of data science computing environments. This fulfills the goal of democratization; use of Pandas and XArray ensures that we have the ability to easily inspect the array forms of the data before we use it for any deep learning purposes. And by not reinventing the wheel, we lessen the maintenance burden on ourselves.
The High Level API
We'll now go into what the high-level API could look like. With a dummy configuration that looks something like this:
This is what we think the high-level API might look like. It's simple (users only have to remember `read_pdb` and `GrapheinConfig`), comes with a lot of sane defaults in the Config object, and dispatches to the right routine.

Lessons learned from reimplementing UniRep tell me (EM) that having users manually write code to preprocess data into tensor form introduces the possibility that they unexpectedly generate an adversarial-like example that has all the right tensor properties (shape, values, no nulls) but is semantically incorrect because the values were preprocessed incorrectly. Hence, a high-level API that accepts the idiomatic data structure (in this case a PDB file, just as a string is the idiomatic data structure for a protein sequence) and also handles the preprocessing correctly is going to be very enabling for reproducibility purposes.
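For illustration, the surface area of such an API might be as small as this (a sketch only; the fields, defaults, and function body are assumptions, not the final design):

```python
from dataclasses import dataclass, field

@dataclass
class GrapheinConfig:
    """Hypothetical config object carrying sane defaults."""
    granularity: str = "residue"          # "atom" or "residue"
    edge_funcs: list = field(default_factory=list)
    node_feature_funcs: list = field(default_factory=list)

def read_pdb(path, config=None):
    """Load a PDB file and dispatch to the low-level routines
    implied by `config` (placeholder body)."""
    config = config or GrapheinConfig()
    ...
```

Everything a power user can reach through the low-level API (custom edge constructors, custom node featurizers) rides along in the config, so the one-liner `read_pdb(path)` and full customization share a single entry point.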