ericmjl opened 4 years ago
Hi Eric, thanks for writing this up. It's super comprehensive! You definitely took better notes than I did.
I'm going to start refactoring the protein graph component this week into the functional paradigm outlined here. I don't think it'll be too taxing. I think we can continue on the dev branch. I'll make a PR to add the basic plotting functionality that exists there to the extant version of graphein, so that it's available to people already using graphein until we're happy to share v0.1.0!
I've added some checkboxes to your comment so we can keep track of where we're at. I'll ping you back once it's in a state to start writing some tests.
I'm really excited to work with you on this! It's a great chance to learn and hopefully we can build something much more useful for the community.
@a-r-j I made a call to stick those three helper functions inside NetworkX - I think this might help simplify Graphein a bit more, so we can have laser-like focus on functions that take a node and return a pandas Series of amino acid scales. (Here's the PR to NetworkX, btw.)
One resource a colleague pointed me to is EXPASY's amino acid scales. There, we have a ton of amino acid descriptors (something like 50+), which can be used to generate an amino acid node feature set. I put the script up as a gist; it took us about an hour to parse the webpages, but eventually "webpage == API" became a reality :smile:. I think it should be fair game to bundle up the CSV file inside Graphein (kinda like, "just distribute it"), and then use that CSV file in conjunction with the following node-level function:
```python
import pandas as pd

# Scales table bundled with Graphein: rows are scale names,
# columns are three-letter amino acid codes.
aa_feats = pd.read_csv("amino_acid_properties.csv", index_col=0)

def featurize_amino_acid(n, d, aa_feats):
    """Return a numeric Series of scales for node `n`, given its data dict `d`."""
    aa = d["residue_name"]
    feats = pd.Series(aa_feats[aa], name=n)
    return feats
```
We can also break up the function a bit, making it configurable so end-users can decide what features should be used to describe an amino acid. With the generate_X functions in NetworkX, and with this baseline, I think it'll be a great starting point, and the interoperability with the rest of the PyData ecosystem can really strengthen the impact of the package.
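To make the feature choice configurable, one light-touch option (a sketch only; the `features` argument and the tiny sample scales table below are hypothetical, not Graphein API) is to add a `features` argument and partial it out:

```python
from functools import partial

import pandas as pd

# Hypothetical stand-in for the bundled amino_acid_properties.csv:
# rows are scale names, columns are three-letter amino acid codes.
aa_feats = pd.DataFrame(
    {"ALA": [1.8, 88.6], "MET": [1.9, 162.9]},
    index=["hydrophobicity", "volume"],
)

def featurize_amino_acid(n, d, aa_feats, features=None):
    """Return a numeric Series of scales for node `n`.

    `features` lets the end-user choose which scales describe
    an amino acid; None means "use them all".
    """
    feats = pd.Series(aa_feats[d["residue_name"]], name=n)
    if features is not None:
        feats = feats.loc[features]
    return feats

# Partial out the lookup table and the feature choice for reuse
# as a uniform (n, d) -> Series node function:
hydrophobicity_only = partial(
    featurize_amino_acid, aa_feats=aa_feats, features=["hydrophobicity"]
)
```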
@ericmjl I've added the datafile for the Meiler embeddings and a loading function. Something we need to think about is how we handle non-standard/modified amino acids. Those that I'm aware of:
MSE,PTR,TPO,SEP,KCX,LLP,PCA,CSO,CAS,CAF,CSD
Biological data is never as straightforward as we'd like 😅
What I've done previously is to use the values for the 'parent' amino acid (e.g. using the values for MET for MSE). This is obviously a bit dirty and not entirely valid, but I'm not sure what else we can do short of generating the values for the non-standard AAs. Perhaps we could provide some options for that workaround: initialising the features with 0s, or with the mean values.
In any case, we'll need a way to handle non-standard residues in the lookup functions to avoid throwing errors.
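One way those fallback options could look (a sketch; the function name is hypothetical, and the parent mapping below is deliberately incomplete, covering only a few of the modified residues listed above):

```python
import pandas as pd

# Hypothetical parent mapping for a few modified residues,
# following the "use the parent amino acid" workaround (MSE -> MET, etc.).
NON_STANDARD_PARENTS = {"MSE": "MET", "PTR": "TYR", "TPO": "THR", "SEP": "SER"}

# Tiny sample scales table: rows are scales, columns are residue codes.
aa_feats = pd.DataFrame(
    {"MET": [1.9, 162.9], "SER": [-0.8, 89.0]},
    index=["hydrophobicity", "volume"],
)

def lookup_residue(res, aa_feats, strategy="parent"):
    """Look up features for a residue, with a fallback for non-standard ones.

    strategy: "parent" reuses the parent amino acid's values,
    "zero" initialises with 0s, "mean" with the column-wise mean.
    """
    if res in aa_feats.columns:
        return aa_feats[res]
    if strategy == "parent" and res in NON_STANDARD_PARENTS:
        return aa_feats[NON_STANDARD_PARENTS[res]].rename(res)
    if strategy == "zero":
        return pd.Series(0.0, index=aa_feats.index, name=res)
    if strategy == "mean":
        return aa_feats.mean(axis=1).rename(res)
    raise KeyError(f"No feature lookup available for residue {res!r}")
```

The explicit `raise` at the end is the "avoid throwing errors" point in reverse: unknown residues fail loudly at lookup time rather than silently producing nulls downstream.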
> Biological data is never as straightforward as we'd like 😅
:joy: Biology breaks all rules indeed! 😅
I wonder what you think about this: for now, we stick to just the standard amino acids? Imposing this first constraint helps with shipping something useful, which is good for publication purposes too, before expanding out to more edge-y cases.
Are non-standard amino acids common in PDB structures? Just wondering; if they're only a small fraction, then leaving them temporarily out of scope while we give more thought to this special category of amino acids, and let time help crystallize out its relations to standard amino acids, could help avoid local minima traps with code structure/programming idioms.
If it helps, I've been thinking about how to handle non-standard amino acids in jax-unirep as well. Because non-standard a.a.s are a human-generated thing, like all human-generated things there's going to be a ton of ways other humans won't conform to "standard".
I think it's less of a problem for human proteins.
My understanding is that they're more common (but still relatively rare) in bacteria and archaea. I think there are two main sources for them:
I don't have a full set of numbers for how common they are, but as I said, I've run into this problem on certain standard datasets I've worked on like PDBBind. This page says there are >700 structures with D-amino acids and 5 with PYL. That's probably a good argument for ignoring them for the time being.
I guess the easiest solution is to flag these proteins and users can decide whether or not they are happy to exclude them from their datasets.
If we're happy to proceed on the basis of ignoring them, I'll add a `valid_protein()` function to do some sanity checks on the structures/sequence before the graph construction
Genetically encoded (CSE, PYL), sometimes called the 21st and 22nd amino acids, which are present in all domains of life IIRC. These would be more of a priority.
I learned something new today! That and D-amino acids being present in crystal structures too. This is cool. Is there a special representation in PDB files for a D-amino acid? If so, it's worth adding a featurization function called "amino_acid_chirality", perhaps?
> If we're happy to proceed on the basis of ignoring them, I'll add a `valid_protein()` function to do some sanity checks on the structures/sequence before the graph construction
I think that'll help, especially in the early stages when the package is a bit more constrained in scope.
I had a quick look at the PDB file for 5E5T. It seems they define the D-amino acids as HETATMs in the header:
```
HETNAM DSN D-SERINE
HETNAM DSG D-ASPARAGINE
HETNAM DPN D-PHENYLALANINE
HETNAM DCY D-CYSTEINE
HETNAM DAS D-ASPARTIC ACID
HETNAM DLY D-LYSINE
HETNAM DLE D-LEUCINE
HETNAM DAR D-ARGININE
HETNAM DAL D-ALANINE
HETNAM DTY D-TYROSINE
HETNAM DIL D-ISOLEUCINE
HETNAM DGL D-GLUTAMIC ACID
HETNAM DVA D-VALINE
HETNAM DPR D-PROLINE
HETNAM DTH D-THREONINE
HETNAM DHI D-HISTIDINE
HETNAM FMT FORMIC ACID
HETNAM EDO 1,2-ETHANEDIOL
HETSYN EDO ETHYLENE GLYCOL
```
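For concreteness, a minimal sketch of pulling these records out of a PDB header; it assumes single-line HETNAM names (real HETNAM records can wrap across numbered continuation lines, which this ignores) and flags D-amino acids purely by their chemical name:

```python
# Sample header lines, taken from the listing above.
HEADER = """\
HETNAM DSN D-SERINE
HETNAM DAL D-ALANINE
HETNAM EDO 1,2-ETHANEDIOL
"""

def parse_hetnams(pdb_text):
    """Return {het_code: chemical_name} from HETNAM records."""
    names = {}
    for line in pdb_text.splitlines():
        if line.startswith("HETNAM"):
            # Split into record name, het code, and the remaining name text.
            fields = line.split(None, 2)
            names[fields[1]] = fields[2]
    return names

def d_amino_acids(hetnams):
    """Het codes that look like D-amino acids, judged by their names."""
    return {code for code, name in hetnams.items() if name.startswith("D-")}
```

Filtering on the name prefix is a heuristic; cross-checking codes against the PDB's chemical component dictionary would be the robust version.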
The PDB maintains a list of chemical entities and their abbreviations, so it should be a fairly straightforward lookup. Hopefully, these conventions are standard in practice. We'd probably want to label non-standard residues differently from the other HETATMs in the dataframe to allow for more nuanced filtering, e.g. keeping non-standard residues but removing all the water molecules. In any case, I think we should probably keep one of those dictionaries in graphein for easy conversion between PDB ligand identifiers and SMILES/InChI, though it's not a priority at this point.
On that page there's also a list of the PDB structures that contain each ligand which might be useful, though I'm not sure how regularly updated it is.
Lemme raise an issue for that point you made above, so it's trackable in the pseudo-todo list that is the Issue tracker.
Hey, @ericmjl. Happy Holidays!
I'm back on Graphein (mostly) full-time now. I've added some functionality for computing sequence-level embeddings from some pre-trained language models (ESM, BioVec). I plan to add a bunch from other sources (e.g. DSSP, propy, etc.). @cch1999 contributed some edge computation functions, so we have an MVP of the protein graph construction now.
The outstanding minor tasks atm are:
I think this should be fairly quick - I should be able to wrap these things up this week. I think we're getting to the point where we need to start considering edge cases a bit more closely (e.g. non-standard amino acids). These will break a number of the feature computations if we don't handle them.
We've also had a great contribution from @rvinas to the main branch - we now have support for creating PPI graphs (and I added RNA graphs a while back). Next week I'll look at porting this over to the new API.
Beyond this, there are a number of accessory features I think would be useful:
Keen to hear what you think about some of these things!
Hello @a-r-j! Great to hear from you again. I remember you headed down to Spain/Portugal; are you enjoying your time there?
Looking at the list, this is really cool! I have a few suggestions to make the list even more awesome, hope you don't mind!
Dataframe type hinting: I'd suggest we use pandera. It's got all of the things you're looking for from DataEnforce, plus more.
Subsetting structures: Depending on what category of subsetting we're going for, there's a bit of complexity to untangle.
I once wrote a function that uses NetworkX's API and grabs out a radius of nodes from a central starter node. I can provide it here if that's helpful. The API looks something like this:
```python
import networkx as nx

def subset(G: nx.Graph, central_node, radius: int) -> nx.Graph:
    # Stuff happens resulting in a subset of nodes being identified
    # (one way: nx.ego_graph collects all nodes within `radius` hops
    # of the central node), and then...
    subset_nodes = nx.ego_graph(G, central_node, radius=radius).nodes()
    return G.subgraph(subset_nodes)
```
That is the cleanest "subsetting" function I can think of, but it also risks being not biologically relevant. If you're game for it, I'll submit a small PR to stick that in an appropriate place in the library.
There are others racing through my head right now, but they're not "clean" implementations. Finding homologous structures is one thread of thought, but I'm not quite sure how to accomplish this using just graph objects.
I'm back in (the other) Cambridge these days. Missing the sunshine but it's not much to complain about compared to a New England winter!
Happy to go with pandera - I'll check it out!
I've added the distance-based edges from your library now. I took the liberty of adding edge constructors based on k-nearest neighbours and Euclidean distance. I also added a `long_interaction_threshold` parameter. The idea is that we might sometimes want to ignore edges between residues that are close in sequence space.
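To make the idea concrete, here's a rough sketch of how a distance-based edge constructor with a `long_interaction_threshold` could behave. The function name, and using node-list order as a proxy for sequence position, are assumptions rather than Graphein's actual implementation:

```python
import networkx as nx
import numpy as np

def add_distance_edges(G, coords, threshold=8.0, long_interaction_threshold=0):
    """Connect residues whose coordinates fall within `threshold` Angstroms.

    Pairs that are `long_interaction_threshold` or fewer positions apart
    in sequence space are skipped, so only "long-range" contacts remain.
    Assumption: node-list order approximates sequence order.
    """
    nodes = list(G.nodes())
    X = np.asarray([coords[n] for n in nodes], dtype=float)
    # Pairwise Euclidean distance matrix via broadcasting.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if D[i, j] <= threshold and (j - i) > long_interaction_threshold:
                G.add_edge(
                    nodes[i], nodes[j], kind="distance", distance=float(D[i, j])
                )
    return G
```

With `long_interaction_threshold=0` everything within the distance cutoff is connected; raising it drops the near-in-sequence contacts.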
Hey @a-r-j, just pinging back here as promised yesterday with what probably could be considered a design doc for Graphein. I'm excited about this! I also know that my memory is gradually degrading as I get older, so please correct me if I'm wrong anywhere, and don't hesitate to directly edit my post.
Authors
Introduction
Graphein is an awesome idea: basically, a package that flexibly handles the construction of graph representations of biological data, with the goal of democratizing access to biological graph data for all (machine learners and biologists alike).
In its current state, the package might be too opinionated on what backing packages are used, which affects the portability of the data structures. We would like to propose an improved software design that leverages the ecosystem of the Python data science stack.
Graph Preliminaries
Graphs can be represented in both object form and array form, and this gives us different things.
In the object form, we have node lists and metadata, as well as edge lists and metadata. NetworkX graph objects are one example of this. The object form is rich and flexible - it can store information of arbitrary depth. It can store arbitrary key-value paired metadata. Any hashable object can be a node, as long as the hashable object shows up only once in the data structure. Metadata values can be any Python object. This level of flexibility allows us to construct very human-readable graphs in Python.
In the array form, we have node feature arrays and adjacency matrix-like matrices. This pair of objects is more restricted. For downstream deep learning purposes, the data types of these arrays have to be numeric in nature. Nodes aren't easily indexed by a string name; integer indexing is required. Human auditability of the internal data structure is hampered by the fact that the data are represented in numbers. The metadata are also not easily indexed by string names. That said, the array form is structured and efficient for computation.
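A tiny illustration of the two forms using plain NetworkX and NumPy:

```python
import networkx as nx

# Object form: string-named nodes carrying arbitrary key-value metadata.
G = nx.Graph()
G.add_node("A:GLY:1", residue_name="GLY")
G.add_node("A:ALA:2", residue_name="ALA")
G.add_edge("A:GLY:1", "A:ALA:2", kind="peptide_bond")

# Array form: integer-indexed and purely numeric, ready for DL frameworks.
# Fixing a node order pins down the string -> integer mapping.
node_order = list(G.nodes())
A = nx.to_numpy_array(G, nodelist=node_order)
```

The `node_order` list is the bridge between the two worlds: it is what lets you map row `i` of the adjacency matrix back to a human-readable node name.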
In terms of a principled data preparation flow:
Design goals and implications
Implication: Target NetworkX graphs, Pandas DataFrames, XArray DataArrays before dispatching/converting to specialized data structures like DGL graphs. This is in line with democratization, as these are the idiomatic and easily accessible data structures of the PyData community.
Implication: Cleanly separate data loading from calculation of derivative data (e.g. things that can be looked up in a lookup table, or other simple/complex calculations), and avoid mixing them up in the same step.
Implication: Simplify the installation instructions to follow PyData community idioms as much as possible (e.g. `conda install graphein`).
(`black`) to construct these graphs.

Implication: APIs that work at different levels are needed: a high-level API for ease-of-use that targets an opinionated subset of the range of possible edge types, adjacency-like matrices, and node features, leveraging a lower-level API that power users can drop down to for customizing how they construct a graph for their biological data. This will prevent Graphein from being a god-like package that makes things very easy for end-users but hamstrings power users from customizing things.
Lower Level API
The key categories of steps that we probably need to be concerned about here are as follows:
Together, these probably form the low-level API, on which the high-level convenience API can be built, which essentially provides a mapping from keyword arguments to function call dispatches underneath the hood.
In a bit more detail, I'd like to propose additional structure for the steps.
Construction of NetworkX graphs from PDB DataFrames
This should be opinionated, and only have two flags: whether to use an "atom" graph, or whether to use a "biological unit" graph (amino acids for proteins, nucleotides for RNA/DNA).
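A minimal sketch of what that two-way flag could look like; the function name and the DataFrame column names (mirroring BioPandas-style columns) are assumptions, not settled API:

```python
import networkx as nx
import pandas as pd

def construct_graph(pdb_df, granularity="residue"):
    """Build a NetworkX graph from a PDB-style DataFrame.

    granularity: "atom" -> one node per atom,
    "residue" -> one node per biological unit (amino acid / nucleotide).
    """
    G = nx.Graph()
    if granularity == "atom":
        for _, row in pdb_df.iterrows():
            G.add_node(row["atom_serial"], **row.to_dict())
    elif granularity == "residue":
        for (chain, resnum, resname), _ in pdb_df.groupby(
            ["chain_id", "residue_number", "residue_name"]
        ):
            G.add_node(f"{chain}:{resname}:{resnum}", residue_name=resname)
    else:
        raise ValueError(f"Unknown granularity: {granularity!r}")
    return G
```

Edge construction is deliberately absent here: under the proposed design it belongs to a separate, composable step.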
Annotating node metadata
The pattern would be a dispatching function, possibly named `annotate_node_metadata(G, funcs)`. `funcs` is a list of functions that gets looped over and called on each node. Each func modifies `G` in-place, and has a uniform signature of `(G, n)`.

Now, we can have a parent function that loops over a bunch of `add_some_edge_type(G)`-class functions. In a functional programming paradigm, these can be partialled with defaults for a bunch of default configuration settings. If a custom function relies on external data, that can be partialled out too.
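Sketched out, the dispatching pattern might look like this (the helper functions and the scale value are illustrative, not Graphein code):

```python
from functools import partial

import networkx as nx

def annotate_node_metadata(G, funcs):
    """Dispatcher: loop over funcs, calling each on every node in-place."""
    for func in funcs:
        for n in G.nodes():
            func(G, n)
    return G

# Each func has the uniform signature (G, n) and mutates G in-place.
def add_residue_flag(G, n):
    G.nodes[n]["is_standard"] = True

# External data gets partialled out so the signature stays (G, n):
def add_scale(G, n, scales):
    G.nodes[n]["hydrophobicity"] = scales[G.nodes[n]["residue_name"]]

kd = {"GLY": -0.4}  # tiny Kyte-Doolittle-style lookup, for illustration
funcs = [add_residue_flag, partial(add_scale, scales=kd)]
```

The same shape works for edge functions: a parent loop over `add_some_edge_type(G)`-style callables, each partialled with its configuration.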
In this design, we again ask that functions have a uniform signature and return a pandas Series. Then, construction of a DataFrame can happen. The values of the Series should be numeric (float or int) and non-null. This guarantees that the DataFrame contains only numeric values, and we can then freely convert between the dataframe form and the array form.
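A small sketch of enforcing that contract when assembling the DataFrame (the function name is hypothetical):

```python
import numpy as np
import pandas as pd

def build_feature_dataframe(series_list):
    """Assemble per-node Series into a node-feature DataFrame.

    Enforces the contract above: every value must be numeric and
    non-null, so the frame converts cleanly to the array form.
    """
    df = pd.DataFrame(series_list)  # each Series becomes one row, named by node
    if df.isnull().any().any():
        raise ValueError("Node features must not contain nulls.")
    if not all(np.issubdtype(dt, np.number) for dt in df.dtypes):
        raise ValueError("Node features must be numeric (float or int).")
    return df
```

Once validated, `df.to_numpy()` gives the array form, with `df.index` preserving the node-name-to-row mapping.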
Construction of adjacency-like matrices
Now, we explore how we can construct the adjacency-like matrices.
It's possible to construct a wide range of adjacency matrices on which message passing can be performed. For example, one might want to do message passing on the 1-degree adjacency matrix, or do it on a 2-degree adjacency matrix where anything within two degrees of separation is given a "1". One may want to leverage the graph Laplacian matrix too. As such, rather than just a matrix, we might want to generate an adjacency tensor. A proposed function we could include in the library, or propose to the NetworkX library, is as follows:
In doing this, we generate an XArray 3D tensor of adjacency-like matrices. A prerequisite is that each adjacency matrix generated from `func` should be an XArray DataArray with one tensor dimension called "name", which houses the name of the adjacency-like matrix. This conversion can be made easier by providing one more function.

Now, an end-user can calculate a NumPy array any way they like, then name it easily and stick it into a uniform data container, such as an XArray DataArray, that allows them to inspect it.
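A rough sketch of the stacking idea using plain NumPy (the function names are hypothetical; with xarray installed, the stacked result could be wrapped in a DataArray whose "name" dimension carries the matrix names, as proposed above):

```python
import networkx as nx
import numpy as np

def generate_adjacency_tensor(G, funcs):
    """Stack named adjacency-like matrices into one 3D tensor.

    Each func maps a graph to a (name, matrix) pair; the names list
    plays the role of the proposed "name" tensor dimension.
    """
    names, mats = zip(*(func(G) for func in funcs))
    return list(names), np.stack(mats)

def adjacency(G):
    return "adjacency", nx.to_numpy_array(G)

def laplacian(G):
    # laplacian_matrix returns a SciPy sparse matrix; densify for stacking.
    return "laplacian", nx.laplacian_matrix(G).toarray()
```

Because every matrix must share the graph's node ordering and shape, `np.stack` enforces the "adjacency-like" contract for free: anything mis-shaped fails immediately.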
A brief pause...
We have spent a lot of time not on the fancy deep learning piece, for a very good reason - because graphs are such a flexible data structure, we have to have patterns in place to help build compatibility between different graph types. By targeting common data structures, like NetworkX graph objects, NumPy arrays, Pandas DataFrames and XArray DataArrays, we ensure native compatibility with a very broad swathe of data science computing environments. This fulfills the goal of democratization; use of Pandas and XArray ensures that we have the ability to easily inspect the array forms of the data before we use it for any deep learning purposes. And by not reinventing the wheel, we lessen the maintenance burden on ourselves.
The High Level API
We'll now go into what the high-level API could look like. With a dummy configuration that looks something like this:
This is what we think the high-level API might look like. It's simple (users only have to remember `read_pdb` and `GrapheinConfig`), comes with a lot of sane defaults in the Config object, and dispatches to the right routine.

Lessons learned from reimplementing UniRep tell me (EM) that having users manually write code to preprocess data into tensor form introduces the possibility that they unexpectedly generate an adversarial-like example that has all the right tensor properties (shape, values, no nulls) but is semantically incorrect because the values were preprocessed incorrectly. Hence, a high-level API that accepts the idiomatic data structure (in this case a PDB file, just as a string is the idiomatic data structure for a protein sequence) and also handles the preprocessing correctly is going to be very enabling for reproducibility purposes.
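For illustration, the surface area of such an API might be as small as this (a sketch only; the fields, defaults, and function body are assumptions, not the final design):

```python
from dataclasses import dataclass, field

@dataclass
class GrapheinConfig:
    """Hypothetical config object carrying sane defaults."""
    granularity: str = "residue"          # "atom" or "residue"
    edge_funcs: list = field(default_factory=list)
    node_feature_funcs: list = field(default_factory=list)

def read_pdb(path, config=None):
    """Load a PDB file and dispatch to the low-level routines
    implied by `config` (placeholder body)."""
    config = config or GrapheinConfig()
    ...
```

Everything a power user can reach through the low-level API (custom edge constructors, custom node featurizers) rides along in the config, so the one-liner `read_pdb(path)` and full customization share a single entry point.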