The ProteinCartography pipeline searches sequence and structure databases for matches to input proteins and builds maps of protein space for the purposes of discovery and exploration.
You can find a general overview of the pipeline in the accompanying Pub. The results of a 25-protein meta-analysis of top-studied human proteins can be found on Zenodo.
Comparing protein structures across organisms can help us generate interesting biological hypotheses. This pipeline allows users to build interactive maps of structurally similar proteins useful for discovery and exploration.
Our pipeline starts with user-provided protein(s) of interest and searches the available sequence and structure databases for matches. Using the full list of matches, we can build a "map" of all the similar proteins and look for clusters of proteins with similar features. Overlaying a variety of different parameters such as taxonomy, sequence divergence, and other features onto these spaces allows us to explore the features that drive differences between clusters.
Because this tool is based on global structural comparisons, note that the results are not always useful for long proteins (>1200 amino acids), multi-domain proteins, or proteins with large unstructured regions. Additionally, while we find that the results for average length, well-structured proteins appear generally as expected, we have not yet comprehensively validated the clustering parameters, so users may find that different parameters work better for their specific analyses.
The ProteinCartography pipeline supports Linux and macOS operating systems; it does not work on Windows. It also requires that you have `conda` or `mamba` installed. If you have an M series Mac, you will need to install an x86-64 version of `conda`. See below for some tips on how to do this.
Clone the GitHub repository:
git clone https://github.com/Arcadia-Science/ProteinCartography.git
Install `conda` and/or `mamba` if you don't already have them installed.
Create and activate the `cartography_tidy` environment:
conda env create -f envs/cartography_tidy.yml -n cartography_tidy
conda activate cartography_tidy
Run the pipeline with the demo config, setting `n` to be the number of cores you'd like to use for running the pipeline:
snakemake --snakefile Snakefile --configfile demo/search-mode/config_actin.yml --use-conda --cores n
In the `demo/output/final_results/` directory, you should find the following files:
- `actin_aggregated_features.tsv`: metadata file containing protein feature hits
- `actin_aggregated_features_pca_umap.html`: interactive UMAP scatter plot of results
- `actin_aggregated_features_pca_tsne.html`: interactive t-SNE scatter plot of results
- `actin_leiden_similarity.html`: mean cluster TM-score similarity heatmap
- `actin_semantic_analysis.html` and `actin_semantic_analysis.pdf`: simple semantic analysis of clusters

We have been able to successfully run the pipeline on macOS and Amazon Linux 2 machines with at least 8GB RAM and 8 cores.
For the data generated for our pub, we ran the pipeline on an AWS EC2 instance of type `t2.2xlarge` (32 GiB RAM + 8 vCPU).
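After the demo run, a quick way to confirm that the expected files were produced is to check for them by name. The sketch below assumes the file names listed above and the `demo/output/final_results/` location; the helper name `missing_outputs` is hypothetical and not part of the pipeline.

```python
from pathlib import Path

# File names copied from the demo output list in this README.
EXPECTED_OUTPUTS = [
    "actin_aggregated_features.tsv",
    "actin_aggregated_features_pca_umap.html",
    "actin_aggregated_features_pca_tsne.html",
    "actin_leiden_similarity.html",
    "actin_semantic_analysis.html",
    "actin_semantic_analysis.pdf",
]


def missing_outputs(results_dir):
    """Return the expected demo files that are absent from results_dir."""
    results = Path(results_dir)
    return [name for name in EXPECTED_OUTPUTS if not (results / name).is_file()]
```

Calling `missing_outputs("demo/output/final_results")` from the repository root returns an empty list when every listed file is present.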
To run the pipeline on an M-series Mac (with arm64 architecture), you will need to install an x86-64 version of `conda`. One way to do this is to install Miniconda using the x86-64 `.pkg` installer from the miniconda website. Your Mac should recognize that the installer is for x86-64 and automatically use Rosetta 2 to run it. Alternatively, you can run the `conda` installer script from the command line with the `arch -x86_64` command. For example:
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
arch -x86_64 /usr/bin/env bash Miniconda3-latest-MacOSX-x86_64.sh
The x86 version of `conda` will automatically install x86-64 versions of all packages. If you have already installed `conda` on your M-series Mac, be careful to install the x86 version in a different location. To check that you are using the correct version of `conda`, you can run `conda info` and look for the `platform` field in the output. It should say `osx-64`.
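The platform check can also be scripted. This is a minimal sketch that assumes `conda` is on your `PATH` and relies on the `platform` key in the JSON output of `conda info --json`; the function names are hypothetical helpers, not part of the pipeline.

```python
import json
import subprocess


def is_x86_64_macos_conda(conda_info):
    """True if a parsed `conda info --json` payload reports the osx-64 platform."""
    return conda_info.get("platform") == "osx-64"


def read_conda_info():
    """Run `conda info --json` and parse its output (requires conda on PATH)."""
    result = subprocess.run(
        ["conda", "info", "--json"], capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```

On an M-series Mac, `is_x86_64_macos_conda(read_conda_info())` should be `True` for the x86-64 install and `False` for a native arm64 install.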
The pipeline supports two modes: Search and Cluster. Both modes are implemented in the same `Snakefile`. The mode in which the pipeline runs is controlled by the `mode` parameter in the `config.yml` file.
In this mode, the pipeline starts with a set of input proteins of interest in PDB and FASTA format and performs broad BLAST and Foldseek searches to identify hits. The pipeline aggregates all hits, downloads PDBs, and builds a map.
- Each input protein needs a unique protein identifier (`protid`).
- The `protid` should be the prefix of the FASTA and PDB files (e.g. `P60709.fasta`, `P60709.pdb`).
- A `config.yml` file.
  - `config.yml` contains the default parameters of the pipeline, which are used if a custom file is not provided.
  - Passing a custom file with the `--configfile` flag to Snakemake overrides the defaults in `config.yml`.
  - `input`: directory containing input PDBs and FASTAs.
  - `output`: directory where all pipeline outputs are placed.
  - `analysis_name`: nickname for the analysis, appended to important output files.
  - See `config.yml` for additional parameters.
- An optional `features_override.tsv` file.
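Because the `protid` doubles as the file prefix, it can be useful to check the input directory before launching a run. The sketch below assumes that every `protid` is expected to have both a `.fasta` and a `.pdb` file; the helper name is hypothetical.

```python
from pathlib import Path


def find_unpaired_protids(input_dir):
    """Map each protid lacking a FASTA or PDB partner to the extensions it is missing."""
    required = {".fasta", ".pdb"}
    found = {}
    for path in Path(input_dir).iterdir():
        if path.suffix in required:
            # The file stem (e.g. P60709) is treated as the protid.
            found.setdefault(path.stem, set()).add(path.suffix)
    return {
        protid: sorted(required - suffixes)
        for protid, suffixes in found.items()
        if suffixes != required
    }
```

An empty dictionary means every `protid` in the directory has both files.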
Create a `config.yml` file specifying input and output directories and an analysis name.
To fetch the input files for an accession, run:
python ProteinCartography/fetch_accession.py -a {accession} -o input -f fasta pdb
This saves a FASTA file from UniProt and a PDB file from AlphaFold to the `input/` folder.
Run the pipeline, replacing `config.yml` with your config file and `n` with the number of cores you want to allocate to Snakemake.
snakemake --configfile config.yml --use-conda --cores n
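As an illustration of the key parameters, the sketch below renders a small custom config as YAML text. The parameter names (`mode`, `input`, `output`, `analysis_name`) come from this README; the value `search` for `mode` and the helper `render_config` are assumptions, so check the repository's `config.yml` for the exact accepted values.

```python
def render_config(mode, input_dir, output_dir, analysis_name):
    """Render the key parameters as YAML text (plain string values only)."""
    settings = {
        "mode": mode,  # assumed values: "search" or "cluster"
        "input": input_dir,
        "output": output_dir,
        "analysis_name": analysis_name,
    }
    return "".join(f"{key}: {value}\n" for key, value in settings.items())
```

Writing the returned text to a file and passing that file to `--configfile` would apply these settings on top of the defaults.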
In this mode, the pipeline starts with a folder containing PDBs of interest and performs just the clustering and visualization steps of the pipeline, without performing any searches or downloads.
- Each input PDB needs a unique protein identifier (`protid`) which matches the PDB file prefix, as described for Search mode.
- A `config.yml` file with custom settings.
  - `config.yml` is an example file that contains the defaults of the pipeline.
  - Passing a custom file with the `--configfile` flag to Snakemake overrides the defaults.
  - `input`: directory containing input PDBs and FASTAs.
  - `output`: directory where all pipeline outputs are placed.
  - `analysis_name`: nickname for the analysis, appended to important output files.
  - `features_file`: path to features file (described below).
  - `keyids`: a list of one or more key `protid`s corresponding to the proteins to highlight in the output plots (similar to how the input proteins are highlighted in 'search' mode). Note: if not provided, the output directory `key_protid_tmscore_results` will be empty, as will the `protein_features/key_protid_tmscore_features.tsv` file.
  - See `config.yml` for additional parameters.
- A features file. This is usually called `uniprot_features.tsv`, but you can use any name.

Add your PDB files to the `input` directory.
Create a `config.yml` file specifying input and output directories and an analysis name.
Create a `uniprot_features.tsv` file.
You can generate this one of two ways. If you have a list of UniProt accessions, you can provide that as a .txt file and automatically pull down the correct annotations. Alternatively, you can manually generate the file.
Using `fetch_uniprot_metadata.py`: create a `uniprot_ids.txt` file that contains a list of UniProt accessions, one per line, then run:
python ProteinCartography/fetch_uniprot_metadata.py --input uniprot_ids.txt --output input/uniprot_features.tsv
If you generate the file manually, use `protid` as the column name for the protein identifiers.
Run the pipeline, replacing `config.yml` with your config file and `n` with the number of cores you want to allocate to Snakemake.
snakemake --configfile config.yml --use-conda --cores n
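For the manual route to a features file, a tab-separated table with `protid` as the identifier column can be written with the standard library. This is a minimal sketch: the helper name is hypothetical, and the example columns are borrowed from the default feature columns described later in this README.

```python
import csv
import io


def write_features_tsv(rows, handle, columns):
    """Write protein metadata rows as a TSV whose first column is protid."""
    fieldnames = ["protid"] + [c for c in columns if c != "protid"]
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Missing values are written as empty strings.
        writer.writerow({name: row.get(name, "") for name in fieldnames})


buffer = io.StringIO()
example_rows = [
    {"protid": "P42212", "Protein names": "Green fluorescent protein", "Length": 238}
]
write_features_tsv(example_rows, buffer, ["Protein names", "Length"])
```

Passing an open file handle instead of the `StringIO` buffer writes the table to disk, for example to `input/uniprot_features.tsv`.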
The Search mode of the pipeline performs all of the following steps. The Cluster mode starts at the "Clustering" step.
- Gather the input proteins, each identified by a unique protein identifier (`protid`).
- Search the AlphaFold databases using queries to the Foldseek webserver API for each provided `.pdb` file.
  - The pipeline searches `afdb50`, `afdb-proteome`, and `afdb-swissprot` for each input protein and aggregates the results. You can add or remove databases using the config file.
- Search the non-redundant GenBank/RefSeq database using blastp for each provided `.fasta` file.
  - Hits are mapped to UniProt IDs using `requests` and the UniProt REST API.
  - BLAST hits for each input protein are stored as `{protid}.blast_hits.refseq.txt` in the `output/blast_results/` directory.
- Aggregate the list of Foldseek and BLAST hits from all input files into a single list of UniProt IDs.
- Download annotation and feature information for each hit protein from UniProt.
- Filter proteins based on key UniProt metadata. The pipeline removes:
  - ... `config.yml` file.
- Download a `.pdb` file for each remaining protein from AlphaFold.
  - If this step fails, rerunning the pipeline with the `--rerun-incomplete` flag usually resolves this.
- Generate a similarity matrix and cluster all protein `.pdb` files using Foldseek.
  - `foldseek search` masks results with an e-value > 0.001. We set these masked values to 0.
  - `foldseek search` returns at most 1000 hits per protein. For spaces where there are >1000 proteins, this results in missing comparisons. We also set missing values to 0.
- Perform dimensionality reduction and clustering on the similarity matrix.
  - Clustering uses `scanpy`'s Leiden clustering implementation.

A note about nomenclature: the meaning of the phrase `key_protid` used in the filenames below is mode-dependent: in 'search' mode, the 'key' protids are simply the input protids, while in 'cluster' mode, the key protids are the protids specified in the `key_protids` list in the `config.yml` file.

- Generate a variety of `*_features.tsv` files:
  - `uniprot_features.tsv`
  - `struclusters_features.tsv`
  - `leiden_features.tsv`
  - `key_protid_tmscore_features.tsv`
  - `<key_protid>_fident_features.tsv`
  - `<key_protid>_concordance_features.tsv`
  - `source_features.tsv`
- Aggregate features.
  - The `*_features.tsv` files are combined into one large `aggregated_features.tsv` file.
- Calculate per-cluster structural similarities.
  - Output file names end with `_leiden_similarity.html` or `_structural_similarity.html`.
- Perform simple semantic analysis on UniProt annotations.
  - Output file names end with `_semantic_analysis.pdf`.
- Perform simple statistical tests on the aggregated features and create a violin plot.
  - The plot is saved as a `_distribution_analysis.svg` image file.
- Build an explorable HTML visualization using `Plotly` based on the aggregated features.
  - An example can be found here.
  - Each point has hover-over information.
Default parameters include:
- Broad taxon: each protein is assigned the first matching group from one of the ordered lists below, so a mouse protein, for example, would be assigned `Mammalia` but not `Vertebrata`.
  - Mammalia, Vertebrata, Arthropoda, Ecdysozoa, Lophotrochozoa, Metazoa, Fungi, Viridiplantae, Sar, Excavata, Amoebazoa, Eukaryota, Bacteria, Archaea, Viruses
  - Pseudomonadota, Nitrospirae, Acidobacteria, Bacillota, Spirochaetes, Cyanobacteria, Actinomycetota, Deinococcota, Bacteria, Archaea, Viruses, Metazoa, Fungi, Viridiplantae, Eukaryota
- Source: how the protein was found (`blast`, `foldseek` or `blast+foldseek`).

Power users can customize the plots using a variety of rules, described below.
plot_interactive()
The `plot_interactive()` function has two required arguments, one of which is a `plotting_rules` dictionary describing how the data should be plotted.
The `plotting_rules` dictionary should have the following format.
Each column is an entry in the dictionary containing a dictionary of rules.
{
    'column1.name': {
        'type': 'categorical',
        'parameter1': value,
        'parameter2': value,
        ...
    },
    'column2.name': {
        'type': 'hovertext',
        ...
    }
}
The possible rules for each column are as follows:
- 'type': `'categorical'`, `'continuous'`, `'taxonomic'`, or `'hovertext'`
- ... `np.nan`.
- ... `''`.
- ... `0`.
- ... `lambda x: str(x)`
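To make the dictionary format concrete, here is a small example together with a check that every column declares one of the four rule types. The column names come from the example tables in this README; the validator `invalid_rule_columns` is a hypothetical helper, and real rules may need parameters beyond `'type'`.

```python
ALLOWED_TYPES = {"categorical", "continuous", "taxonomic", "hovertext"}

# One entry per column; each value is a dictionary of rules for that column.
plotting_rules = {
    "LeidenCluster": {"type": "categorical"},
    "Length": {"type": "continuous"},
    "Lineage": {"type": "taxonomic"},
    "Protein names": {"type": "hovertext"},
}


def invalid_rule_columns(rules):
    """Return the columns whose rules lack a recognized 'type' entry."""
    return sorted(
        column
        for column, rule in rules.items()
        if not isinstance(rule, dict) or rule.get("type") not in ALLOWED_TYPES
    )
```

Running the check before plotting catches misspelled types early, which is cheaper than debugging a half-rendered figure.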
The pipeline generates a large number of .txt and .tsv files with specific formatting expectations. Many of the pipeline's scripts accept these specific format conventions as input or return them as output. These are the primary formats and their descriptions.
These files end with '.txt' and contain a list of accessions (RefSeq, GenBank, UniProt), one per line.
A0A2J8L4A7
K7EV54
A0A2J8WJR8
A0A811ZNA7
...
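Accession lists are simple enough to handle with a few lines of Python. The sketch below parses two such lists and labels each accession by where it was found, mirroring the `blast`, `foldseek` and `blast+foldseek` source labels mentioned earlier. Both function names are hypothetical; this is not the pipeline's own aggregation code.

```python
def parse_accessions(text):
    """Return the non-empty, stripped lines of an accession list file."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def label_sources(blast_hits, foldseek_hits):
    """Map each accession to 'blast', 'foldseek' or 'blast+foldseek'."""
    blast, foldseek = set(blast_hits), set(foldseek_hits)
    sources = {}
    for accession in sorted(blast | foldseek):
        if accession in blast and accession in foldseek:
            sources[accession] = "blast+foldseek"
        elif accession in blast:
            sources[accession] = "blast"
        else:
            sources[accession] = "foldseek"
    return sources
```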
These files end with '.tsv' and contain distance or similarity matrices, usually all-v-all.
Example:

| protid | A0A2J8L4A7 | K7EV54 | A0A2J8WJR8 | A0A811ZNA7 |
|---|---|---|---|---|
| A0A2J8L4A7 | 1 | 0.9 | 0.85 | 0.7 |
| K7EV54 | 0.9 | 1 | 0.91 | 0.6 |
| A0A2J8WJR8 | 0.85 | 0.91 | 1 | 0.71 |
| A0A811ZNA7 | 0.7 | 0.6 | 0.71 | 1 |
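The zero-filling behavior described for the similarity matrix (masked and missing comparisons both become 0) can be sketched as follows. The `(query, target, score, evalue)` record layout is a hypothetical simplification chosen for illustration; it is not Foldseek's actual output format, and this is not the pipeline's implementation.

```python
def build_similarity_matrix(protids, hits, max_evalue=0.001):
    """Build an all-v-all matrix from (query, target, score, evalue) records."""
    index = {protid: i for i, protid in enumerate(protids)}
    # Start from all zeros so that missing comparisons default to 0.
    matrix = [[0.0] * len(protids) for _ in protids]
    for query, target, score, evalue in hits:
        if query not in index or target not in index:
            continue
        # Hits above the e-value cutoff are treated as masked and left at 0.
        if evalue > max_evalue:
            continue
        matrix[index[query]][index[target]] = score
    return matrix
```

Row and column order follow `protids`, so the result lines up with the table layout shown in the example above.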
These files end with '.tsv' and contain a `protid` column, which is the unique identifier of each protein in the dataset.
The remaining columns are metadata for each protein. These metadata can be of any data type.
Example:

| protid | Length | LeidenCluster | Organism |
|---|---|---|---|
| A0A2J8L4A7 | 707 | 1 | Pan troglodytes (Chimpanzee) |
| K7EV54 | 784 | 2 | Pongo abelii (Sumatran orangutan) |
| A0A2J8WJR8 | 707 | 1 | Pongo abelii (Sumatran orangutan) |
| A0A811ZNA7 | 781 | 1 | Nyctereutes procyonoides (Raccoon dog) |
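A features file can be read with the standard `csv` module and grouped by any metadata column. The sketch below groups `protid` values by `LeidenCluster`, using the column names from the example above; the function name is a hypothetical helper.

```python
import csv
import io


def group_protids_by(text, column):
    """Group protid values by the given metadata column of a features TSV."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    groups = {}
    for row in reader:
        # Values are kept as strings, exactly as they appear in the file.
        groups.setdefault(row[column], []).append(row["protid"])
    return groups
```

For a file on disk, pass the result of `Path(path).read_text()` as `text`.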
A variety of metadata features for each protein are usually pulled from UniProt for visualization purposes. An example `features_file.tsv` is provided as part of the repo.
If you are providing a set of custom proteins (such as those not fetched from UniProt) when using the Search mode, you may want to include a `features_override.tsv` file that contains these features for your proteins of interest. This will allow you to visualize your protein correctly in the interactive HTML map. You can specify the path to this file using the `features_override_file` parameter in `config.yml`.
When using Cluster mode, you should provide protein metadata in a `uniprot_features.tsv` file; specify the path to this file using the `features_file` parameter in the config file.
Note that the `override_file` parameter also exists in the Cluster mode. The difference between `features_file` and `override_file` is that the former is used as the base metadata file (replacing the `uniprot_features.tsv` file normally retrieved from UniProt), whereas the latter is loaded after the base metadata file, replacing any information pulled from the `features_file`. In the Search mode, you can therefore use the `override_file` parameter to correct errors in metadata provided by UniProt or replace values for specific columns in the `uniprot_features.tsv` file that is retrieved by the pipeline.
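The precedence between the base metadata and the override file can be pictured as a layered dictionary update keyed by `protid`. This is a minimal sketch of that idea under the assumption that overrides replace values column by column; it is not the pipeline's actual loading code, and `apply_overrides` is a hypothetical name.

```python
def apply_overrides(base, overrides):
    """Return base metadata with override values layered on top, keyed by protid."""
    # Copy each record so the caller's base metadata is left unmodified.
    merged = {protid: dict(features) for protid, features in base.items()}
    for protid, features in overrides.items():
        merged.setdefault(protid, {}).update(features)
    return merged
```

Columns absent from the override keep their base values, and proteins that appear only in the override are added as new records.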
Whether you supply custom proteins through `override_file` (in either mode) or base metadata through `features_file` (in Cluster mode), you should strive to include the default columns in the table below. Features used for color schemes in the default plotting rules are marked with (Plotting) below. Features used only for hover-over description are marked with (Hovertext).
| feature | example | description | source |
|---|---|---|---|
| "protid" | "P42212" | (Required) the unique identifier of the protein. Usually the UniProt accession, but can be any alphanumeric string | User-provided or UniProt |
| "Protein names" | "Green fluorescent protein" | (Hovertext) a human-readable description of the protein | UniProt |
| "Gene Names (primary)" | "GFP" | (Hovertext) a gene symbol for the protein | UniProt |
| "Annotation" | 5 | (Plotting) UniProt Annotation Score (0 to 5) | UniProt |
| "Organism" | "Aequorea victoria (Water jellyfish) (Mesonema victoria)" | (Hovertext) Scientific name (common name) (synonyms) | UniProt |
| "Taxonomic lineage" | "cellular organisms (no rank), Eukaryota (superkingdom), ... Aequoreidae (family), Aequorea (genus)" | string of comma-separated Lineage name (rank) for the organism's full taxonomic lineage | UniProt |
| "Lineage" | ["cellular organisms", "Eukaryota", ... "Aequoreidae", "Aequorea"] | (Plotting) ordered list of lineage identifiers without rank information, generated from "Taxonomic lineage" | ProteinCartography |
| "Length" | 238 | (Plotting) number of amino acids in protein | UniProt |
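The relationship between "Taxonomic lineage" and "Lineage" can be sketched as stripping the trailing rank from each comma-separated entry. The second function shows one plausible reading of the ordered taxon lists given earlier, namely that a protein is assigned the first listed group found in its lineage; that first-match rule is an assumption, and both function names are hypothetical. The parser also assumes that lineage names contain no commas.

```python
import re

# Order copied from the first default taxon list in this README.
EUK_ORDER = [
    "Mammalia", "Vertebrata", "Arthropoda", "Ecdysozoa", "Lophotrochozoa",
    "Metazoa", "Fungi", "Viridiplantae", "Sar", "Excavata", "Amoebazoa",
    "Eukaryota", "Bacteria", "Archaea", "Viruses",
]


def parse_lineage(taxonomic_lineage):
    """Strip the trailing '(rank)' from each comma-separated lineage entry."""
    names = []
    for entry in taxonomic_lineage.split(","):
        name = re.sub(r"\s*\([^()]*\)\s*$", "", entry.strip())
        if name:
            names.append(name)
    return names


def broad_taxon(lineage, order=EUK_ORDER):
    """Return the first group in `order` that appears in the lineage, else None."""
    members = set(lineage)
    for group in order:
        if group in members:
            return group
    return None
```

Under this reading, a lineage containing both `Mammalia` and `Vertebrata` resolves to `Mammalia`, because `Mammalia` comes first in the list.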
conda Environments
The pipeline uses a variety of conda environments to manage software dependencies. The major conda environments are:
- `cartography_tidy`: used to run the pipeline. Includes only dependencies necessary to start the snakemake pipeline, which builds additional environments as needed based on each rule.
- `cartography_dev`: used for development. Includes all dependencies for every rule of the snakemake pipeline and Python package dependencies together in one environment, plus dependencies for development support (e.g. `jupyter`, `ipython`) and experimental features not yet implemented in the main pipeline (e.g. `pytorch`).
- `cartography_pub`: used to run the Jupyter notebooks in the `pub/` directory. Includes just the dependencies needed to run the notebooks.

Please see the contributing guidelines for more information.