
ProteinCartography


The ProteinCartography pipeline searches sequence and structure databases for matches to input proteins and builds maps of protein space for the purposes of discovery and exploration.

You can find a general overview of the pipeline in the Arcadia pub for this pipeline. The results of a 25-protein meta-analysis of top-studied human proteins can be found on Zenodo.


Purpose

Comparing protein structures across organisms can help us generate interesting biological hypotheses. This pipeline allows users to build interactive maps of structurally similar proteins useful for discovery and exploration.

Our pipeline starts with user-provided protein(s) of interest and searches the available sequence and structure databases for matches. Using the full list of matches, we can build a "map" of all the similar proteins and look for clusters of proteins with similar features. Overlaying a variety of different parameters such as taxonomy, sequence divergence, and other features onto these spaces allows us to explore the features that drive differences between clusters.

Because this tool is based on global structural comparisons, note that the results are not always useful for long proteins (>1200 amino acids), multi-domain proteins, or proteins with large unstructured regions. Additionally, while the results for average-length, well-structured proteins generally appear as expected, we have not yet comprehensively validated the clustering parameters, so users may find that different parameters work better for their specific analyses.


Quickstart

The ProteinCartography pipeline supports Linux and macOS; it does not work on Windows. It also requires that you have conda or mamba installed. If you have an M-series Mac, you will need to install an x86-64 version of conda; see below for tips on how to do this.

  1. Clone the GitHub repository.
    git clone https://github.com/Arcadia-Science/ProteinCartography.git
  2. Install conda and/or mamba if you don't already have them installed.
  3. Create a conda environment containing the software needed to run the pipeline. Run this code from within the ProteinCartography repo:
    conda env create -f envs/cartography_tidy.yml -n cartography_tidy
    conda activate cartography_tidy
  4. Run the Snakemake pipeline using a demo protein (human ACTB, P60709). Set n to be the number of cores you'd like to use for running the pipeline.
    snakemake --snakefile Snakefile --configfile demo/search-mode/config_actin.yml --use-conda --cores n
  5. Inspect results. In the demo/output/final_results/ directory, you should find the following files:
    • actin_aggregated_features.tsv: metadata file containing protein feature hits
    • actin_aggregated_features_pca_umap.html: interactive UMAP scatter plot of results
    • actin_aggregated_features_pca_tsne.html: interactive t-SNE scatter plot of results
    • actin_leiden_similarity.html: mean cluster TM-score similarity heatmap
    • actin_semantic_analysis.html and actin_semantic_analysis.pdf: simple semantic analysis of clusters

Compute Specifications

We have successfully run the pipeline on macOS and Amazon Linux 2 machines with at least 8 GB of RAM and 8 cores.

For the data generated for our pub, we ran the pipeline on an AWS EC2 instance of type t2.2xlarge (32 GiB of RAM and 8 vCPUs).

Running the pipeline on an M-series Mac

To run the pipeline on an M-series Mac (with arm64 architecture), you will need to install an x86-64 version of conda. One way to do this is to install Miniconda using the x86-64 .pkg installer from the Miniconda website. Your Mac should recognize that the installer is for x86-64 and automatically use Rosetta 2 to run it. Alternatively, you can run the conda installer script from the command line with the arch -x86_64 command. For example:

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
arch -x86_64 /usr/bin/env bash Miniconda3-latest-MacOSX-x86_64.sh

The x86 version of conda will automatically install x86-64 versions of all packages. If you have already installed conda on your M-series Mac, be careful to install the x86 version in a different location. To check that you are using the correct version of conda, you can run conda info and look for the platform field in the output. It should say osx-64.


Modes

The pipeline supports two modes: Search and Cluster. Both modes are implemented in the same Snakefile. The mode in which the pipeline runs is controlled by the mode parameter in the config.yml file.
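
For reference, a hypothetical minimal config sketch is shown below. Only mode, key_protids, features_file, and features_override_file are parameter names that appear elsewhere in this README; the other key names are illustrative, so check the demo configs (e.g. demo/search-mode/config_actin.yml) for the actual schema.

mode: search              # "search" or "cluster"
analysis_name: actin      # illustrative key name
input_dir: input          # illustrative key name
output_dir: output        # illustrative key name
# Cluster mode additionally uses, e.g.:
# key_protids: [P60709]
# features_file: input/uniprot_features.tsv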

Search Mode

In this mode, the pipeline starts with a set of input proteins of interest in PDB and FASTA format and performs broad BLAST and Foldseek searches to identify hits. The pipeline aggregates all hits, downloads PDBs, and builds a map.

(Figure: Search-mode rule graph)

Inputs

Walkthrough

  1. Follow steps 1-3 in the Quickstart section above to set up and activate the conda environment.
  2. Set up a config.yml file specifying input and output directories and an analysis name.
  3. Add query protein files in FASTA and PDB format to the input folder.
    • You can pull down proteins with a UniProt accession number and AlphaFold structure using the following command, replacing {accession} with your UniProt accession (e.g. P24928):
      python ProteinCartography/fetch_accession.py -a {accession} -o input -f fasta pdb
      This saves a FASTA file from UniProt and a PDB file from AlphaFold to the input/ folder.
    • For non-UniProt proteins, users can provide a FASTA file (one entry per file).
      • If the protein is <400aa, the pipeline can generate a custom PDB file using the ESMFold API.
      • If the protein is >400aa, you should provide a PDB file with a matching prefix, which you can generate using ColabFold or other software.
  4. Run this command, replacing the config.yml with your config file and n with the number of cores you want to allocate to Snakemake.
    snakemake --configfile config.yml --use-conda --cores n

Cluster Mode

In this mode, the pipeline starts with a folder containing PDBs of interest and performs just the clustering and visualization steps of the pipeline, without performing any searches or downloads.

(Figure: Cluster-mode rule graph)

Inputs

Walkthrough

  1. Follow steps 1-3 in the Quickstart section above to set up and activate the conda environment.
  2. Set up a config.yml file specifying input and output directories and an analysis name.
  3. Add input protein structures in PDB format to your input folder.
    • The pipeline does not yet support PDBs with multiple chains.
  4. Make a uniprot_features.tsv file. You can generate this in one of two ways: if you have a list of UniProt accessions, you can provide them as a .txt file and automatically pull down the correct annotations (4.1); alternatively, you can manually generate the file (4.2).
    • 4.1 Generating the file using fetch_uniprot_metadata.py
      • Create a uniprot_ids.txt file that contains a list of UniProt accessions, one per line.
      • Run the following code from the base of the GitHub repository. Modify the input folder path to your input folder.
        python ProteinCartography/fetch_uniprot_metadata.py --input uniprot_ids.txt --output input/uniprot_features.tsv
    • 4.2 Manually generating the file (a minimal sketch follows this walkthrough)
      • For each protid in the dataset, gather the relevant protein metadata.
      • The first column of the file should be the unique protein identifier, with protid as the column name.
      • Add additional columns for relevant metadata, as described in Feature file main columns.
      • Default columns that are missing will be ignored.
  5. Run this command, replacing the config.yml with your config file and n with the number of cores you want to allocate to Snakemake.
    snakemake --configfile config.yml --use-conda --cores n
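
For the manual route (step 4.2), a minimal pandas sketch along these lines can produce a valid features file; the column names follow the "Feature file main columns" table later in this README, and the values are illustrative.

import pandas as pd

# Illustrative metadata for two proteins; protid must be the first column.
features = pd.DataFrame([
    {"protid": "P42212", "Protein names": "Green fluorescent protein",
     "Organism": "Aequorea victoria (Water jellyfish)", "Length": 238},
    {"protid": "P60709", "Protein names": "Actin, cytoplasmic 1",
     "Organism": "Homo sapiens (Human)", "Length": 375},
])
features.to_csv("input/uniprot_features.tsv", sep="\t", index=False)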

Directory Structure


Pipeline Overview

The Search mode of the pipeline performs all of the following steps. The Cluster mode starts at the "Clustering" step.

Protein Folding

  1. Fold input FASTA files using ESMFold.
    • The pipeline starts with input FASTA and/or PDB files, one entry per file, with a unique protein identifier (protid).
    • If a matching PDB file is not provided, the pipeline will use the ESMFold API to generate a PDB file for FASTA sequences shorter than 400aa.
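
The README doesn't pin down the exact call, but single-sequence folding through the public ESM Atlas API looks roughly like the sketch below; treat the endpoint URL and usage as assumptions and defer to the pipeline's own code.

import requests

def fold_short_sequence(sequence: str, out_path: str) -> None:
    # Assumed public ESM Atlas endpoint; returns the predicted structure
    # as PDB-formatted text in the response body.
    response = requests.post(
        "https://api.esmatlas.com/foldSequence/v1/pdb/",
        data=sequence,
        timeout=300,
    )
    response.raise_for_status()
    with open(out_path, "w") as handle:
        handle.write(response.text)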

Protein Search

  1. Search the AlphaFold databases using queries to the Foldseek webserver API for each provided .pdb file.

    • The pipeline searches afdb50, afdb-proteome, and afdb-swissprot for each input protein and aggregates the results. You can add or remove databases using the config file.
    • Currently, the pipeline is limited to retrieving a maximum of 1000 hits per protein, per database.
  2. Search the non-redundant GenBank/RefSeq database using blastp for each provided .fasta file.

    • The pipeline takes the resulting hits and maps each GenBank/RefSeq accession to a UniProt ID using requests and the UniProt REST API (a minimal sketch follows this list).
    • Note: this step can fail for large proteins (>700aa) due to remote BLAST CPU limits. To work around this, you can run BLAST manually (locally or via the webserver) and create an accession list file named {protid}.blast_hits.refseq.txt in the output/blast_results/ directory.
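
A minimal sketch of the RefSeq-to-UniProt mapping in step 2, using the public UniProt ID-mapping REST endpoints (this mirrors the general approach; the pipeline's own script may batch and handle errors differently):

import time
import requests

def refseq_to_uniprot(accessions):
    # Submit an ID-mapping job from RefSeq protein accessions to UniProtKB.
    run = requests.post(
        "https://rest.uniprot.org/idmapping/run",
        data={"from": "RefSeq_Protein", "to": "UniProtKB",
              "ids": ",".join(accessions)},
    )
    run.raise_for_status()
    job_id = run.json()["jobId"]

    # Poll until the job is no longer running, then fetch the results.
    status_url = f"https://rest.uniprot.org/idmapping/status/{job_id}"
    while requests.get(status_url).json().get("jobStatus") == "RUNNING":
        time.sleep(2)
    results = requests.get(
        f"https://rest.uniprot.org/idmapping/results/{job_id}").json()

    # When mapping to UniProtKB, each hit's "to" field is a full entry.
    return [hit["to"]["primaryAccession"] for hit in results["results"]]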

Download Data

  1. Aggregate the list of Foldseek and BLAST hits from all input files into a single list of UniProt IDs.

  2. Download annotation and feature information for each hit protein from UniProt.

  3. Filter proteins based on key UniProt metadata. The pipeline removes:

    • Proteins marked as fragments by UniProt.
    • Deleted or outdated proteins.
    • Proteins outside the user-specified minimum and maximum size parameters set in the config.yml file.
  4. Download a .pdb file for each remaining protein from AlphaFold (a minimal sketch follows this list).

    • This step is relatively slow, as downloads are parallelized only across the number of cores provided to Snakemake.
    • Downloads sometimes fail, possibly due to rate limits. Restarting Snakemake with the --rerun-incomplete flag usually resolves this.
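
The downloads in step 4 amount to fetching one file per accession from the AlphaFold database. A minimal sketch (the model-version suffix, v4 here, is an assumption that changes between AlphaFold releases):

import os
import requests

def download_alphafold_pdb(accession: str, out_dir: str = "output/pdbs") -> str:
    # AlphaFold DB serves predicted structures at a stable URL pattern.
    url = f"https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-model_v4.pdb"
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{accession}.pdb")
    with open(path, "w") as handle:
        handle.write(response.text)
    return path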

Clustering

  1. Generate a similarity matrix and cluster all protein .pdb files using Foldseek.

    • By default, foldseek search masks results with an e-value > 0.001. We set these masked values to 0.
    • By default, foldseek search returns at most 1000 hits per protein. For spaces where there are >1000 proteins, this results in missing comparisons. We also set missing values to 0.
    • When all hit proteins are highly similar, these settings can produce artefactually steep cutoffs in TM-score in the visualization. A TM-score of 0 in the scatter plot does not mean there is no structural similarity; more likely, that protein was simply not among the top 1000 hits. In the future, we plan to perform this filtering via e-value and to re-calculate the TM-score for every input protein versus every hit for a more accurate visualization. A minimal sketch of the matrix construction follows this list.
  2. Perform dimensionality reduction and clustering on the similarity matrix.
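
A minimal sketch of the matrix construction in step 1, assuming foldseek search was run with --format-output "query,target,alntmscore" (the pipeline's actual column set may differ). Masked and missing pairs simply never appear in the results table, so they are filled with 0:

import pandas as pd

hits = pd.read_csv("foldseek_results.tsv", sep="\t",
                   names=["query", "target", "alntmscore"])
matrix = (hits.pivot_table(index="query", columns="target",
                           values="alntmscore", aggfunc="max")
          .fillna(0))

# Reindex so rows and columns cover the same protid set, then save in the
# tab-separated matrix (MTX) convention described under File Conventions.
protids = matrix.index.union(matrix.columns)
matrix = matrix.reindex(index=protids, columns=protids, fill_value=0)
matrix.to_csv("all_by_all_tmscore.tsv", sep="\t")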

Data Analysis and Aggregation

A note about nomenclature: the meaning of key_protid in the filenames below is mode-dependent. In Search mode, the 'key' protids are simply the input protids; in Cluster mode, they are the protids specified in the key_protids list in the config.yml file.

  1. Generate a variety of *_features.tsv files.

    • Each file has, as its first column, a list of protein ids (protid) that are shared between files.
    • We query UniProt to get all metadata from that service as a uniprot_features.tsv file.
    • Foldseek generates a struclusters_features.tsv file.
    • We perform Leiden clustering to generate a leiden_features.tsv file.
    • We use Foldseek to calculate the TM-score for the input protids (in Search mode) or the key protids (in Cluster mode) versus all of the protids, generating <key_protid>_tmscore_features.tsv files.
    • We extract from the Foldseek search results the fraction sequence identity between each input/key protid and every other protid, saved as <key_protid>_fident_features.tsv files.
    • We subtract the fraction sequence identity from the TM-score to generate <key_protid>_concordance_features.tsv files.
    • We record the source of each protein in the analysis (whether it was found via BLAST, Foldseek, or both) in the source_features.tsv file.
  2. Aggregate features.

    • All of the *_features.tsv files are combined into one large aggregated_features.tsv file.
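
A minimal sketch of this aggregation step, assuming the per-table file names shown above:

import functools
import pandas as pd

feature_files = ["uniprot_features.tsv", "struclusters_features.tsv",
                 "leiden_features.tsv", "source_features.tsv"]
tables = [pd.read_csv(name, sep="\t") for name in feature_files]

# Outer-merge on the shared protid column so proteins absent from one
# table still appear in the aggregate.
aggregated = functools.reduce(
    lambda left, right: left.merge(right, on="protid", how="outer"), tables)
aggregated.to_csv("aggregated_features.tsv", sep="\t", index=False)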

Plotting

  1. Calculate per-cluster structural similarities.

    • For every Leiden or Foldseek structural cluster of proteins, get the mean structural similarity within that cluster and between that cluster and every other cluster.
    • Plot it as a heatmap with suffix _leiden_similarity.html or _structural_similarity.html.
  2. Perform simple semantic analysis on UniProt annotations.

    • For the annotations in each cluster, aggregate them and count the frequency of every full annotation string.
    • Perform a word count analysis on all of the annotations and generate a word cloud.
    • Save as a PDF file with suffix _semantic_analysis.pdf.
  3. Perform simple statistical tests on the aggregated features and create a violin plot.

    • For every feature, perform a Mann-Whitney U test comparing the cluster containing each input protein against every other cluster (in Search mode), or each cluster against every other cluster (in Cluster mode).
    • Save the results as a _distribution_analysis.svg image file.
  4. Build an explorable HTML visualization using Plotly based on the aggregated features.

    • An example can be found here

    • Each point has hover-over information.

    • Default parameters include:

      • Leiden Cluster: Protein cluster as called by scanpy's implementation of Leiden clustering.
      • Annotation Score: UniProt annotation score from 0 to 5 (5 being the best evidence).
      • Broad Taxon: the broad taxonomic category for the source organism. There are two modes: 'euk' and 'bac'.
        • Assigns the "smallest" taxonomic scope from the rankings below for each point, so a mouse gene would get Mammalia but not Vertebrata (a minimal sketch follows this list).
        • For 'euk', uses the taxonomic groups Mammalia, Vertebrata, Arthropoda, Ecdysozoa, Lophotrochozoa, Metazoa, Fungi, Viridiplantae, Sar, Excavata, Amoebozoa, Eukaryota, Bacteria, Archaea, Viruses
        • For 'bac', uses the taxonomic groups Pseudomonadota, Nitrospirae, Acidobacteria, Bacillota, Spirochaetes, Cyanobacteria, Actinomycetota, Deinococcota, Bacteria, Archaea, Viruses, Metazoa, Fungi, Viridiplantae, Eukaryota
      • Length: length of the protein in amino acids.
      • Source: how the protein was added to the clustering space (blast, foldseek or blast+foldseek).
    • Power users can customize the plots using a variety of rules, described below.
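
For concreteness, the Broad Taxon assignment described above can be read as "first match wins" against the ordered group list. A hedged sketch (not the pipeline's actual code):

# Ordered 'euk' ranking from the text, most specific groups first.
EUK_RANKING = ["Mammalia", "Vertebrata", "Arthropoda", "Ecdysozoa",
               "Lophotrochozoa", "Metazoa", "Fungi", "Viridiplantae",
               "Sar", "Excavata", "Amoebozoa", "Eukaryota",
               "Bacteria", "Archaea", "Viruses"]

def broad_taxon(lineage, ranking=EUK_RANKING):
    # Return the first (smallest-scope) ranked group found in the lineage.
    for group in ranking:
        if group in lineage:
            return group
    return "Other"  # assumed fallback for unmatched lineages

broad_taxon(["cellular organisms", "Eukaryota", "Metazoa",
             "Vertebrata", "Mammalia"])  # -> "Mammalia"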


Plotting Rules for plot_interactive()

The plot_interactive() function has two required arguments:

The plotting_rules dictionary should have the following format: each column to plot is a key in the dictionary, mapped to a dictionary of rules for that column.

{
    'column1.name': {
        'type': 'categorical',
        'parameter1': value,
        'parameter2': value,
        ...
    },
    'column2.name': {
        'type': 'hovertext',
        ...
    }
}
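
For instance, a concrete plotting_rules built from the default columns described above might look like this (the cluster column name is hypothetical; only the 'type' values come from this README):

plotting_rules = {
    'LeidenCluster': {'type': 'categorical'},   # hypothetical column name
    'Length': {'type': 'continuous'},
    'Lineage': {'type': 'taxonomic'},
    'Protein names': {'type': 'hovertext'},
}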

The possible rules for each column are as follows:

For any plot type

For 'continuous' plots

For 'categorical' and 'taxonomic' plots

For 'taxonomic' plots


File Conventions

The pipeline generates a large number of .txt and .tsv files with specific formatting expectations. Many of the pipeline's scripts accept these formats as input or produce them as output. These are the primary formats and their descriptions.

Accession list files (ACC)

These files end with '.txt' and contain a list of accessions (RefSeq, GenBank, UniProt), one per line.

Matrix files (MTX)

These files end with '.tsv' and contain distance or similarity matrices, usually all-vs-all.

Features files (FTF)

These files end with '.tsv' and contain a protid column, which is the unique identifier of each protein in the dataset. The remaining columns are metadata for each protein. These metadata can be of any data type.

Feature file main columns

A variety of metadata features for each protein are usually pulled from UniProt for visualization purposes. An example features_file.tsv is provided as part of the repo.

If you are providing a set of custom proteins (such as those not fetched from UniProt) when using the Search mode, you may want to include a features_override.tsv file that contains these features for your proteins of interest. This will allow you to visualize your protein correctly in the interactive HTML map. You can specify the path to this file using the features_override_file parameter in config.yml.

When using Cluster mode, you should provide protein metadata in a uniprot_features.tsv file; specify the path to this file using the features_file parameter in the config file.

Note that the override_file parameter also exists in Cluster mode. The difference between features_file and override_file is that the former is used as the base metadata file (replacing the uniprot_features.tsv file normally retrieved from UniProt), whereas the latter is loaded after the base metadata file, replacing any information pulled from the features_file. In Search mode, you can therefore use the override_file parameter to correct errors in metadata provided by UniProt or to replace values for specific columns in the uniprot_features.tsv file retrieved by the pipeline.

Whether you are providing custom proteins through override_file (in either mode) or base metadata through features_file (in Cluster mode), you should strive to include the default columns in the table below. Features used for color schemes in the default plotting rules are marked with (Plotting); features used only for hover-over descriptions are marked with (Hovertext).

| feature | example | description | source |
| --- | --- | --- | --- |
| "protid" | "P42212" | (Required) the unique identifier of the protein. Usually the UniProt accession, but can be any alphanumeric string. | User-provided or UniProt |
| "Protein names" | "Green fluorescent protein" | (Hovertext) a human-readable description of the protein. | UniProt |
| "Gene Names (primary)" | "GFP" | (Hovertext) a gene symbol for the protein. | UniProt |
| "Annotation" | 5 | (Plotting) UniProt annotation score, from 0 to 5 (5 being the best evidence). | UniProt |
| "Organism" | "Aequorea victoria (Water jellyfish) (Mesonema victoria)" | (Hovertext) scientific name (common name) (synonyms). | UniProt |
| "Taxonomic lineage" | "cellular organisms (no rank), Eukaryota (superkingdom), ... Aequoreidae (family), Aequorea (genus)" | comma-separated string of "lineage name (rank)" entries for the organism's full taxonomic lineage. | UniProt |
| "Lineage" | ["cellular organisms", "Eukaryota", ... "Aequoreidae", "Aequorea"] | (Plotting) ordered list of lineage identifiers without rank information, generated from "Taxonomic lineage". | ProteinCartography |
| "Length" | 238 | (Plotting) number of amino acids in the protein. | UniProt |


conda Environments

The pipeline uses a variety of conda environments to manage software dependencies. The major conda environments are:


Contributing

Please see the contributing guidelines for more information.