bigbio / py-pgatk

Python tools for proteogenomics analysis toolkit
Apache License 2.0

ProteoGenomics Analysis Toolkit


pypgatk is a Python library that is part of the ProteoGenomics Analysis Toolkit. It provides bioinformatics tools for proteogenomics data analysis.

Requirements:

The package requirements depend on how you install it; you need only one of the installation methods described below.

Installation

pip

You can install pypgatk with pip:

pip install pypgatk

Bioconda

You can install pypgatk with bioconda (if you haven't already, please set up conda and the bioconda channel first, as explained in the Bioconda documentation):

conda install pypgatk

Available as a container

You can use pypgatk already set up in a Docker container. Choose one of the available tags for the image and substitute it for <tag> in the command below.

docker pull quay.io/biocontainers/pypgatk:<tag>

NOTE: Biocontainers images do not have a latest tag, so a docker pull or docker run without an explicit tag will fail. For instance, a valid call (for version 0.0.2) would be:

docker run -it quay.io/biocontainers/pypgatk:0.0.2--py_0

Inside the container, you can either use the Python interactive shell or the command line version (see below).
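If you script the container call, the explicit-tag requirement can be checked before Docker is invoked. The following is a minimal sketch, not part of pypgatk: the helper names image_reference and docker_run_command are hypothetical, and the only values taken from this README are the image name and the example tag.

```python
from typing import List

# Image name taken from the docker pull command in this README.
IMAGE = "quay.io/biocontainers/pypgatk"


def image_reference(tag: str) -> str:
    """Return 'image:tag', refusing an empty or 'latest' tag (hypothetical helper)."""
    tag = tag.strip()
    if not tag or tag == "latest":
        raise ValueError(
            "Biocontainers images have no 'latest' tag; pass an explicit tag"
        )
    return f"{IMAGE}:{tag}"


def docker_run_command(tag: str) -> List[str]:
    """Build the argument list for an interactive docker run (hypothetical helper)."""
    return ["docker", "run", "-it", image_reference(tag)]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run to execute it.
    print(" ".join(docker_run_command("0.0.2--py_0")))
```

Building the command as a list rather than a single string avoids shell quoting issues if it is later passed to subprocess.run.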

Use latest source code

Alternatively, to use the latest version, clone this repository, change into its directory, and run pip3 install . :

git clone https://github.com/bigbio/py-pgatk
cd py-pgatk
# you might want to create a virtualenv for pypgatk before installing
pip3 install .
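Whichever installation route you choose, you may want to confirm that the package is visible in the active environment. A minimal sketch using only the Python standard library (Python 3.8 or later); the function name installed_version is hypothetical, and it assumes the distribution is registered under the name pypgatk, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "pypgatk"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found:
        print(f"pypgatk {found} is installed")
    else:
        print("pypgatk is not installed in this environment")
```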

Usage

pypgatk combines multiple modules and tools into one framework. All commands are accessible through the command-line tool pypgatk_cli.py.

The library provides multiple commands to download, translate and generate protein sequence databases from reference and mutation genome databases.

$: pypgatk_cli -h

Usage: pypgatk [OPTIONS] COMMAND [ARGS]...

  This is the main tool that give access to all commands and options
  provided by the pypgatk

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  cbioportal-downloader    Command to download the the cbioportal studies
  cbioportal-to-proteindb  Command to translate cbioportal mutation data into
                           proteindb
  cosmic-downloader        Command to download the cosmic mutation database
  cosmic-to-proteindb      Command to translate Cosmic mutation data into
                           proteindb
  dnaseq-to-proteindb      Generate peptides based on DNA sequences
  ensembl-check            Command to check ensembl database for stop codons,
                           gaps
  ensembl-downloader       Command to download the ensembl information
  generate-decoy           Create decoy protein sequences using multiple
                           methods DecoyPYrat, Reverse/Shuffled Proteins.
  generate-deeplc          Generate input for deepLC tool from idXML,mzTab or
                           consensusXML
  msrescore-configuration  Command to generate the msrescore configuration
                           file from idXML
  peptide-class-fdr        Command to compute the Peptide class FDR
  threeframe-translation   Command to perform 3'frame translation
  vcf-to-proteindb         Generate peptides based on DNA variants VCF files
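To work with the command list programmatically, for example to check that a given subcommand exists in the installed version, the help text can be parsed. The sketch below is an illustration, not part of pypgatk: parse_commands is a hypothetical helper that assumes the layout shown above (a "Commands:" header, two-space indented names, and deeper-indented continuation lines), and the main block assumes the pypgatk_cli executable is on your PATH.

```python
import re
import shutil
import subprocess


def parse_commands(help_text: str) -> dict:
    """Map each command name in the 'Commands:' section to its description."""
    commands = {}
    current = None
    in_section = False
    for line in help_text.splitlines():
        if not in_section:
            in_section = line.strip() == "Commands:"
            continue
        if not line.strip():
            break  # a blank line ends the section
        # A new entry: exactly two spaces, the name, then padding and text.
        match = re.match(r"^ {2}(\S+)(?:\s{2,}(.*))?$", line)
        if match:
            current = match.group(1)
            commands[current] = (match.group(2) or "").strip()
        elif current is not None:
            # Deeper-indented line: continuation of the previous description.
            commands[current] = (commands[current] + " " + line.strip()).strip()
    return commands


if __name__ == "__main__":
    exe = shutil.which("pypgatk_cli")  # assumption: installed and on PATH
    if exe is None:
        print("pypgatk_cli not found on PATH")
    else:
        result = subprocess.run(
            [exe, "-h"], capture_output=True, text=True, check=True
        )
        for name, description in parse_commands(result.stdout).items():
            print(f"{name}: {description}")
```

Parsing help output is fragile across versions, so treat this as a convenience for exploration rather than a stable interface.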

Full Documentation

https://pgatk.readthedocs.io/en/latest/pypgatk.html

Cite as

Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol. Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides. Bioinformatics, Volume 38, Issue 5, 1 March 2022, Pages 1470–1472. https://doi.org/10.1093/bioinformatics/btab838