bigbio / ibaqpy

Absolute quantification package based on quantms.io
https://quantms.org
MIT License
11 stars 8 forks source link
absolute-quantification ibaq mass-spectrometry proteomics quantification

ibaqpy

Python application Upload Python Package Codacy Badge PyPI version PyPI - Downloads

iBAQ (Intensity-Based Absolute Quantification) determines the abundance of a protein by dividing the total precursor intensities by the number of theoretically observable peptides of the protein manuscript here. ibaqpy is a Python package that computes iBAQ values starting from a feature parquet from quantmsio and a SDRF file. In addition, the package computes other ibaq values including rIBAQ, log2, and ppb.

ibaqpy also allows computing the TPA value (Total Protein Approach). TPA is determined by summing peptide intensities of each protein and then dividing by the molecular mass to determine the relative concentration of each protein. By using ProteomicRuler, it is possible to calculate the protein copy number and absolute concentration. The OpenMS tool was used to calculate the theoretical molecular mass of each protein. Similar to the calculation of IBAQ, the TPA value of protein-group was the sum of its intensity divided by the sum of the theoretical molecular mass.

Overview of ibaq-base values computation

As mentioned before, ibaq values are calculated by dividing the total precursor intensities by the number of theoretically observable peptides of the protein. We use the following steps to calculate the iBAQ values:

Note: If protein-group exists in the peptide intensity dataframe, the intensity of all proteins in the protein-group is summed based on the above steps, and then divided by the number of proteins in the protein-group.

Other normalized IBAQ values

From quantms to Ibaq values

Ibaq

The output of quantms is converted into quantms.io feature file. A feature in quantms.io is the combination of the following columns:

In summary, each feature is the unique combination of a peptide sequence including modifications (peptidoform), precursor charge state, condition, biological replicate, run, fraction, isotopic label type, and a given intensity. In order to go from these features into protein ibaq values, the package does the following:

Data preprocessing

In this sectionfeatures2peptides, ibaqpy will do:

Note: At the moment, ibaqpy computes the ibaq values only based on unique peptides. Shared peptides are discarded. However, if a group of proteins share the same unique peptides (e.g., Pep1 -> Prot1;Prot2 and Pep2 -> Prot1;Prot2), the intensity of the proteins is summed and divided by the number of proteins in the group.

Calculate the IBAQ Value

First, peptide intensity dataframe was grouped according to protein name, sample name and condition. The protein intensity of each group was summed. Due to the experimental type, the same protein may exhibit missing peptides in different samples, resulting in variations in the number of peptides detected for the protein across different samples. To handle this difference, normalization within the same group can be achieved by using the formula sum(peptides) / n(n represents the number of detected peptide segments). Finally, the normalized intensity of the protein is divided by the number of theoretical peptides.See details in peptides2proteins.

Note: In all scripts and result files, uniprot accession is used as the protein identifier.

How to install ibaqpy

Ibaqpy is available in PyPI and can be installed using pip:

pip install ibaqpy

You can install the package from code:

  1. Clone the repository:
>$ git clone https://github.com/bigbio/ibaqpy
>$ cd ibaqpy
  1. Install conda environment:
>$ mamba env create -f conda-environment.yaml
  1. Install ibaqpy:
>$ python setup.py install

Collecting intensity files from quantms.org

Absolute quantification files have been stored in the following url:

http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/absolute-expression/quantms-data/

Inside each project reanalysis folder, the folder proteomicslfq contains the msstats input file with the structure {Name of the project}.{Random uuid}.feature.parquet.

E.g. http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/absolute-expression/quantms-data/MSV000079033.1/MSV000079033.1-bd44c7e3-654c-444d-9e21-0f701d6dac94.feature.parquet

Major commands

Features to peptides

ibaqpy features2peptides -p tests/PXD003947/PXD003947-featrue.parquet -s tests/PXD003947/PXD003947.sdrf.tsv --remove_ids data/contaminants_ids.tsv --remove_decoy_contaminants --remove_low_frequency_peptides --output tests/PXD003947/PXD003947-peptides-norm.csv
Usage: features2peptides.py [OPTIONS]

Options:
  -p, --parquet TEXT              Parquet file import generated by quantms.io
  -s, --sdrf TEXT                 SDRF file import generated by quantms
  --min_aa INTEGER                Minimum number of amino acids to filter
                                  peptides
  --min_unique INTEGER            Minimum number of unique peptides to filter
                                  proteins
  --remove_ids TEXT               Remove specific protein ids from the
                                  analysis using a file with one id per line
  --remove_decoy_contaminants     Remove decoy and contaminants proteins from
                                  the analysis
  --remove_low_frequency_peptides
                                  Remove peptides that are present in less
                                  than 20% of the samples
  --output TEXT                   Peptide intensity file including other all
                                  properties for normalization
  --skip_normalization            Skip normalization step
  --nmethod TEXT                  Normalization method used to normalize
                                  feature intensities for tec
                                  (options: mean, median, iqr, none)
  --pnmethod TEXT                 Normalization method used to normalize
                                  peptides intensities for all samples
                                  (options: globalMedian,conditionMedian,none)
  --log2                          Transform to log2 the peptide intensity
                                  values before normalization
  --save_parquet                  Save normalized peptides to parquet
  --help                          Show this message and exit.

Compute IBAQ

ibaqpy peptides2proteins --fasta tests/PXD003947/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --peptides tests/PXD003947/PXD003947-peptides-norm.csv --enzyme Trypsin --output tests/PXD003947/PXD003947-ibaq-norm.csv --normalize --verbose

The command provides an additional flag for normalize IBAQ values.

Usage: peptides2proteins [OPTIONS]

Options:
  -f, --fasta TEXT     Protein database to compute IBAQ values
  -p, --peptides TEXT  Peptide identifications with intensities following the
                       peptide intensity output
  -e, --enzyme TEXT    Enzyme used during the analysis of the dataset
                       (default: Trypsin)
  -n, --normalize      Normalize IBAQ values using by using the total IBAQ of
                       the experiment
  --min_aa INTEGER     Minimum number of amino acids to consider a peptide
  --max_aa INTEGER     Maximum number of amino acids to consider a peptide
  -o, --output TEXT    Output file with the proteins and ibaq values
  --verbose            Print addition information about the distributions of
                       the intensities, number of peptides remove after
                       normalization, etc.
  --qc_report TEXT     PDF file to store multiple QC images
  --help               Show this message and exit.

Compute TPA

python compute_tpa --fasta Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --organism 'human' --peptides PXD003947-peptides.csv --ruler --ploidy 2 --cpc 200 --output PXD003947-tpa.tsv --verbose
ibaqpyc tpa --help
Usage: tpa [OPTIONS]

  Compute the protein copy numbers and concentrations according to a file output of peptides with the
  format described in peptide_normalization.py.

  :param fasta: Fasta file used to perform the peptide identification
  :param peptides: Peptide intensity file
  :param organism: Organism source of the data
  :param ruler: Whether to compute protein copy number, weight and concentration.
  :param ploidy: Ploidy number
  :param cpc: Cellular protein concentration(g/L)
  :param output: Output format containing the TPA values, protein copy numbers and concentrations
  :param verbose: Print addition information about the distributions of the intensities, 
                  number of peptides remove after normalization, etc.
  :param qc_report: PDF file to store multiple QC images

Options:
  -f, --fasta TEXT      Protein database to compute IBAQ values  [required]
  -p, --peptides TEXT   Peptide identifications with intensities following the peptide intensity output  [required]
  -m, --organism        Organism source of the data.
  -r, --ruler           Calculate protein copy number and concentration according to ProteomicRuler
  -n, --ploidy          Ploidy number (default: 2)
  -c, --cpc             Cellular protein concentration(g/L) (default: 200)
  -o, --output TEXT     Output format containing the TPA values, protein copy numbers and concentrations
  --verbose             Print addition information about the distributions of the intensities, 
                        number of peptides remove after normalization, etc.
  --qc_report           PDF file to store multiple QC images (default: "TPA-QCprofile.pdf")
  --help                Show this message and exit.

Calculate the Cellular Protein Copy Number and Concentration

The protein copy calculation follows the following formula:

protein copies per cell = protein MS-signal *  (avogadro / molecular mass) * (DNA mass / histone MS-signal)

For cellular protein copy number calculation, the uniprot accession of histones was obtained from species first, and the molecular mass of DNA was calculated. Then the dataframe was grouped according to different conditions, and the copy number, molar number and mass of proteins were calculated. In the calculation of protein concentration, the volume is calculated according to the cell protein concentration first, and then the protein mass is divided by the volume to calculate the intracellular protein concentration.

How to cite ibaqpy

Wang H, Dai C, Pfeuffer J, Sachsenberg T, Sanchez A, Bai M, Perez-Riverol Y. Tissue-based absolute quantification using large-scale TMT and LFQ experiments. Proteomics. 2023 Oct;23(20):e2300188. doi: 10.1002/pmic.202300188. Epub 2023 Jul 24. PMID: 37488995.

Credits