bigbio / proteomics-sample-metadata

The Proteomics sample metadata: Standard for experimental design annotation in proteomics datasets
GNU General Public License v2.0

Proteomics Sample Metadata Format


Improving metadata annotation of Proteomics datasets

Metadata is essential in proteomics data repositories and crucial for interpreting and reanalyzing the deposited datasets. While the general dataset description and standard data file formats are supported and captured for every dataset by ProteomeXchange partners, the information linking samples to data files is mostly missing. Recently, members of the European Bioinformatics Community for Mass Spectrometry (EuBIC, https://eubic-ms.org/) created this open-source project to enable the standardization of sample metadata in public proteomics datasets.

The Proteomics Sample Metadata Project aims to standardize the way ProteomeXchange (PX) partners and the proteomics community capture the relation between the samples and the data generated within a PX submission. We have adapted the MAGE-TAB v1.1 format to capture the metadata necessary for proteomics experiments and to allow automated reprocessing. MAGE-TAB (MicroArray Gene Expression Tabular) is a file format for storing the metadata and sample information of transcriptomics experiments. By repurposing and extending MAGE-TAB for proteomics, we aim to provide a format for future submissions of multiomics experiments to ProteomeXchange partners and better integration with other omics data. MAGE-TAB is divided into two main files: IDF (Investigation Description Format) and SDRF (Sample and Data Relationship Format). We describe below how these two files are adapted for proteomics.

Our goal is to ensure maximum reusability of the deposited data. Our work aims to define the minimum information required to report the experimental design of proteomics experiments, enabling the use and reuse of the deposited data by the proteomics community. The following Use Cases should be considered to design the Proteomics Sample Metadata Format:

IDF

ProteomeXchange resources developed a file format called submission.px which captures the same information as the MAGE-TAB IDF. We have developed a set of tools to automatically translate from submission.px to IDF.

SDRF (SDRF-Proteomics)

While the general experiment description is captured for all PX submissions, the sample-to-data information is missing (or not standardized) for all PX datasets. Standardizing the SDRF (within MAGE-TAB) for proteomics is the main objective of this project (read more about SDRF-Proteomics).
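To make the sample-to-data relationship concrete, here is a minimal Python sketch that reads a small tab-delimited SDRF-like table and groups data files by sample. The column names (`source name`, `comment[data file]`), the sample rows, and the helper `files_per_sample` are illustrative assumptions and are not defined in this README; consult the SDRF-Proteomics specification for the actual required columns.

```python
import csv
import io
from collections import defaultdict

# Hypothetical SDRF excerpt: the column names and values below are
# illustrative assumptions, not content taken from this README.
EXAMPLE_SDRF = (
    "source name\tcharacteristics[organism]\tassay name\tcomment[data file]\n"
    "sample 1\tHomo sapiens\trun 1\tsample1_fraction1.raw\n"
    "sample 1\tHomo sapiens\trun 2\tsample1_fraction2.raw\n"
    "sample 2\tHomo sapiens\trun 3\tsample2_fraction1.raw\n"
)


def files_per_sample(sdrf_text, sample_col="source name",
                     file_col="comment[data file]"):
    """Group data file names by sample from tab-delimited SDRF text."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    mapping = defaultdict(list)
    for row in reader:
        # Each row links one sample to one data file.
        mapping[row[sample_col]].append(row[file_col])
    return dict(mapping)


if __name__ == "__main__":
    for sample, files in files_per_sample(EXAMPLE_SDRF).items():
        print(sample, "->", ", ".join(files))
```

Each row ties one sample to one data file, so a sample measured in several runs simply appears on several rows.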

Final PSI specification

The final HUPO-PSI specification is: SDRF HUPO-PSI

How to contribute

External contributors, researchers and the proteomics community are more than welcome to contribute to this project.

Contribute to the specification: you can contribute ideas or refinements by opening an issue in the issue tracker or submitting a pull request (PR).

In the annotated projects folder, users can see the public datasets that contributors have annotated so far. If you would like to join these efforts, fork this repo and open a pull request (PR) with your annotated project. If you don't have a project in mind, you can pick one from the issues and annotate it.

Annotate a dataset in 5 steps:

To validate your SDRF, install the sdrf-pipelines Python tool:

pip install sdrf-pipelines

Then validate the SDRF file:

parse_sdrf validate-sdrf --sdrf_file PXD020294.sdrf.tsv

You can read more about the validator here.
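If you want to run the validation from a script (for example, over several annotated projects), the command above can be wrapped in Python. This is a minimal sketch: the helper names `build_validate_command` and `validate_sdrf` are hypothetical, it only reuses the `parse_sdrf validate-sdrf --sdrf_file` invocation shown above, and it makes no assumption about what exit code the validator returns on errors, so inspect the captured output yourself.

```python
import shutil
import subprocess


def build_validate_command(sdrf_path):
    """Return the validator command from this README as an argument list."""
    return ["parse_sdrf", "validate-sdrf", "--sdrf_file", sdrf_path]


def validate_sdrf(sdrf_path):
    """Run the validator and return (exit_code, combined_output).

    How the exit code relates to validation errors is not documented
    here, so callers should also read the returned output.
    """
    if shutil.which("parse_sdrf") is None:
        raise RuntimeError(
            "parse_sdrf not found; install it with: pip install sdrf-pipelines"
        )
    result = subprocess.run(
        build_validate_command(sdrf_path),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr


if __name__ == "__main__":
    try:
        code, output = validate_sdrf("PXD020294.sdrf.tsv")
    except RuntimeError as err:
        # The validator is not installed in this environment.
        print(err)
    else:
        print("exit code:", code)
        print(output)
```

Passing the command as a list avoids shell quoting problems when file paths contain spaces.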

30 Minutes Guide to MAGE-TAB for Proteomics

We have created a 30-minute guide to the file format in the GitHub repository. Additionally, the following materials are relevant for new users:

Core contributors and collaborators

The project is run by different groups:

IMPORTANT: If you contribute to this specification, please make sure to add your name to the list of contributors.

Code of Conduct

As part of our efforts toward delivering open and inclusive science, we follow the Contributor Covenant Code of Conduct for Open Source Projects.

How to cite

Copyright notice

This information is free; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This information is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this work; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.