straussmaximilian commented 2 years ago

DeepScore: Community curated scoring, supercharged with AI

(Image taken from https://twitter.com/afterglow2046/status/1197271037009973251) The trained deep learning classifier predicts that this image of a dog is an image of Harrison Ford with a 99% probability. The probability that the image of the dog is NOT Harrison Ford is predicted to be 1%. If Harrison Ford is a target and NOT Harrison Ford a decoy, how would this example translate to proteomics?

Abstract

The state of the art for identification in mass-spectrometry-based proteomics is to generate “nonsense” decoy data and to determine a score cutoff based on a false discovery rate (FDR). Nowadays, machine learning (ML) and deep learning (DL) algorithms are used to learn how to optimally distinguish targets from decoys. While this drastically increases sensitivity, it comes at the cost of explainability: human-chosen acceptance criteria are replaced with black-box models. This can hinder acceptance in clinical practice, e.g., in peptidomics, where upregulated proteins are examined by inspecting raw data peaks. Other DL-driven domains allow straightforward human validation of the models, e.g., imaging, speech, text, or inspecting a predicted structure from AlphaFold; for proteomics, this is much more challenging. Here, we aim to explore the limitations of the current scoring approach and provide potential solutions. We revisit the idea of confidence by trying to artificially increase identifications with nonsense features, hard decoys, or leaked data. Next, we will build an interactive tool to validate identifications manually and assign human confidence scores. With this, we create a training dataset and build an ML or DL model to rescore identifications and assign predicted human-level confidence scores.
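To make the target-decoy idea concrete, here is a minimal sketch of how a score cutoff could be derived from a false discovery rate. It assumes the simple estimator FDR = decoys / targets above the cutoff, which is one common convention and not necessarily what any particular search engine uses. The function name and the toy scores are hypothetical.

```python
def score_cutoff_at_fdr(scores, is_decoy, fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= fdr, or None.

    Assumption: FDR is estimated as (#decoys) / (#targets) among all hits
    scoring at or above the cutoff. Ties between scores are not treated
    specially in this sketch.
    """
    ranked = sorted(zip(scores, is_decoy), key=lambda pair: pair[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, decoy in ranked:
        if decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest point in the ranked list that still meets the FDR.
        if targets > 0 and decoys / targets <= fdr:
            cutoff = score
    return cutoff


# Toy example: six hits, two of them decoys.
toy_scores = [10, 9, 8, 7, 6, 5]
toy_is_decoy = [False, False, False, True, False, True]
cutoff_25 = score_cutoff_at_fdr(toy_scores, toy_is_decoy, fdr=0.25)
cutoff_01 = score_cutoff_at_fdr(toy_scores, toy_is_decoy, fdr=0.01)
```

An ML rescoring step only changes the `scores` going into such a function, so anything that lets the model separate decoys from targets for the wrong reasons directly lowers the cutoff and inflates the number of accepted targets.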

Project Plan

The hackathon will be organized similarly to a SCRUM process. First, we create a user story map to agree on a prioritized list of things we would like to do. Next, we will assess the workload for each task with story points and, depending on the team size and skillset, make a sprint plan for what we can achieve in the given timeframe. Some of the potential tasks could be:

Assessing Confidence

Hacks: Feed random numbers as additional features to an ML scoring system and investigate the performance
Hard decoys: Provide harder decoys and investigate the performance
Data Leakage: Gradually leak training data to a scoring system and investigate the performance
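As a rough illustration of why such experiments are informative, the sketch below builds a scorer that only memorizes its training labels. The single feature is a random number, so there is nothing to learn, yet the scorer looks perfect when evaluated on the data it was trained on. Everything here (the PSM representation, the class name, the 1-nearest-neighbour lookup) is a hypothetical stand-in and not taken from AlphaPept or any other tool.

```python
import random


def make_psms(n, rng):
    """Hypothetical PSMs: (random feature, label), label True = target, False = decoy."""
    return [(rng.random(), rng.random() < 0.5) for _ in range(n)]


class MemorizingScorer:
    """1-nearest-neighbour lookup, i.e. a model that memorizes its training labels."""

    def fit(self, psms):
        self.memory = list(psms)
        return self

    def predict(self, feature):
        nearest = min(self.memory, key=lambda psm: abs(psm[0] - feature))
        return nearest[1]


def accuracy(model, psms):
    return sum(model.predict(feature) == label for feature, label in psms) / len(psms)


rng = random.Random(0)
train, held_out = make_psms(500, rng), make_psms(500, rng)
model = MemorizingScorer().fit(train)

train_acc = accuracy(model, train)        # evaluated on leaked (training) data
held_out_acc = accuracy(model, held_out)  # evaluated on unseen data
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {held_out_acc:.2f}")
```

The gap between the two numbers is the signal to look for: if performance collapses to chance on held-out data, the apparent gain came from leakage and not from learning.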

Interactive Tool

Frontend that shows raw data and accepts user input to assign confidence scores
Backend with database or functionality to merge multiple user sessions
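For the backend, merging several user sessions could be as simple as the sketch below. It assumes each session is a mapping from an identification ID to a numeric confidence score; the data layout, the IDs, and the function name are hypothetical, and a real implementation might sit on top of a database instead.

```python
from collections import defaultdict
from statistics import mean


def merge_sessions(sessions):
    """Merge per-user annotations into one record per identification.

    `sessions` maps a user name to {identification_id: confidence_score}.
    Returns {identification_id: {"mean": average score, "n": number of votes}}.
    """
    collected = defaultdict(list)
    for annotations in sessions.values():
        for ident_id, score in annotations.items():
            collected[ident_id].append(score)
    return {
        ident_id: {"mean": mean(scores), "n": len(scores)}
        for ident_id, scores in collected.items()
    }


# Two hypothetical curators who partially overlap in what they inspected.
sessions = {
    "alice": {"psm1": 4, "psm2": 1},
    "bob": {"psm1": 2},
}
merged = merge_sessions(sessions)
```

Keeping the vote count `n` next to the mean makes it possible to later weight or filter identifications that only one person has looked at.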

Deep-learning Score

Extract raw data identifications
Train a model based on the human-supplied confidence scores
Perform rescoring on existing studies
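A minimal sketch of the training step, using plain logistic regression in pure Python as a stand-in for the PyTorch or AlphaPeptDeep models we would actually consider. The two features and the accept/reject labels are invented toy values, and a binary label is a simplification of a human confidence score.

```python
import math


def predict(weights, bias, x):
    """Predicted probability that a human curator would accept the identification."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression with batch gradient descent on the log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y
            grad_b += error
            for j, value in enumerate(x):
                grad_w[j] += error * value
        bias -= lr * grad_b / n_samples
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical features per identification, both scaled to [0, 1]:
# (fraction of matched fragments, normalized mass error).
features = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = accepted by the curator, 0 = rejected

weights, bias = train_logistic(features, labels)
good = predict(weights, bias, [0.85, 0.15])
bad = predict(weights, bias, [0.15, 0.85])
```

Rescoring an existing study would then amount to computing the same features for its identifications and applying `predict` to each of them.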

Technical Details

The default language should be Python, but we are open to anything that gets the job done better. The tools mentioned below are suggestions – there are a lot of great tools out there that I am not aware of, and we should collect options and decide what to use at the beginning of the hackathon.

GitHub to host the code and manage the project board
AlphaPept https://github.com/MannLabs/alphapept to use an existing search engine that can be hacked
AlphaTims https://github.com/MannLabs/alphatims for raw data accession
AlphaPeptDeep to build ML/DL models https://github.com/MannLabs/alphapeptdeep (This uses PyTorch) or following the tutorials at https://www.proteomicsml.org
Organizing & tracking ML tasks with: https://neptune.ai/home
To collect data for manual curation we could use MongoDB https://www.mongodb.com
Frontend: https://streamlit.io
I use Visual Studio Code with Anaconda and have a couple of DL/ML environments ready to go.

Hardware

• Laptop; a decent GPU is a plus (set up drivers etc. for compute)
• There is always Google Colab as a fallback
• For heavier workloads I have access to a high-performance cluster
• Alternatively, we could rent compute on Amazon or a similar provider

Datasets

There are a lot of datasets out there we could use, but we will probably narrow this down once we have discussed the scope.

Feasibility

I have some preliminary data with human-curated data and some preliminary tools, so we could start with existing code or start from scratch. Most of the modules can be worked on in parallel.

Contact Information

Maximilian Strauss
maximilian.strauss@cpr.ku.dk or mstrauss@biochem.mpg.de
Mann Group
Novo Nordisk Foundation Center for Protein Research
University of Copenhagen

Feel free to reply to this issue with questions or comments. Thanks!

tobiasko commented 2 years ago

Dear @straussmaximilian ,

I am happy to inform you that your proposal has been selected for the DevMeeting2023! Participants will decide which hackathon to join after the pitch on Monday.

Best, Tobi

tobiasko commented 1 year ago

Hello everyone,

I just created a Slack workspace for the DevMeeting and a channel named deepscore for this hack. You should receive an invite to join by email.

Best, Tobi

straussmaximilian commented 1 year ago

Summary:

DeepScore: Community curated scoring, supercharged with AI

Machine learning (ML) and, in particular, deep learning (DL) can be successfully applied across the entire analysis pipeline in MS-based proteomics. We investigated potential pitfalls that can arise when they are applied incorrectly. After building a dummy decoy classifier that seemingly boosts identifications yet is actually not able to learn, we systematically assessed how potential data leakage could be identified. We found that comparing decoy and false-positive distributions is an indicative feature for detecting leaked decoy information. Moreover, we found that a Kolmogorov-Smirnov-type test allows sensitive quantification of this. Our investigation also revealed that up to 5% of identified peptides with oxidation of methionine are reassigned when searched again without allowing this modification, calling into question the set false discovery rate of 1%. Additionally, our manual inspection of 1618 timsTOF identifications from a DIA-NN search with a 5% FDR revealed that approximately 34% of the results would not be considered confidently identified by human inspection. Lastly, we investigated the potential of clustering as an unsupervised method for distinguishing targets from decoys in scoring.
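For readers who want to try the distribution comparison, here is a minimal sketch of a two-sample Kolmogorov-Smirnov statistic in plain Python. It is not the exact test used during the hackathon. The two input lists stand for decoy scores and false-positive scores, and how the false-positive scores are obtained is left open here.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The largest gap between two step functions occurs at one of the data points.
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical score lists: identical distributions give 0, disjoint ones give 1.
same = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

If decoys are a fair model of false positives, the statistic should stay close to 0; a value that grows after rescoring would suggest that the model has picked up decoy-specific information.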

EuBIC / EuBIC2023

DeepScore: Community curated scoring, supercharged with AI #14
