Open straussmaximilian opened 2 years ago
Dear @straussmaximilian ,
I am happy to inform you that your proposal has been selected for the DevMeeting2023! Participants will decide which hackathon to join after the pitch on Monday.
Best, Tobi
Hello everyone,
I just created a Slack workspace for the DevMeeting and a channel named deepscore for this hack. You should receive an invite to join by email.
Best, Tobi
Summary:
Machine learning (ML) and, in particular, deep learning (DL) can be successfully applied across the entire analysis pipeline in MS-based proteomics. We investigated potential pitfalls that can arise when these methods are applied incorrectly. Motivated by a dummy decoy classifier that seemingly boosts identifications yet is not actually able to learn, we systematically assessed how potential data leakage could be identified. We found that comparing the decoy and false positive distributions is an indicative way to detect leaked decoy information, and that a Kolmogorov-Smirnov-type test allows sensitive quantification of the difference. Our investigation also revealed that up to 5% of identified peptides with oxidation of methionine are reassigned when searched again without allowing this modification, calling into question the set false discovery rate (FDR) of 1%. Additionally, our manual inspection of 1618 timsTOF identifications from a DIA-NN search with a 5% FDR revealed that approximately 34% of the results would not be considered confidently identified by human inspection. Lastly, we investigated the potential of clustering as an unsupervised method for distinguishing targets from decoys in scoring.
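To illustrate the Kolmogorov-Smirnov-type comparison mentioned in the summary, here is a minimal sketch in plain Python. It is not the code used during the hackathon: the function name `ks_statistic` and the toy score lists are hypothetical, and it assumes that false positive targets can be labeled by some independent means, which the summary does not specify. The idea is that if decoys behave like false positive targets, their score distributions should overlap and the statistic stays small; a large value hints that the scoring model has learned something decoy-specific.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the two empirical cumulative distributions."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    d_max = 0.0
    # The supremum is attained at one of the observed data points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max


# Hypothetical toy scores, for illustration only.
decoy_scores = [0.10, 0.15, 0.20, 0.25, 0.30]
false_positive_scores_ok = [0.12, 0.17, 0.22, 0.27, 0.32]
false_positive_scores_leaky = [0.50, 0.55, 0.60, 0.65, 0.70]

d_ok = ks_statistic(decoy_scores, false_positive_scores_ok)
d_leaky = ks_statistic(decoy_scores, false_positive_scores_leaky)
print(f"similar distributions: D = {d_ok:.2f}")
print(f"separated distributions: D = {d_leaky:.2f}")
```

In practice one would also want a p-value or a permutation-based threshold for the statistic; the sketch only computes the distance itself.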
DeepScore: Community curated scoring, supercharged with AI
(Image taken from https://twitter.com/afterglow2046/status/1197271037009973251) The trained deep learning classifier predicts that this image of a dog is an image of Harrison Ford with a 99% probability. The probability that the image of the dog is NOT Harrison Ford is predicted to be 1%. If Harrison Ford is a target and NOT Harrison Ford a decoy, how would this example translate to proteomics?
Abstract
The state of the art for identification in mass-spectrometry-based proteomics is to generate "nonsense" decoy data and to determine a score cutoff based on a false discovery rate. Nowadays, machine learning (ML) and deep learning (DL) algorithms are used to learn how to optimally distinguish targets from decoys. While this drastically increases sensitivity, it comes at the cost of explainability, and human-chosen acceptance criteria are replaced with black-box models. This can hinder acceptance in clinical practice, e.g., in peptidomics, where upregulated proteins are examined by inspecting raw data peaks. Other DL-driven domains allow straightforward human validation of the models, e.g., imaging, speech, text, or inspecting a predicted structure from AlphaFold; for proteomics, this is much more challenging. Here, we aim to explore the limitations of the current scoring approach and provide potential solutions. We revisit the idea of confidence by trying to artificially increase identifications with nonsense features, hard decoys, or leaking data. Next, we will build an interactive tool to validate identifications manually and assign human confidence scores. With this, we create a training dataset and build an ML or DL model to rescore identifications and assign predicted human-level confidence scores.
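As a concrete picture of the target-decoy cutoff described in the abstract, here is a minimal sketch in plain Python. It assumes the simple estimator FDR ≈ decoys / targets among all identifications above a score and ignores tied scores; real search engines use refined variants (e.g., q-values), and the function name `score_cutoff_at_fdr` and the toy data are hypothetical.

```python
def score_cutoff_at_fdr(scores, is_decoy, fdr_threshold=0.01):
    """Return the lowest score at which the estimated FDR
    (decoys / targets among all hits at or above that score)
    is still within the threshold, or None if no cutoff qualifies."""
    ranked = sorted(zip(scores, is_decoy), key=lambda pair: pair[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = None
    for score, decoy in ranked:
        if decoy:
            decoys += 1
        else:
            targets += 1
        # Assumed estimator: each accepted decoy stands for one false target.
        if targets > 0 and decoys / targets <= fdr_threshold:
            cutoff = score
    return cutoff


# Hypothetical toy identifications, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
is_decoy = [False, False, False, True, False, True]

cutoff = score_cutoff_at_fdr(scores, is_decoy, fdr_threshold=0.25)
print("score cutoff:", cutoff)
```

Any feature that lets a model recognize decoys directly lowers the decoy count above the cutoff and therefore inflates identifications without improving correctness, which is the failure mode this proposal wants to probe.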
Project Plan
The hackathon will be organized similarly to a SCRUM process. First, we create a user story map to agree on a prioritized list of things we would like to do. Next, we will assess the workload for each task with story points and, depending on the team size and skillset, make a sprint plan for what we can achieve in the given timeframe. Some of the potential tasks could be:
Assessing Confidence
Interactive Tool
Deep-learning Score
Technical Details
The default language should be Python, but we are open to anything that gets the job done better. The tools mentioned below are suggestions; there are a lot of great tools out there that I am not aware of, and we should collect options and decide what to use at the beginning of the hackathon.
Hardware
• Laptop; a decent GPU is a plus (set up drivers etc. for compute)
• There is always Google Colab as a fallback
• For heavier workloads I have access to a high-performance cluster
• Alternatively we could rent on Amazon or related.
Datasets
There are a lot of datasets out there we could use, but we will probably narrow this down once we have discussed the scope.
Feasibility
I have some preliminary data with human-curated data and some preliminary tools, so we could start with existing code or start from scratch. Most of the modules can be worked on in parallel.
Contact Information
Maximilian Strauss
maximilian.strauss@cpr.ku.dk or mstrauss@biochem.mpg.de
Mann Group
Novo Nordisk Foundation Center for Protein Research
University of Copenhagen
Also feel free to reply to this issue with questions or comments. Thanks!