biosustain / OpenMS

The codebase of the OpenMS project
https://www.openms.de

SpectrumExtractor::MatchSpectrum #4

Open dmccloskey opened 7 years ago

dmccloskey commented 7 years ago

Objectives

Feature plan

Pre-existing OpenMS classes that may be of help:

dmccloskey commented 6 years ago

Other resources from Hanne:

./src/topp/SpecLibSearcher.cpp

./src/utils/MetaboliteSpectralMatcher.cpp

https://github.com/OpenMS/OpenMS/pull/2874 -- FeatureFinderMetaboIdent

Sebastian Boecker paper: https://www.biorxiv.org/content/early/2017/02/17/109389

Spectral comparison functions: https://github.com/OpenMS/OpenMS/tree/develop/src/openms/source/COMPARISON/SPECTRA

timosachsenberg commented 6 years ago

Hi @dmccloskey, I recently rewrote the underlying classes of SpecLibSearcher. When I looked at the code, I realized that it should be easily applicable to metabolomics, which makes MetaboliteSpectralMatcher a bit superfluous. There are some minor differences in supported file formats, though: MetaboliteSpectralMatcher only supports an in-house mzML file as its database, whereas SpecLibSearcher tries to support more of the existing database formats but is currently more focused on proteomics. I think it would be a good idea to consider SpecLibSearcher, as it has the better underlying data structures. It is also reasonably fast with the new data structures (about 250,000 comparisons/s on our node).

dmccloskey commented 6 years ago

Hi @timosachsenberg, OK, we will base the matchSpectrum method on the functionality of SpecLibSearcher. Do you have a list of the input file formats that SpecLibSearcher supports? It would be interesting for us to see whether they match what is provided by, e.g., the NIST database.

timosachsenberg commented 6 years ago

I recently gave it a try when I improved the core data structures, and I realized that reading spectral databases is still a weak part of OpenMS (the formats are also not very well standardized). It should not be too much work to get this working, but it probably requires some additional parsing code. I just have not found the time to do so yet, but I could certainly provide some help.

dmccloskey commented 6 years ago

Hi @timosachsenberg, we should have some time to tackle this problem. Do you mind giving us an overview of what is required to parse the spectral databases? We can also set up a Skype meeting to go over it if there are too many technical aspects to discuss in the comments.

timosachsenberg commented 6 years ago

Hmm, I honestly don't know. I think we currently don't have access to the NIST spectra, as these seem to be only commercially available. Happy to have a quick Skype session this week.

dmccloskey commented 6 years ago

@timosachsenberg: I have a copy of it actually. I am very unfamiliar with the file formats, so I am not sure which files correspond to the correct format.

A Skype session would be great. Tentative goals for the meeting:

  1. OpenMS support for various spectral library DB formats: support for .msp format; confirmed that NIST spectra can be exported into this format
  2. OpenMS methods to read and write spectral library DBs: https://github.com/OpenMS/OpenMS/blob/develop/src/openms/include/OpenMS/FORMAT/MSPFile.h
  3. OpenMS informatics methods for matching compound spectra: (a) discrete spectrum representation (e.g., BinnedSpectrum); (b) peak matching between spectra
  4. OpenMS scoring algorithms: cosine similarity
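
For reference, items 3 and 4 above can be sketched in a few lines. This is a minimal, library-independent illustration in Python, not the OpenMS implementation: `bin_spectrum`, `cosine_similarity`, the default `bin_size` of 1.0, and the peak lists are hypothetical names and values chosen for the example.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_size=1.0):
    """Sum (mz, intensity) peaks into fixed-width m/z bins (sparse dict)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_size)] += intensity
    return dict(binned)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse binned spectra."""
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


# Made-up peak lists, for illustration only.
query = bin_spectrum([(73.05, 100.0), (147.07, 40.0), (218.10, 15.0)])
library = bin_spectrum([(73.04, 90.0), (147.06, 50.0), (219.11, 10.0)])
score = cosine_similarity(query, library)
```

Only peaks that fall into the same bin contribute to the score, so the choice of bin width directly controls how tolerant the comparison is to small m/z shifts.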

dmccloskey commented 6 years ago

Examples from GC-MS flux applications

pcolaianni commented 6 years ago

Reformatting the info:

The samples were derivatized with MSTFA. I got a NIST match for:

dmccloskey commented 6 years ago

.MSP format for metabolomics

Required fields

Optional fields

pcolaianni commented 6 years ago

dmccloskey commented 6 years ago

Use cases for MatchSpectrum

  1. GC-MS (MS1)
  2. GC-MS/MS (MS1 and MS2)
  3. LC-MS (MS1)
  4. LC-MS/MS (MS1 and MS2)

Treatment of precursor and product spectra for each use case

  1. precursor = 0, disable mass check, retain RT check
  2. need to see an example of an mzML file taken from a GC-MS/MS experiment
  3. same as 1
  4. work as is
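
One way to read the per-use-case treatment above is as a pre-filter that decides which library spectra get scored at all. The sketch below is an interpretation, not existing code: `is_candidate`, its parameters, and the tolerance values (and their units) are hypothetical, and it assumes that the query spectrum carries a precursor m/z of 0 in the MS1-only cases and that library entries have a retention time.

```python
def is_candidate(query_rt, query_precursor_mz, lib_rt, lib_precursor_mz,
                 rt_tolerance=10.0, mz_tolerance=0.5):
    """Decide whether a library spectrum should be scored against the query."""
    # The RT check is retained in every use case.
    if abs(query_rt - lib_rt) > rt_tolerance:
        return False
    # Use cases 1 and 3 (MS1 only): precursor is 0, so the mass check is skipped.
    if query_precursor_mz == 0.0:
        return True
    # Use case 4 (LC-MS/MS): compare precursor masses as usual.
    return abs(query_precursor_mz - lib_precursor_mz) <= mz_tolerance
```

Use case 2 (GC-MS/MS) is deliberately not covered, since it still depends on seeing an example mzML file.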

MatchSpectra algorithm testing

  1. report a large number of spectra (e.g., the top 100 matches) to see if the expected compound is among them
  2. create parameters for bin_size and bin_spread (e.g., bin_size = 1.0, bin_spread = 0.0)
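
The first test idea can be automated with a small helper that ranks the library hits and reports where the expected compound lands. This is only a sketch: `top_hits`, `rank_of`, and the scores are hypothetical, and the scores are assumed to be "higher is better".

```python
def top_hits(scores, n=100):
    """Return the n best (name, score) pairs, highest score first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


def rank_of(expected_name, scores, n=100):
    """1-based rank of expected_name among the top n hits, or None if absent."""
    for rank, (name, _score) in enumerate(top_hits(scores, n), start=1):
        if name == expected_name:
            return rank
    return None


# Made-up scores, for illustration only.
scores = {"compound A": 0.91, "compound B": 0.97, "compound C": 0.42}
```

Running the same check for several bin_size and bin_spread settings would show which combination ranks the expected compound highest.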

dmccloskey commented 6 years ago

Targeted mode matchSpectra

  1. look up the TraML file compound name in the provided .msp spectral library using the "name" attribute
  2. the name is not guaranteed to be unique, so if multiple spectra are found, the average spectral score will be reported
  3. calculate a score between the user-provided spectrum and each matching spectrum in the .msp spectral library (averaged if multiple spectra are found)
  4. report the spectral match in the FeatureMap using the metaValues "library_spectral_score_average", "library_spectral_score_n_spectra", "library_spectral_score_stdev", and "library_spectral_comment"
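
The aggregation in steps 2 to 4 could look roughly like the sketch below. The metaValue keys are the ones listed above; everything else is an assumption for illustration: `summarize_library_scores` is a hypothetical helper, names are matched exactly (case-sensitive), the population standard deviation is used, and "library_spectral_comment" is left out because its content is not defined here.

```python
from statistics import mean, pstdev


def summarize_library_scores(compound_name, library_scores):
    """Aggregate the scores of all library spectra whose name matches exactly.

    library_scores: list of (name, score) pairs, one per library spectrum.
    Returns None when the name is not in the library.
    """
    matching = [score for name, score in library_scores if name == compound_name]
    if not matching:
        return None
    return {
        "library_spectral_score_average": mean(matching),
        "library_spectral_score_n_spectra": len(matching),
        # Population standard deviation is an assumption; it is 0.0 for one spectrum.
        "library_spectral_score_stdev": pstdev(matching),
    }
```

Reporting the number of spectra next to the average makes it visible when a name was ambiguous in the library.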

Check for the case of e.g., "Hexestrol"

Untargeted mode matchSpectra

  1. report spectral matches and scores using the mzTab format (match spectra)
  2. create a new FeatureMap that reports the annotated compound name, retention time, and quality score for matches whose score is above a certain threshold (new method, not yet implemented)

dmccloskey commented 6 years ago

Problems and Solutions