dmccloskey commented 7 years ago

Objectives

[x] method to match a spectrum to a spectrum database
[ ] support for the mzIdentML format to report match
[x] integration with 3rd party databases (e.g., NIST GC-MS database)
[ ] test support with example data

Feature plan

What is the format of the DB?
What support does OpenMS have for reading the DB format?
How can we modify existing classes to get our functionality?

Pre-existing OpenMS classes that may be of help:

SpectraSTAdapter
SpecLibSearcher (peptides)
MetaboliteSpectralMatcher (compounds)

dmccloskey commented 6 years ago

Other resources from Hanne:

./src/topp/SpecLibSearcher.cpp
./src/utils/MetaboliteSpectralMatcher.cpp
https://github.com/OpenMS/OpenMS/pull/2874 -- FeatureFinderMetaboIdent

Sebastian Boecker paper https://www.biorxiv.org/content/early/2017/02/17/109389

Spectral comparison functions: https://github.com/OpenMS/OpenMS/tree/develop/src/openms/source/COMPARISON/SPECTRA

SpectrumCheapDPCorr.cpp
SpectraSTSimilarityScore.cpp
BinnedSpectralContrastAngle.cpp

timosachsenberg commented 6 years ago

Hi @dmccloskey, I recently rewrote the underlying classes of SpecLibSearcher. When I looked at the code, I realized that it should be easily applicable to metabolomics, which renders MetaboliteSpectralMatcher a bit superfluous. There are some minor differences, though, in terms of supported file formats: MetaboliteSpectralMatcher only supports an in-house mzML file as database, whereas SpecLibSearcher tries to support more of the existing database formats but is currently more focused on proteomics. I think it would be a good idea to consider SpecLibSearcher, as it has the better underlying data structures. It is also reasonably fast with the new data structures (about 250 000 comparisons/s on our node).

dmccloskey commented 6 years ago

Hi @timosachsenberg, OK, we will base the matchSpectrum method on the functionality of SpecLibSearcher. Do you have a list of the input file formats that SpecLibSearcher supports? It would be interesting for us to see whether they match what is provided by, e.g., the NIST database.

timosachsenberg commented 6 years ago

I recently gave it a try when I improved the core data structures, and I realized that reading spectral databases is still a weak point in OpenMS (the formats are also not very well standardized). It should not be too much work to get this working, but it probably requires some additional parsing code. I have not found the time to do so yet, but I could certainly provide some help.

dmccloskey commented 6 years ago

Hi @timosachsenberg, we should have some time to tackle this problem. Do you mind giving us an overview of what is required to parse the spectral databases? We can also set up a Skype meeting to go over it if there are too many technical aspects to discuss in the comments.

timosachsenberg commented 6 years ago

Hmm, I honestly don't know. I think we currently don't have access to the NIST spectra, as these seem to be only commercially available. Happy to have a quick Skype session this week.

dmccloskey commented 6 years ago

@timosachsenberg: I have a copy of it, actually. I am very unfamiliar with the file formats, so I am not sure which files are in the correct format.

A Skype session would be great. Tentative goals for the meeting:

OpenMS support for various spectral library DB formats: support for .msp format; confirmed that NIST spectra can be exported into this format
OpenMS methods to read and write spectral library DBs: https://github.com/OpenMS/OpenMS/blob/develop/src/openms/include/OpenMS/FORMAT/MSPFile.h
OpenMS informatics methods for matching compound spectra: 1. discrete spectrum representation (e.g., BinSpectrum); 2. peak matching between spectra
OpenMS scoring algorithms: cosine similarity
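
To make the last two goals concrete, here is a minimal sketch of a binned spectrum representation scored with cosine similarity. It is plain Python for illustration only, not the OpenMS implementation; the function names `bin_spectrum` and `cosine_similarity` and the peak values are hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_size=1.0):
    """Sum the intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(mz // bin_size)] += intensity
    return dict(bins)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse binned spectra (0.0 to 1.0)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


# Made-up peak lists, not taken from any real library.
query = [(73.0, 100.0), (147.1, 40.0), (299.2, 15.0)]
library = [(73.1, 90.0), (147.0, 50.0), (357.1, 10.0)]
score = cosine_similarity(bin_spectrum(query), bin_spectrum(library))
print(f"cosine score: {score:.3f}")
```

Binning first makes peak matching a simple key lookup, at the cost of splitting peaks that fall on either side of a bin edge.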

dmccloskey commented 6 years ago

Examples from GC MS flux applications

"The samples was derivatized with MSTFA and e.g. I got a NIST match (score 683) for G3P at RT 13.30 min (NIST name: Phosphoric acid, bis(trimethylsilyl) 2,3-bis[(trimethylsilyl)oxy]propyl ester) and for G6P (NIST score 669) at RT 16.72 (NIST name d-Glucose, 2,3,4,5-tetrakis-O-(trimethylsilyl)-, o-methyloxime, 6-[bis(trimethylsilyl) phosphate])" -- Mette
GCMS full scan .mzXML file

pcolaianni commented 6 years ago

Reformatting the info:

The samples were derivatized with MSTFA. I got NIST matches for:

G3P (NIST score 683) at RT 13.30 min (NIST name: Phosphoric acid, bis(trimethylsilyl) 2,3-bis[(trimethylsilyl)oxy]propyl ester)
G6P (NIST score 669) at RT 16.72 min (NIST name d-Glucose, 2,3,4,5-tetrakis-O-(trimethylsilyl)-, o-methyloxime, 6-[bis(trimethylsilyl) phosphate])

dmccloskey commented 6 years ago

.MSP format for metabolomics

Required fields

Name
Comment
Num peaks

Optional fields

All other fields
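
As a rough illustration of the field requirements above, the following sketch parses MSP-style text and checks the three required fields. It is not the OpenMS MSPFile parser: the helper names `parse_msp` and `validate` are hypothetical, the field spelling follows the list above, and it assumes that peak lines start with a digit and hold "mz intensity" pairs separated by semicolons.

```python
def parse_msp(text):
    """Parse MSP-style text into a list of dicts with 'fields' and 'peaks'."""
    entries, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.lower().startswith("name:"):
            # Every Name field starts a new library entry.
            current = {"fields": {}, "peaks": []}
            entries.append(current)
        if current is None:
            continue  # ignore anything before the first Name field
        if line[0].isdigit():
            # Peak line: one or more "mz intensity" pairs separated by ';'.
            for pair in line.split(";"):
                parts = pair.split()
                if len(parts) >= 2:
                    current["peaks"].append((float(parts[0]), float(parts[1])))
        elif ":" in line:
            key, value = line.split(":", 1)
            current["fields"][key.strip().lower()] = value.strip()
    return entries


def validate(entry):
    """Raise ValueError if a required field is missing or the peak count is off."""
    fields = entry["fields"]
    missing = [k for k in ("name", "comment", "num peaks") if k not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if int(fields["num peaks"]) != len(entry["peaks"]):
        raise ValueError("Num peaks does not match the number of peaks read")


# Synthetic entry for illustration; not from a real database.
sample = """Name: Dummy compound A
Comment: synthetic test entry
Num peaks: 3
73 100; 147 40;
299 15
"""
for entry in parse_msp(sample):
    validate(entry)
```

All other fields are stored as-is in `fields`, which matches their optional status.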

pcolaianni commented 6 years ago

[x] show matches' synonyms

dmccloskey commented 6 years ago

Use cases for MatchSpectrum

1. GC-MS (MS1)
2. GC-MS/MS (MS1 and MS2)
3. LC-MS (MS1)
4. LC-MS/MS (MS1 and MS2)

Treatment of precursor and product Spectra for each use case

1. precursor = 0, disable mass check, retain RT check
2. need to see an example of an mzML file taken from a GC-MS/MS experiment
3. same as 1
4. work as is
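
One way to read the "precursor = 0" rule is as a candidate pre-filter that skips the precursor mass check for MS1-only data but still compares retention times. The sketch below assumes that library entries carry an RT at all, and the function name `is_candidate`, the dict keys, and the tolerance values are all hypothetical.

```python
def is_candidate(query, entry, mz_tol=0.5, rt_tol=0.2):
    """Return True if a library entry should be scored against the query.

    A query precursor m/z of 0 (MS1-only data) disables the mass check;
    the retention time check is always applied.
    """
    if query["precursor_mz"] != 0:
        if abs(query["precursor_mz"] - entry["precursor_mz"]) > mz_tol:
            return False
    return abs(query["rt"] - entry["rt"]) <= rt_tol


# MS1-only query (e.g., GC-MS full scan): the mass check is skipped.
ms1_query = {"precursor_mz": 0, "rt": 13.30}
# MS/MS query: both checks apply.
ms2_query = {"precursor_mz": 259.0, "rt": 5.10}

print(is_candidate(ms1_query, {"precursor_mz": 500.0, "rt": 13.35}))
print(is_candidate(ms2_query, {"precursor_mz": 300.0, "rt": 5.10}))
```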

MatchSpectra algorithm testing

report a large number of spectra (e.g., 100) to see if the expected compound is reported
create parameter for bin_size and bin_spread (e.g., bin_size = 1.0, bin_spread = 0.0)
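
The two testing ideas above can be sketched together: rank the whole library, return the top n hits, and expose bin_size and bin_spread as parameters. This is an illustrative stand-in rather than the MatchSpectrum code. It assumes bin_spread means the number of neighbouring bins on each side that also receive a peak's intensity, and the names `top_matches` and `cosine` and the dummy library are hypothetical.

```python
import math


def bin_spectrum(peaks, bin_size=1.0, bin_spread=0):
    """Bin (mz, intensity) peaks; add each intensity to bin_spread neighbours per side."""
    spread = int(bin_spread)
    bins = {}
    for mz, intensity in peaks:
        center = int(mz // bin_size)
        for idx in range(center - spread, center + spread + 1):
            bins[idx] = bins.get(idx, 0.0) + intensity
    return bins


def cosine(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_matches(query, library, n=100, bin_size=1.0, bin_spread=0):
    """Return up to n (score, name) pairs, best score first."""
    q = bin_spectrum(query, bin_size, bin_spread)
    scored = [
        (cosine(q, bin_spectrum(peaks, bin_size, bin_spread)), name)
        for name, peaks in library.items()
    ]
    return sorted(scored, reverse=True)[:n]


# Dummy library; names and peaks are invented for illustration.
library = {
    "compound A": [(73.6, 100.0), (147.4, 50.0)],
    "compound B": [(200.0, 100.0)],
}
query = [(73.4, 100.0), (147.6, 50.0)]
for score, name in top_matches(query, library, n=100, bin_size=1.0, bin_spread=0):
    print(f"{name}: {score:.3f}")
```

Reporting many hits makes it easy to see where the expected compound ranks while the bin parameters are still being tuned. Narrow bins with no spread can split peaks that differ only slightly in m/z.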

dmccloskey commented 6 years ago

Targeted mode matchSpectra

look up the TraML file compound name in the provided .msp spectral library using the "name" attribute
the name is not guaranteed to be unique, so if multiple spectra are found, the average spectral score will be reported
calculate a score between the user-provided spectrum and each matching .msp library spectrum (averaged if multiple spectra are found)
report the spectral match in the FeatureMap using the metaValues "library_spectral_score_average", "library_spectral_score_n_spectra", "library_spectral_score_stdev", and "library_spectral_comment"
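
The targeted steps above could be sketched as follows. This is an assumption-laden illustration, not the implemented method: the function name `targeted_match` is hypothetical, names are compared by exact string match, the standard deviation is the sample standard deviation (0.0 for a single hit), and comments from multiple hits are simply joined. Only the four metaValue keys come from the list above.

```python
import math
import statistics


def cosine(a, b):
    """Cosine similarity between two sparse binned spectra (dict bin -> intensity)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def targeted_match(compound_name, query_bins, library, score_fn=cosine):
    """Score the query against every library entry that shares the compound name.

    library: list of dicts with 'name', 'comment', and 'bins'.
    Returns the metaValues to store on the feature, or None if no entry matches.
    """
    hits = [entry for entry in library if entry["name"] == compound_name]
    if not hits:
        return None
    scores = [score_fn(query_bins, entry["bins"]) for entry in hits]
    return {
        "library_spectral_score_average": statistics.mean(scores),
        "library_spectral_score_n_spectra": len(scores),
        # Assumption: sample standard deviation, 0.0 when only one spectrum matched.
        "library_spectral_score_stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "library_spectral_comment": "; ".join(entry["comment"] for entry in hits),
    }


# Dummy library with a duplicated name; the spectra are invented.
library = [
    {"name": "Hexestrol", "comment": "entry 1", "bins": {107: 100.0, 135: 60.0}},
    {"name": "Hexestrol", "comment": "entry 2", "bins": {107: 80.0, 207: 30.0}},
]
meta = targeted_match("Hexestrol", {107: 100.0, 135: 50.0}, library)
print(meta)
```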

Check for the case of e.g., "Hexestrol"

Untargeted mode matchSpectra

report spectral matches and scores using the mzTab format (match spectra)
create a new FeatureMap that reports the annotated compound name, retention time, and quality score for matches whose score is above a certain threshold (new method, not yet implemented)
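
Since the second step is not implemented yet, the following is only a sketch of what the filtering could look like: keep the best library hit per query spectrum if its score is above the threshold, and emit one annotated record per spectrum. The function name `annotate_features`, the dict keys, and the default threshold are hypothetical.

```python
def annotate_features(matches, min_score=0.8):
    """Keep the best-scoring library hit per query spectrum, if above min_score.

    matches: iterable of dicts with 'spectrum_id', 'rt', 'name', and 'score'.
    Returns records with the compound name, retention time, and quality score,
    ordered by retention time.
    """
    best = {}
    for match in matches:
        if match["score"] <= min_score:
            continue  # only scores strictly above the threshold are reported
        current = best.get(match["spectrum_id"])
        if current is None or match["score"] > current["score"]:
            best[match["spectrum_id"]] = match
    return [
        {"name": m["name"], "rt": m["rt"], "quality": m["score"]}
        for m in sorted(best.values(), key=lambda m: m["rt"])
    ]


# Invented matches for two query spectra.
matches = [
    {"spectrum_id": "s2", "rt": 16.72, "name": "compound B", "score": 0.91},
    {"spectrum_id": "s1", "rt": 13.30, "name": "compound A", "score": 0.95},
    {"spectrum_id": "s1", "rt": 13.30, "name": "compound C", "score": 0.85},
    {"spectrum_id": "s3", "rt": 20.00, "name": "compound D", "score": 0.40},
]
for feature in annotate_features(matches):
    print(feature)
```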

dmccloskey commented 6 years ago

Problems and Solutions

problem: MSP file parsing speed was slow
solution: switching from vector to set, optimization of regular expressions (simpler and fewer), and removal of LOG_DEBUG
problem: spectral data for tests using Database X
solution: utilize a few dummy spectra instead of the Database X spectra

biosustain / OpenMS

SpectrumExtractor::MatchSpectrum #4
