HUPO-PSI / mzSpecLib

mzSpecLib: A standard format to exchange/distribute spectral libraries
http://www.psidev.info/mzSpecLib
Apache License 2.0

Refine and finalize metadata and CV terms #7

Open ypriverol opened 6 years ago

ypriverol commented 6 years ago

The current MSP and other spectral library formats capture only the metadata around each entry in the library (cluster, consensus spectra, peptides, small molecules), but not the way the spectral library was generated. We need to define a general metadata section at the beginning of the file to hold this information. Similar to mzTab, I think it would be great to have something like:

The MTD prefix tells readers that the line is a metadata field. The second column is the key of the metadata attribute and the third is its value.

The following fields can be reused from mzTab:

MTD   mzL-version   1.0.0      
MTD   title  Spectral Library Human from Peptide Atlas 
MTD   id     PXL00000001 
MTD   description Some description that can be used for example in the web about the library
MTD   instrument [MS, MS:1000703, LTQ Orbitrap,]
MTD   instrument [MS, MS:1000008, Velos Orbitrap,]
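
As a rough illustration of the three-column layout above, the following Python sketch reads `MTD` lines into a dictionary. The function name `parse_mtd_lines` is hypothetical, and splitting on runs of whitespace is an assumption made for this example (mzTab itself is tab-separated). Repeated keys such as `instrument` are collected into lists.

```python
from collections import defaultdict


def parse_mtd_lines(text):
    """Collect MTD key/value pairs; repeated keys keep every value in order."""
    metadata = defaultdict(list)
    for line in text.splitlines():
        # Assumption: columns are separated by runs of whitespace.
        parts = line.strip().split(None, 2)
        if len(parts) < 3 or parts[0] != "MTD":
            continue  # skip blank lines and anything that is not a metadata line
        _, key, value = parts
        metadata[key].append(value.strip())
    return dict(metadata)


example = """\
MTD   mzL-version   1.0.0
MTD   title  Spectral Library Human from Peptide Atlas
MTD   instrument [MS, MS:1000703, LTQ Orbitrap,]
MTD   instrument [MS, MS:1000008, Velos Orbitrap,]
"""

header = parse_mtd_lines(example)
print(header["title"])
print(header["instrument"])
```

Keeping every value as a list avoids silently dropping the second `instrument` line; a stricter reader could instead reject repeated keys for fields that are meant to be unique.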

Can we add to this issue all the fields we think are interesting or important to trace?

ypriverol commented 6 years ago

My list of attributes to define at the library level; I will add an * where I think an attribute is mandatory.

schymane commented 6 years ago

I'm not sure this is the right place for this comment but note NIST have the MSP but also the SDF format that stores their spectral information (I do not see this mentioned here yet) - although MSP seems to be the more common exchange one, the SDF has the advantage that the full structure AND the spectrum can be in it ... and this paper is a great example of using SDF to do NMR exchange: https://onlinelibrary.wiley.com/doi/abs/10.1002/mrc.4737

rsalek commented 6 years ago

Might be better to keep SDF (structure) separate, as the format should cover proteomics, metabolomics, and potentially other MS-based applications.

schymane commented 6 years ago

See parallel conversation for similar comments! https://github.com/MassBank/MassBank-web/issues/110 I am missing background to the discussions here for sure; have plenty of thoughts for small molecule side but not much idea of how proteomics handles this.

mwang87 commented 6 years ago

I'm not sure if this is addressed in the MassBank format or other formats, but one thing we try to track on the GNPS side is the provenance filename and scan number of where the reference spectrum came from. Though, it's not perfect in the record and maybe it is more appropriate for it to be tracked externally (which is done at GNPS), with those records referenced through an accession number.

ypriverol commented 6 years ago

@mwang87 this issue is to capture what we are planning to trace for the complete spectral library, not for the individual spectra. I have created another issue for the individual spectra and clusters: https://github.com/HUPO-PSI/SpectralLibraryFormat/issues/9

henryhlam commented 6 years ago

Let's make the largest list possible first (with each field marked as required/optional) and we can whittle it down. Maybe it is easier to edit a Google Doc together?

My quick thoughts below.

At library level, we need:

Format version (e.g. mzl 1.0)

A universal library identifier (similar to the universal spectrum identifier), e.g. mzlib:PXL0000100:NIST_cow_2018

Publisher/source, including Contact (e.g. NIST)

Publishing date (or library version or serial number)

Library name/descriptor

Software generating the library and version

Organisms (do we need this? if yes, we have to allow multiple organisms, or none at all)

All Modifications (do we need this at the library level? probably only necessary to define special mods not already in PSI-MOD or UNIMOD, or to shorten the tags in each library entry by defining them here)

Instrumentation/Fragmentation (similar. We need this at spectrum level anyway, as many libraries contain mixture of spectra from different instruments. Do we need it here?)

Comments

Provenance (From my experience, it is often necessary to modify/merge/filter libraries to create new custom ones. It would be nice to have a place to keep track of what has been done to the library, e.g. "This library was created from the NIST 2014 one by filtering for all tryptic peptides and merging it with a decoy library...". This can be put into the Comments field, but it may be useful to have a separate "Provenance" field.)
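
The library-level fields listed above could be sketched as a small record type. Everything here is hypothetical: the class name `LibraryMetadata`, the attribute names, the example values, and the split into required and optional fields are assumptions for illustration, not decisions made in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LibraryMetadata:
    # Assumed to be required for this sketch.
    format_version: str
    library_id: str
    name: str
    # Assumed to be optional; lists allow zero, one, or many values.
    publisher: Optional[str] = None
    contact: Optional[str] = None
    publication_date: Optional[str] = None
    software: Optional[str] = None
    organisms: List[str] = field(default_factory=list)
    modifications: List[str] = field(default_factory=list)
    instruments: List[str] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)
    # One entry per processing step applied to the library.
    provenance: List[str] = field(default_factory=list)


meta = LibraryMetadata(
    format_version="mzl 1.0",
    library_id="mzlib:PXL0000100:NIST_cow_2018",
    name="NIST cow 2018",  # illustrative value only
    publisher="NIST",
)
meta.provenance.append("Filtered for tryptic peptides")
meta.provenance.append("Merged with a decoy library")
print(meta)
```

Modelling `organisms` as a list that defaults to empty covers both "multiple organisms" and "none at all" without a placeholder value.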

For spectrum level, it is a lot more complicated. Maybe we should have a separate thread/doc for this.

schymane commented 6 years ago

Re: organisms - it should be designed flexibly to allow extra metadata, but not be too biologically focused, for instance. There are a lot of people who use spectral libraries who do not have any organism context. The MassBank requirement for a "natural / not natural" tag has caused many headaches for us environmental people because we never have the context (e.g. caffeine is a natural product but for us a chemical found in the environment) and such classifications are extremely hard to auto-classify from the wrong context... (i.e. please do not force people to provide information they may not have and force them instead to fill in "something" that is likely incorrect just to fill a field).

ypriverol commented 6 years ago

The google document is this one: https://docs.google.com/document/d/1LgSGtR_t5IcUS9rV7YtsLveDVX9X8KsOpU-5NJ4vuYI/edit?usp=sharing

This issue is to discuss the metadata at the level of the library; we have another issue, https://github.com/HUPO-PSI/SpectralLibraryFormat/issues/9, to discuss the metadata for the individual spectra.

schymane commented 6 years ago

My comment stands for both the individual spectrum and library level ... many spectral libraries will likewise not come from an organism ... although some may and in this case it would be valuable information to be captured. A more generic description may be more flexible? I added some comments to the doc.

ypriverol commented 6 years ago

@schymane The idea of having organisms, instruments, and modifications in the library metadata is for dedicated libraries where, for example, the library has been created/filtered for those properties. If that is not the case, then those properties should be captured at the spectrum level, because the number of species, instruments, and especially modifications in one library can be huge.

ypriverol commented 6 years ago

@henryhlam @edeutsch @sneumann @schymane I have updated the document with the new fields you provided that need to be captured at the library level. Please have a look here: https://docs.google.com/document/d/1LgSGtR_t5IcUS9rV7YtsLveDVX9X8KsOpU-5NJ4vuYI/edit

vrkosk commented 6 years ago

When a spectral library is searched, there are two fragment tolerances to consider: the user specifies the tolerance of the input data, and something must specify the tolerance of peaks in the library spectra. These two tolerances could easily differ (e.g. search 10ppm data against a 0.5Da library). It would be nice if fragment tolerance were a file-level attribute.
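
To make the two-tolerance situation concrete, this sketch converts ppm and Da tolerances into absolute m/z windows and accepts a match when the two peaks lie within the combined window. Summing the two tolerances is an assumption chosen for illustration; the thread does not say how a search engine should combine them, and the function names are hypothetical.

```python
def tolerance_in_da(mz, tolerance):
    """Convert a (value, unit) tolerance into an absolute window in Da at this m/z."""
    value, unit = tolerance
    if unit == "ppm":
        return mz * value / 1e6
    if unit == "Da":
        return value
    raise ValueError(f"unknown tolerance unit: {unit!r}")


def peaks_match(query_mz, library_mz, query_tol, library_tol):
    """Assumption: a match is allowed within the sum of both windows."""
    window = tolerance_in_da(query_mz, query_tol) + tolerance_in_da(library_mz, library_tol)
    return abs(query_mz - library_mz) <= window


query_tol = (10.0, "ppm")   # user-specified tolerance of the input data
library_tol = (0.5, "Da")   # file-level tolerance of the library spectra

print(peaks_match(500.3, 500.0, query_tol, library_tol))
print(peaks_match(500.3, 500.0, query_tol, (10.0, "ppm")))
```

With a 0.5 Da library the 0.3 m/z difference is accepted, while the same pair is rejected when both sides are at 10 ppm, which is why the library's own tolerance needs to be recorded somewhere.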

Also, it's very valuable to allow library entries to override file-level attributes. This allows specifying defaults (e.g. default instrument or default organism). It reduces metadata clutter and still lets you mix entries from different sources in the same file.
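
The override behaviour described above can be sketched with Python's `collections.ChainMap`, which looks up a key in the entry first and falls back to the file-level defaults. The attribute names, peptide sequences, and the `resolve` helper are hypothetical and only serve the example.

```python
from collections import ChainMap

# File-level defaults, declared once in the library header.
file_defaults = {"instrument": "LTQ Orbitrap", "organism": "Homo sapiens"}

# Entries only list attributes that differ from the defaults.
entries = [
    {"peptide": "PEPTIDEK"},
    {"peptide": "ELVISK", "instrument": "Velos Orbitrap"},
]


def resolve(entry, defaults):
    """Entry-level attributes shadow file-level defaults with the same key."""
    return ChainMap(entry, defaults)


resolved = [resolve(entry, file_defaults) for entry in entries]
for attrs in resolved:
    print(attrs["peptide"], attrs["instrument"], attrs["organism"])
```

The first entry inherits both defaults, while the second keeps the default organism but reports its own instrument.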

edeutsch commented 6 years ago

yes indeed. I put in precursor mass accuracy and fragment mass accuracy as desirable attributes in the library. Mass tolerance strikes me more as a software parameter and a user preference than an inherent property of a library or a spectrum. But I can see other opinions, too.

edeutsch commented 6 years ago

Hi everyone, I have updated the document to reflect notes that I have been taking and the overall direction of the document, which is metadata at all levels, not just the spectrum level or library level. I defined FOUR levels of metadata:

The spectrum level is somewhat further divided into merged spectra, individual spectra, and common to both merged and individual spectra.

I reorganized the document a little. I hope I didn't mess up anything in anyone's view. Have a look and see what you think.

vrkosk commented 6 years ago

I'm glad to see chimeric spectra taken into account. What about intact crosslinked peptides, e.g. disulfide bonds, or looplinked or cyclic peptides?

Protein-level data is inherently problematic in a peptide-centric spectral library. Example problems: there could be no parent protein (e.g. de novo identification); accession format must be restricted (if it's not restricted, it's a free text field); peptide sequence could appear in several locations in a protein; peptide is usually found in more than one protein; consensus spectra could have conflicting parent proteins; accession formats could differ between library entries and between libraries (and will differ if you compare or merge results with a database search). Shouldn't you also record the FASTA name and version if it was a database search? What about the protein description?

I realise this may be an unpopular view, but how about prohibiting protein-level metadata? Or at least move them out of peptide metadata and into the experiment-level metadata. Protein attributes help explain how the peptide was identified rather than being an inherent part of a peptide identification.

jgriss commented 6 years ago

I second @vrkosk's view that the protein level information should not be added directly. It is to be expected that spectral libraries will be merged quite heavily. Then, this will definitely cause issues. If we do not add the protein level information from the start, users and search engines will expect that they need to supply a FASTA database. In my opinion, this is the cleanest solution.

ypriverol commented 6 years ago

I also agree. I guess that, for biological entities, all information should remain at the peptide level, without protein information.

edeutsch commented 6 years ago

I agree that we should make sure that cross-linked peptides are supported. I think it is mostly already there with multiple simultaneous identifications already supported. But I suppose we need a flag to distinguish cases when the multiple peptides are chimeric vs. cross-linked. I will add that.

Regarding the encoding of proteins, I would disagree with the prevailing thought that we should prohibit protein information. I certainly agree that it should not be required, and I agree it could get a bit complex. But I suspect some people would like to encode that information and it seems to me that providing a standardized optional way of doing that is a better choice than attempting prohibition.

henryhlam commented 6 years ago

All,

My idealistic and pedantic side would agree with all of you that protein information should not be included. I also struggled with these inconsistency issues when I developed SpectraST.

However, I am with Eric in that keeping the protein information as an optional field is the way to go. In designing a format for everyone to use, we should value continuity and practical utility of the format over semantic purity. Many users of these formats are less into these issues and would just want a format that serves their needs. After all, that's why we started off at last year's PSI by deciding to see how we can evolve an existing format into something better, rather than tearing it up and starting from philosophical principles about what a library should be.

The fear is that if we define a "perfect" format that is too far from the existing ones, no one will want to rewrite all their existing code just to fit the new format. So I would advocate a more flexible format with many optional fields, which can accommodate most use cases and let all existing tools have an easier time switching over. This means we want it to capture most of the useful features of the existing formats, and not dismiss them so easily.

For instance, I am sure NIST puts all those hard-to-decipher fields in there for a reason. They are there to support some functionalities in their tools. If we tell them, sorry, you can't have them any more because they are not supposed to be there, they will just not use our format, or they will find all kinds of back-door ways to stuff the information back in there. That's not what we want to see.

Back to the specific point of the protein field. The argument for having a protein field is for convenience and efficiency. Efficiency is important!

Typically, users who search a peptide spectral library will want to know what proteins their IDs map to. If the search step is followed by another tool which will do the peptide-protein mapping, then all is well. (This step would require the user to supply a FASTA file.) But sometimes it is not. Remember that sequence search engines naturally provide that protein information, and that's the benchmark that spectral library engines are held up to. From my point of view as the developer of SpectraST, I cannot really tell users that no, a library search is not supposed to tell you that, you need to install another tool. So practically speaking, the spectral search engine will need to do the mapping post-search every time a search is done, not to mention the awkwardness of asking the user to always specify a FASTA file to accompany the library.

The other use is for filtering. Often a user would want to filter his/her library by protein(s). If a protein field is present, then it is a simple thing. If not, then again the user has to look up the protein sequence, get all the possible peptide sequences of that protein, and then do a search by peptide.

By the way, SpectraST already has a function to re-map all library entries to proteins, based on a given FASTA file. If a user downloads a library but would like to use their own set of protein identifiers, it can do the re-mapping. It can be used to fix errors in the mapping, or to update to a new FASTA file. But if you don't allow me to store the protein somewhere in the library file, then I have to do this mapping every time a search is done!
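
For readers unfamiliar with this kind of re-mapping, here is a minimal sketch of the idea: read a FASTA file and list, for each library peptide, every protein whose sequence contains it. This is a naive substring search written for illustration and is not SpectraST's implementation; the function names, accessions, and sequences are made up, and real tools would also need to handle details such as I/L ambiguity.

```python
def read_fasta(text):
    """Return {accession: sequence} from FASTA-formatted text."""
    proteins = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the accession is the first whitespace-delimited token.
            accession = line[1:].split()[0]
            proteins[accession] = ""
        elif accession is not None:
            proteins[accession] += line
    return proteins


def remap_peptides(peptides, proteins):
    """Map each peptide to every protein whose sequence contains it."""
    return {
        pep: sorted(acc for acc, seq in proteins.items() if pep in seq)
        for pep in peptides
    }


fasta = """\
>P1 example protein one
MKTAYIAKQRPEPTIDEK
>P2 example protein two
GGPEPTIDEKLLSAMPLER
"""

mapping = remap_peptides(["PEPTIDEK", "SAMPLER", "NOTFOUND"], read_fasta(fasta))
print(mapping)
```

Storing the resulting list in an optional, multi-valued protein field is what lets a search engine skip this scan on every search, and an empty list shows why the field cannot be mandatory.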

The reality is that most peptides map to a small number of proteins, and the mapping is quite stable. We are here to deal with 95% of the cases, not the 5%. As long as the field is not mandatory, and we allow multiple proteins, it will serve all purposes and not break anything. Ultimately we have to trust the tools to use these fields wisely. It will make the tools run faster and minimize unnecessary repeated tasks.

I understand the argument that the protein is really not part of the analyte -- it is merely where it occurs in the natural world -- so it should not be stored with the library entry. We are saying, essentially, the source or any auxiliary information about the analyte should not be stored. But then what about organism? What about target/decoy? (The tool can figure that out from trying to map it to the FASTA! No need to store that field either.) What about natural/synthetic (the metabolomics people will want this field)? Oh, look up in some online database instead -- none of the business of the library. Synonyms of metabolites? Too messy, just store the InChI key and let the user look it up themselves. None of this has anything to do with the one-to-one correspondence between the analyte and its characteristic fragmentation pattern, which is, in a pure sense, what a library entry should be about. Are we really going to go down the road of cutting out anything that should not be part of this correspondence?

I think our overriding concern, at this point of the exercise, should be to ensure that all existing tools are willing to switch over. If we make it too hard on the tool developer or the user, then we may have a beautiful and well-designed format that no one will use.

Henry

schymane commented 6 years ago

I agree wholeheartedly. We should be flexible, be able to store all existing information from existing libraries (even if we don't necessarily see the use) and by allowing appropriate optional (not mandatory) fields we should be able to achieve this.

ypriverol commented 6 years ago

By design, every file format from PSI has the option to add additional properties, in some cases by using CVParams and UserParams (mzIdentML, mzML), or optional columns (mzTab). In the current document, we are specifying which fields we want to capture, with their cardinality, and we should guarantee that every section has a mechanism to add additional fields as CVParams. In the specification document, we can add a section about how to report protein information.

By design, we should have the flexibility to add the information and protein information is one of those cases.
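
The extension mechanism described above might look roughly like the following sketch, in which each section carries an open-ended list of parameters. The class names, the `format_param` helper, and the `protein accession` user parameter are hypothetical; the bracketed rendering imitates the mzTab-style parameters shown in the first comment and is not a proposal for the final syntax.

```python
from typing import NamedTuple, Optional


class CVParam(NamedTuple):
    """A term taken from a controlled vocabulary."""
    cv: str
    accession: str
    name: str
    value: Optional[str] = None


class UserParam(NamedTuple):
    """A free-form attribute with no controlled vocabulary term."""
    name: str
    value: Optional[str] = None


def format_param(param):
    """Render a parameter as [CV label, accession, name, value]."""
    value = param.value or ""
    if isinstance(param, CVParam):
        return f"[{param.cv}, {param.accession}, {param.name}, {value}]"
    return f"[, , {param.name}, {value}]"


# Optional extras attached to one section; nothing in this list is mandatory.
extra_params = [
    CVParam("MS", "MS:1000703", "LTQ Orbitrap"),
    UserParam("protein accession", "P12345"),  # illustrative value only
]

for param in extra_params:
    print(format_param(param))
```

Because the list is open-ended, optional information such as protein accessions can be carried without the core schema having to require it.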

javizca commented 6 years ago

In my view, there would need to be a way to encode protein level information, since some people/tools may need it. What I would avoid is to capture all the underlying complexity related to protein inference. That would ideally go somewhere else. Analogous concepts would be applicable also to analytes coming from metabolomics, lipidomics, etc.

RalfG commented 4 years ago

To update this issue:

All metadata is listed in the following document https://docs.google.com/document/d/1rN5DJSowp2micxlwJQlPxlv39ZiaLEfv/edit.

When editing, ALWAYS make sure you are using the "Suggesting" mode. To activate this click on View > Mode > Suggesting.

sneumann commented 4 years ago

Hi, can you open the document in "Comment" mode for everyone with the link? That should allow "Suggest mode" editing. And/or respond to the "request permission" notification I sent?

Thanks, Yours, Steffen


edeutsch commented 4 years ago

Switched the document to "anyone with the link can suggest", as requested.

edeutsch commented 9 months ago

Newer document: https://docs.google.com/document/d/1o11m7grfHvMzfbTozvDY0twJ1g2dzk6I/edit

But @RalfG is working on a system to encode this in JSON. See ongoing work in https://github.com/HUPO-PSI/mzSpecLib/pull/73

The next time we have @RalfG on the Friday call, we should spend some time on this table of information.