PRIDE-Archive / ms-data-core-api

Open-source Java library to handle different file format standards for proteomics. In particular, ms-data-core-api is well suited to metadata representation.
Apache License 2.0

Question about method MzIdentMLUtils.getSpectrumId #30

Open colin-combe opened 6 years ago

colin-combe commented 6 years ago

Hi - we have been looking at different ways of parsing mzIdentML files. PRIDE's ms-data-core-api looks to be the most complete solution; for example, other libraries do not deal with the different formats used for spectrum ids.

I have a question regarding the formats used for spectrum ids and the code at: https://github.com/PRIDE-Utilities/ms-data-core-api/blob/4b5f9f8d8a87c03b37a9652492a95aec029c1ca9/src/main/java/uk/ac/ebi/pride/utilities/data/utils/MzIdentMLUtils.java#L55-L82

As I read it, when the fileIdFormat is Constants.SpecIdFormat.MASCOT_QUERY_NUM or Constants.SpecIdFormat.MULTI_PEAK_LIST_NATIVE_ID, one is added to the spectrum id. I took this to mean that these two formats use zero-based indexes, whereas the norm is to use one-based indexes.

This is the case for the multiple peak list nativeID format (MS:1000774) which says:

Index is the spectrum number in the file, starting from 0.

However, for the Mascot query number (MS:1001528) it says:

The spectrum (query) number in a Mascot results file, starting from 1.

So, finally getting to my question, why is one added to the spectrum id for Mascot query number format when the corresponding CV term says it is already one-based?
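To make the question concrete, here is a minimal sketch of what the two CV definitions quoted above would imply if each format were converted to a one-based number. It is not the ms-data-core-api code: the class SpectrumIdSketch, the enum IdFormat and the method toOneBased are hypothetical names made up for illustration, and the "index=" and "query=" prefixes are assumed from the nativeID patterns for these two terms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the MzIdentMLUtils implementation.
class SpectrumIdSketch {

    // Hypothetical stand-in for the two spectrum id formats under discussion.
    enum IdFormat {
        MULTI_PEAK_LIST_NATIVE_ID("index", 0), // MS:1000774: "starting from 0"
        MASCOT_QUERY_NUM("query", 1);          // MS:1001528: "starting from 1"

        final int base;        // first legal value according to the CV definition
        final Pattern pattern; // assumed id shape: "<key>=<non-negative integer>"

        IdFormat(String key, int base) {
            this.base = base;
            this.pattern = Pattern.compile(key + "=(\\d{1,9})");
        }
    }

    // Returns the one-based spectrum number as a string, or null if the id
    // does not match the assumed shape for the given format.
    static String toOneBased(IdFormat format, String spectrumId) {
        Matcher m = format.pattern.matcher(spectrumId.trim());
        if (!m.matches()) {
            return null;
        }
        int value = Integer.parseInt(m.group(1));
        // Shift by the format's own base instead of always adding one.
        return Integer.toString(value - format.base + 1);
    }

    public static void main(String[] args) {
        // Zero-based format: the first spectrum is index=0.
        System.out.println(toOneBased(IdFormat.MULTI_PEAK_LIST_NATIVE_ID, "index=0")); // prints 1
        // One-based format: the first spectrum is query=1.
        System.out.println(toOneBased(IdFormat.MASCOT_QUERY_NUM, "query=1"));          // prints 1
    }
}
```

Under these assumptions, "query=1" already refers to the first spectrum, so adding one unconditionally would give 2 and point at the second spectrum.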

cheers, Colin

sureshhewabi commented 6 years ago

I think it is a good question and we need to sort that out. If you happen to have a file that uses the Mascot query number (MS:1001528) format in its CV param, please send it to us; it would help us with debugging. Thanks

colin-combe commented 6 years ago

I don't have an example and I can't find one either - I just noticed the apparent inconsistency between the definition of the CV term and the code.

However, it's maybe not such a big problem - I notice the mzIdentML 1.2.0 schema (para 5.1.2) does not list MS:1001528 as a legal way of referencing a spectrum identification.

colin-combe commented 6 years ago

There is an example mzid file using MS:1001528 at ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2014/01/PXD000198/1007.mzid

sureshhewabi commented 6 years ago

It seems this is a partial project, and the spectra file referenced inside 1007.mzid (file:///DATA.TXT) is missing from the project. We might have to find a better example.

MS:1001528 is not even allowed in version 1.1.0. I think we might need to remove that code in the next release to avoid any confusion.