Nesvilab / MSFragger

Ultrafast, comprehensive peptide identification for mass spectrometry–based proteomics
https://msfragger.nesvilab.org
Add nativeID output in mzIdentML/pepXML[/mzML/PIN] #324

Open chambm opened 3 months ago

chambm commented 3 months ago

To preserve Waters and Sciex source spectrum links, writing nativeID in the output pepXML/mzIdentML is necessary. Please read in the nativeID when reading spectra and pass it through when writing pepXML/mzIdentML elements for those spectra. It would be good to preserve it in mzML as well, but that's not as important. If the format allows, it might be helpful to write it in Percolator PIN format as well so it's simple to map PIN lines to the mzIdentML/pepXML equivalent.
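For readers unfamiliar with where the nativeID lives: in mzML it is the `id` attribute of each `<spectrum>` element, alongside the 0-based `index`. The following is only a minimal sketch of reading it and passing it through, not MSFragger's actual code. The sample document is a heavily trimmed fragment rather than a schema-valid mzML file, the helper names are hypothetical, and `spectrumNativeID` is the unofficial pepXML attribute mentioned later in this thread.

```python
import xml.etree.ElementTree as ET

MZML_NS = "{http://psi.hupo.org/ms/mzml}"

# Trimmed, illustrative fragment (not a complete, schema-valid mzML file).
SAMPLE_MZML = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="demo">
    <spectrumList count="2">
      <spectrum index="0" id="function=1 process=0 scan=1" defaultArrayLength="0"/>
      <spectrum index="1" id="function=2 process=0 scan=1" defaultArrayLength="0"/>
    </spectrumList>
  </run>
</mzML>"""


def read_native_ids(mzml_text):
    """Return (index, nativeID) pairs for every <spectrum> element."""
    root = ET.fromstring(mzml_text)
    return [
        (int(spec.get("index")), spec.get("id"))
        for spec in root.iter(MZML_NS + "spectrum")
    ]


def spectrum_query(spectrum_name, scan_number, native_id):
    """Build a pepXML-like <spectrum_query> that carries the nativeID through.

    Only a few attributes are set; a real pepXML writer emits many more.
    """
    return ET.Element(
        "spectrum_query",
        {
            "spectrum": spectrum_name,
            "start_scan": str(scan_number),
            "end_scan": str(scan_number),
            "spectrumNativeID": native_id,  # unofficial attribute
        },
    )


native_ids = read_native_ids(SAMPLE_MZML)
first_query = spectrum_query("demo.1.1.2", 1, native_ids[0][1])
```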

Thanks!

fcyu commented 2 months ago

Hi Matt,

Sorry for the long delay. I finally got a chance to implement this feature. If it is not too much trouble, could you share some typical Waters and Sciex mzML files for me to test with?

> If the format allows, it might be helpful to write it in Percolator PIN format as well so it's simple to map PIN lines to the mzIdentML/pepXML equivalent.

I don't want to change the SpecId column because many downstream tools parse that column. As far as I know, there is no additional column that can be used for the native ID. Let me know if the latest Percolator supports a native ID column.

Also, may I ask if there is any harm in making the "index" not start from 0 and not be contiguous? I would like to use scan number - 1 as the index to keep it consistent when the mzML file contains only a subset of the scans.

Thanks,

Fengchao

chambm commented 2 months ago

I'm glad to hear this is almost done!

Unfortunately AFAIK the mzML index must be 0-based and contiguous: https://peptideatlas.org/tmp/mzML1.1.0.html#spectrum Usually if you make an mzML from a format where you don't have a nativeID, only a scan number, you would just make the nativeID like "scan=123" or "index=122". But you already have a real nativeID. The problem here is to map the mzML/pepXML to the PIN TSV, right? https://github.com/percolator/percolator/wiki/Interface#pintsv-tab-delimited-file-format

As far as I can tell from that, there should be a string PsmId column and a numeric ScanNr column. It seems pretty typical for the ScanNr column to be missing though. I can understand not wanting to change the PsmId format you've been using, but that's really the only column suitable for the nativeID. :(

Maybe easiest would just be to guarantee that the number and order of lines in the pepXML is the same in the PIN?
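If the two outputs really were guaranteed to have the same number and order of PSMs, a consumer could join them purely by position. The sketch below assumes exactly that guarantee. The PIN text is a trimmed, made-up fragment with only three columns, the SpecId values are invented, and `pair_pin_with_native_ids` is a hypothetical helper.

```python
import csv
import io

# Made-up PIN fragment; real PIN files carry many feature columns as well.
SAMPLE_PIN = (
    "SpecId\tLabel\tScanNr\n"
    "demo.1.1.2_1\t1\t1\n"
    "demo.2.2.3_1\t-1\t2\n"
)


def pair_pin_with_native_ids(pin_text, native_ids):
    """Pair each PIN row with the nativeID at the same position.

    Raises ValueError when the counts differ, because a positional join is
    only meaningful if both files list the same PSMs in the same order.
    """
    rows = list(csv.DictReader(io.StringIO(pin_text), delimiter="\t"))
    if len(rows) != len(native_ids):
        raise ValueError(
            f"PIN has {len(rows)} rows but {len(native_ids)} nativeIDs were given"
        )
    return [(native_id, row["SpecId"]) for native_id, row in zip(native_ids, rows)]


pairs = pair_pin_with_native_ids(
    SAMPLE_PIN,
    ["function=1 process=0 scan=1", "function=2 process=0 scan=1"],
)
```

The length check is the only safeguard available here: if a row is dropped from one file and another row is added elsewhere, a positional join silently misaligns, which is why carrying the nativeID itself is the more robust option.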

chambm commented 2 months ago

Here's an example Waters DDA file. 010208_ecoli_003-dda2.zip

fcyu commented 2 months ago

Mapping the mzML/pepXML to the PIN file is actually fine as long as we have a consistent way to extract the scan number: from the native ID if it encodes the scan number in 1-D, as Thermo's does, or index + 1 if it does not, as with Waters and Sciex. I only asked because you requested it. If it is OK not to have the native ID in the PIN file, I can leave it out.
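The rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not MSFragger's implementation: the function name is hypothetical, and the decision about which nativeID layouts count as "1-D" (a Thermo-style `controllerType=… controllerNumber=… scan=…` string, or a bare `scan=…`) is an assumption made for the example.

```python
import re

_KEY_VALUE = re.compile(r"(\w+)=(\S+)")

# Key sets treated as encoding a single, file-wide scan counter (assumption).
_ONE_D_LAYOUTS = (
    {"controllerType", "controllerNumber", "scan"},  # Thermo-style
    {"scan"},                                        # bare "scan=N"
)


def scan_number(native_id, index):
    """Return the scan number for a spectrum.

    Uses the scan= value when the nativeID layout is 1-D, otherwise falls
    back to the 0-based mzML index plus one.
    """
    fields = dict(_KEY_VALUE.findall(native_id))
    if set(fields) in _ONE_D_LAYOUTS:
        return int(fields["scan"])
    return index + 1


thermo = scan_number("controllerType=0 controllerNumber=1 scan=123", 122)
waters = scan_number("function=2 process=0 scan=5", 40)
sciex = scan_number("sample=1 period=1 cycle=7 experiment=2", 13)
```

Note that the Waters string also contains `scan=5`, but that counter restarts within each function, so it is not used on its own.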

The problem arises when an mzML file is a subset of the original mzML file and its native ID does not encode the scan number in 1-D, as with Waters and Sciex. Since scan number = index + 1, the scan numbers in the subset mzML differ from those in the original mzML, which makes it hard to map across different tools. The options I can think of are either not generating the subset mzML file, or setting index = scan number - 1 (in which case the index would not start at 0 and would not be contiguous).
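A toy illustration of the mismatch, using invented Waters-style nativeIDs: once a file is filtered, index + 1 no longer points at the same spectrum, whereas the nativeID string is unchanged.

```python
# Invented nativeIDs for a four-spectrum "original" file.
original = [
    "function=1 process=0 scan=1",
    "function=2 process=0 scan=1",
    "function=1 process=0 scan=2",
    "function=2 process=0 scan=2",
]

# A subset file that keeps only the function=2 spectra.
subset = [native_id for native_id in original if native_id.startswith("function=2")]


def derived_scan_numbers(native_ids):
    """Map each nativeID to the scan number derived as index + 1."""
    return {native_id: index + 1 for index, native_id in enumerate(native_ids)}


in_original = derived_scan_numbers(original)
in_subset = derived_scan_numbers(subset)

# nativeID -> (scan number in original, scan number in subset)
changed = {
    native_id: (in_original[native_id], in_subset[native_id])
    for native_id in subset
    if in_original[native_id] != in_subset[native_id]
}
```

Here both retained spectra end up with different derived scan numbers (2 becomes 1, and 4 becomes 2), which is exactly the cross-tool mapping problem described above.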

Maybe in the future the mzML schema could have a scan_number field where tools can put their own scan numbers. But those tools would still need to support it.

Best,

Fengchao

chambm commented 2 months ago

You could use a userParam. Those are arbitrary and basically unlimited.
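As a rough sketch of the userParam idea: mzML allows `<userParam>` elements with `name`, `type`, and `value` attributes inside a `<spectrum>`. The parameter name "scan number" below is purely hypothetical (it is not a controlled-vocabulary term), the helper name is made up, and the element is built without the mzML namespace to keep the example short.

```python
import xml.etree.ElementTree as ET


def add_scan_number_param(spectrum, scan_number):
    """Attach a tool-defined scan number to a <spectrum> as a userParam.

    "scan number" is an illustrative name chosen for this sketch; readers
    and writers would have to agree on it out of band.
    """
    return ET.SubElement(
        spectrum,
        "userParam",
        {"name": "scan number", "type": "xsd:int", "value": str(scan_number)},
    )


spectrum = ET.Element(
    "spectrum",
    {"index": "0", "id": "function=2 process=0 scan=1", "defaultArrayLength": "0"},
)
param = add_scan_number_param(spectrum, 2)
```

This keeps `index` 0-based and contiguous while still recording the original scan number, but only tools that know to look for this particular userParam would benefit.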

But I think nativeID is specifically intended and useful for mapping across different tools, and for remaining valid when files are filtered or subsetted. It's why I started putting spectrumNativeID in my pepXML output, even though that wasn't an official attribute. :)

fcyu commented 2 months ago

> But I think nativeID is specifically intended and useful for mapping across different tools, and for remaining valid when files are filtered or subsetted. It's why I started putting spectrumNativeID in my pepXML output, even though that wasn't an official attribute. :)

Yes, but then I need to maintain native ID -> scan number and scan number -> native ID maps in all the tools that read mzML and raw files, because we index scans in 1-D.

Best,

Fengchao

chambm commented 2 months ago

In my tools I made nativeID a field in the Spectrum class and had a map from nativeID to Spectrum*. When possible, I dropped scan number entirely because it wasn't universally applicable, and when not possible, I parsed it out of the nativeID (or used index if not parseable).
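A minimal Python analogue of that design might look like the sketch below. The class and function names are hypothetical, the original tools are described as using a `Spectrum*` pointer map rather than Python objects, and scan-number parsing is left out so the example stays focused on lookup by nativeID.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Spectrum:
    index: int      # 0-based position in the file
    native_id: str  # vendor-specific identifier, treated as the primary key


def build_native_id_map(spectra: Iterable[Spectrum]) -> Dict[str, Spectrum]:
    """Index spectra by nativeID, rejecting duplicate identifiers."""
    by_native_id: Dict[str, Spectrum] = {}
    for spectrum in spectra:
        if spectrum.native_id in by_native_id:
            raise ValueError(f"duplicate nativeID: {spectrum.native_id!r}")
        by_native_id[spectrum.native_id] = spectrum
    return by_native_id


spectra = [
    Spectrum(0, "function=1 process=0 scan=1"),
    Spectrum(1, "function=2 process=0 scan=1"),
]
by_native_id = build_native_id_map(spectra)
hit = by_native_id["function=2 process=0 scan=1"]
```

Because the key is the full nativeID string, the lookup keeps working after a file is filtered or subsetted, even though each spectrum's `index` may change.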

chambm commented 1 month ago

Hi Fengchao, has this change made it into a released MSFragger?

fcyu commented 1 month ago

For Thermo data, the spectrumNativeID is already in the pepXML file. For the others that require changing the scan indexing, I am trying to get it done before the next release.

Best,

Fengchao

fcyu commented 3 weeks ago

Hi @chambm, I added support for Waters and other vendors. Here are the pepXML and _calibrated.mzML files generated by MSFragger: 010208_ecoli_003-dda2_calibrated.zip. Basically, I used the approach we discussed in https://github.com/HUPO-PSI/psi-ms-CV/issues/343#issuecomment-2400189182.

I will send you the pre-release version of MSFragger by email.

Let me know if you have any questions or suggestions.

Thanks,

Fengchao