Converting MS-GF+ mzid output to pepXML

kevinkovalchik commented 3 years ago

Describe the question or problem

I'm unsure whether this is a bug or whether I just need help. I am unable to convert the mzid files generated by MS-GF+ into pepXML files using idconvert.

Details

Running a search with an unspecific digest, I get the mzid output fine. But when I try to convert it to pepXML using the idconvert distributed with the TPP, I get the following error:

processing file: ./YE_20180428_SK_HLA_A0202_3Ips_a50mio_R1_01.msgf.mzid
writing output file: ./pepxml_results/YE_20180428_SK_HLA_A0202_3Ips_a50mio_R1_01.pepXML
Error writing analysis 1 in "YE_20180428_SK_HLA_A0202_3Ips_a50mio_R1_01.msgf.mzid":
[EnzymePtr_name] No enzyme name or regular expression.

When I try the same conversion on a file from a tryptic search it works fine.

Useful extras

MS-GF+ version: Release (v2021.01.08) (8 January 2021)
OS: Ubuntu 20.04
TPP version: 5.2
Information on idconvert release:

ProteoWizard release: 3.0.11841 (TPP v6.0.0-rc6 Noctilucent, Build 202102220850-8394 (Linux-x86_64)) (0-0-0 (unknown revision))
ProteoWizard IdentData: 3.0.11841 (TPP v6.0.0-rc6 Noctilucent, Build 202102220850-8394 (Linux-x86_64)) (0-0-0 (unknown revision))
Build date: Feb 22 2021 09:01:48

This same issue was mentioned on the ProteoWizard mailing list a while ago. There was not much follow-up, and they seemed to think it was not an idconvert issue, but I'm not sure: https://sourceforge.net/p/proteowizard/mailman/message/32673464/

I looked into the mzid file and found this for cleavage info:

    <Enzymes>
      <Enzyme semiSpecific="true" missedCleavages="-1" id="UnspecificCleavage">
        <EnzymeName>
          <cvParam cvRef="PSI-MS" accession="MS:1001956" name="unspecific cleavage"/>
        </EnzymeName>
      </Enzyme>
    </Enzymes>

It looks okay to me, though I am not an expert on the mzid specification. However, if I make this change (adding a name attribute to the Enzyme element), the conversion runs fine:

    <Enzymes>
      <Enzyme semiSpecific="true" missedCleavages="-1" id="UnspecificCleavage" name="unspecific cleavage">
        <EnzymeName>
          <cvParam cvRef="PSI-MS" accession="MS:1001956" name="unspecific cleavage"/>
        </EnzymeName>
      </Enzyme>
    </Enzymes>

This looks weird to me because the name is duplicated (once as an attribute on Enzyme and once in the cvParam). What do you think? Is it an issue with MS-GF+ or with idconvert? Or is it something else I am doing wrong?

Thanks!

Kevin

FarmGeek4Life commented 3 years ago

@kevinkovalchik Can you share the mzid file? @chambm any thoughts? Making MS-GF+ output the name attribute on "Enzyme" to fix this issue feels like an unnecessary hack.

alchemistmatt commented 3 years ago

Looking at the mzIdentML 1.1.1 spec, name is an optional attribute for the <Enzyme> element. In contrast, name is required for the <cvParam> element. I highly suspect that idconvert is treating name as a required attribute for <Enzyme>. We will update MS-GF+ to create .mzid files with a name attribute in the <Enzyme> element; it's a minor change that matches the spec.

Here is the example enzyme entry from https://github.com/HUPO-PSI/mzIdentML/blob/master/specification_document/specdoc1_1/mzIdentML1.1.1.doc

<Enzymes>
  <Enzyme id="ENZ_0" cTermGain="OH" nTermGain="H" semiSpecific="0">
    <SiteRegexp><![CDATA[(?<=[KR])(?!P)]]></SiteRegexp>
    <EnzymeName>
      <cvParam accession="MS:1001251" name="Trypsin" cvRef="PSI-MS"/>
    </EnzymeName>
  </Enzyme>
  ...
</Enzymes>

Notice that name does not appear as an attribute of <Enzyme> in this example. But, like I said, it's optional:

Attribute Name: id
Data Type:      xsd:string
Use:            required
Definition:     An identifier is an unambiguous string that is unique within the scope (i.e. a document, a set of related documents, or a repository) of its use. 

Attribute Name: name
Data Type:      xsd:string
Use:            optional
Definition:     The potentially ambiguous common identifier, such as a human-readable name for the instance. 

Attribute Name: semiSpecific
Data Type:      xsd:boolean
Use:            optional
Definition:     Set to true if the enzyme cleaves semi-specifically (i.e. one terminus MUST cleave according to the rules, the other can cleave at any residue), false if the enzyme cleavage is assumed to be specific to both termini (accepting for any missed cleavages).
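
To check which enzyme entries in a given file might trigger the error, here is a minimal Python sketch. It assumes, based only on the wording of the error message ("No enzyme name or regular expression"), that idconvert complains when an <Enzyme> has neither a name attribute nor a <SiteRegexp>; that is a guess about idconvert's behavior, not something confirmed in its source. The function and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix so 'Enzyme' matches with or without a namespace."""
    return tag.rsplit("}", 1)[-1]


def enzymes_without_name_or_regexp(mzid_text: str) -> List[str]:
    """Return the ids of <Enzyme> elements that have neither a name attribute
    nor a non-empty <SiteRegexp> child (assumed trigger for the idconvert error)."""
    root = ET.fromstring(mzid_text)
    flagged = []
    for elem in root.iter():
        if local_name(elem.tag) != "Enzyme":
            continue
        has_name = bool(elem.get("name"))
        has_regexp = any(
            local_name(child.tag) == "SiteRegexp" and (child.text or "").strip()
            for child in elem
        )
        if not has_name and not has_regexp:
            flagged.append(elem.get("id", "<no id>"))
    return flagged
```

Passing the text of the unspecific-cleavage file from this issue should flag the UnspecificCleavage entry, while a tryptic entry with a <SiteRegexp> should not be flagged.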

alchemistmatt commented 3 years ago

Looking into this, MS-GF+ uses jmzIdentML to create the .mzid file: https://github.com/PRIDE-Utilities/jmzIdentML

That means that updating things to include a name attribute for the <Enzyme> element will be harder than I thought. It's entirely possible that we'd have to clone their repo to make the change, which is less than ideal. The alternative is to create the .mzid file and then post-process it to insert the name attribute. Better yet would be for @chambm to update idconvert so that it does not require <Enzyme> to have a name attribute.
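
For anyone stuck on an idconvert build that still raises the error, the post-processing option could look roughly like the Python sketch below. It is not part of MS-GF+ or ProteoWizard; the function name is hypothetical. It copies the name of the cvParam inside <EnzymeName> onto the parent <Enzyme> when that attribute is missing, which mirrors the manual edit shown earlier in this issue.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix so element names match regardless of namespace."""
    return tag.rsplit("}", 1)[-1]


def add_enzyme_names(mzid_text: str) -> str:
    """Return mzid XML in which every <Enzyme> lacking a name attribute gets
    the name of the first cvParam found under its <EnzymeName> child."""
    root = ET.fromstring(mzid_text)

    # Keep the document's default namespace on output instead of 'ns0:' prefixes.
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
        ET.register_namespace("", namespace)

    for enzyme in root.iter():
        if local_name(enzyme.tag) != "Enzyme" or enzyme.get("name"):
            continue  # not an enzyme, or it already has a name
        for child in enzyme:
            if local_name(child.tag) != "EnzymeName":
                continue
            for cv in child:
                if local_name(cv.tag) == "cvParam" and cv.get("name"):
                    enzyme.set("name", cv.get("name"))
                    break
            if enzyme.get("name"):
                break

    return ET.tostring(root, encoding="unicode")
```

The sketch loads the whole document into memory, so very large .mzid files may call for a streaming approach instead. Attribute order and whitespace in the output may differ from the input, and the XML declaration is not re-emitted.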

FarmGeek4Life commented 3 years ago

Thank you for your work, @chambm. It's always fun finding 9-year-old bugs, right?

chambm commented 3 years ago

:champagne: It's a testament to how few people use idconvert, or use unspecific searches, or some mix of those. ;)

kevinkovalchik commented 3 years ago

Indeed, it is not the most common of combinations... Thanks for working on it!

MSGFPlus / msgfplus

Converting MS-GF+ mzid output to pepXML #117