duplicate entry - Githubissues

cdlawless commented 6 years ago

Hi,

I've found an issue with an mzid xml output from v2018.09.12 when searching my samples with a mouse proteome (UP000000589_10090.fasta, Uniprot rel:March2018).

The issue was highlighted when I went on to run msgf2pin converter.

It highlighted a duplicated entry: XML parser error at C57_F_1_target.mzid:2026098:166

Upon inspection this refers to PepEv_7658488_LSTM+16SPR_32, which violates the XSD schema. (https://www.dropbox.com/s/a6751sdwkxmrill/C57_F_1_target.mzid?dl=0)

Any help/advice appreciated.

Regards

Craig

FarmGeek4Life commented 6 years ago

This is a fight we've had for a while. Originally each entry got a forced unique value (i.e., an incrementing number), but that was changed because somebody wanted the IDs to match across files for mzid merging purposes. Even so, I don't know how that duplicate occurred, because there is a check specifically to prevent this, and it somehow failed or was ignored.

If you could send me the protein sequence for >sp|Q8BMD7|MORC4_MOUSE, it would help me figure out what happened; in the meantime, you can delete one of the PeptideEvidence lines that specify PepEv_7658488_LSTM+16SPR_32, since they are exact duplicates.

verheytb commented 5 years ago

This is becoming a huge issue for me as I am getting multiple duplicate entries, and doing this manually for multiple files is not an option. Is there a suitable workaround that can fix these files in an automated way?
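
One way to automate the manual fix suggested above is to drop any PeptideEvidence line that exactly repeats an earlier one. The following is a minimal Python sketch, not part of MS-GF+. It assumes each PeptideEvidence element sits on a single line of the .mzid file, and the function names are hypothetical. IDs that are reused with different attributes are left in place and reported, since deleting either copy could discard real information.

```python
import re

# Captures the id attribute of a <PeptideEvidence .../> element on one line.
PEPEV_ID = re.compile(r'<PeptideEvidence\b[^>]*?\bid="([^"]+)"')


def drop_exact_duplicates(lines, conflicts):
    """Yield lines, skipping PeptideEvidence lines that exactly repeat an earlier one.

    IDs reused with different content are appended to `conflicts`
    and their lines are still yielded unchanged.
    """
    seen = {}  # PeptideEvidence id -> stripped text of its first occurrence
    for line in lines:
        match = PEPEV_ID.search(line)
        if match:
            pe_id, body = match.group(1), line.strip()
            if pe_id in seen:
                if seen[pe_id] == body:
                    continue  # exact duplicate: drop it
                conflicts.append(pe_id)  # same id, different attributes
            else:
                seen[pe_id] = body
        yield line


def dedupe_file(src_path, dst_path):
    """Copy src_path to dst_path without exact duplicates; return conflicting ids."""
    conflicts = []
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(drop_exact_duplicates(src, conflicts))
    return conflicts
```

Calling `dedupe_file("C57_F_1_target.mzid", "C57_F_1_target.dedup.mzid")` would write a cleaned copy and return any IDs that still need manual attention. The lookup table holds one entry per PeptideEvidence, so memory use grows with the number of entries in the file.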

alchemistmatt commented 5 years ago

We run MS-GF+ on hundreds of datasets per week and we don't encounter this problem, so we're unable to fix it. To diagnose further, we'll need one of your buggy .mzid files along with the FASTA file. Please send me your e-mail address using proteomics@pnnl.gov and then I can send you a request to transfer files using https://fx.pnnl.gov/

We'll also need the modification file you're using, and the command line you use to start MSGFPlus.jar.

alchemistmatt commented 5 years ago

Thank you for providing additional info. Looking at the console output, you used MS-GF+ Release (v2017.07.21) (21 July 2017)

We made a change in May 2018 to fix a peptide evidence problem similar to what you're experiencing; see c9394669e7667072941612f34cd131ca9646cc27

If possible, please re-run your search using the latest version, https://github.com/MSGFPlus/msgfplus/releases/tag/v2018.10.15

I can't guarantee that this will fix things, but hopefully it will.

FarmGeek4Life commented 5 years ago

So, looking at the duplicates that exist in that mzid file, all of the duplicate peptide evidences are exact duplicates, which is what commit c939466 resolved. The issue that @cdlawless had is slightly different; the peptide evidence IDs were duplicated, but there were some differences in start/end/pre/post, which we still haven't resolved.
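
To tell these two cases apart in a given file, the duplicated IDs can be grouped and their attributes compared. This is a rough Python sketch using only the standard library; the function name is hypothetical, and it matches elements by local tag name so that it does not depend on the exact mzIdentML namespace version. An empty list for an ID means its copies are exact duplicates; otherwise the list names the attributes (for example start, end, pre, post) that differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_duplicate_evidence(xml_source):
    """Map each duplicated PeptideEvidence id to the attribute names that differ.

    `xml_source` is a file path or a binary file object.
    """
    copies = defaultdict(list)  # id -> list of attribute dicts
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        # Strip any "{namespace}" prefix before comparing the tag name.
        if elem.tag.rsplit("}", 1)[-1] == "PeptideEvidence":
            copies[elem.get("id")].append(dict(elem.attrib))
            elem.clear()  # release attributes and children already processed

    report = {}
    for pe_id, attr_sets in copies.items():
        if len(attr_sets) < 2:
            continue  # id is unique, nothing to report
        names = set().union(*attr_sets)
        report[pe_id] = sorted(
            name for name in names
            if len({attrs.get(name) for attrs in attr_sets}) > 1
        )
    return report
```

IDs with an empty list are safe to de-duplicate mechanically, while IDs with a non-empty list correspond to the unresolved case described above.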

verheytb commented 5 years ago

Thanks for pointing that out - I was using the latest version from Bioconda. I will try v2018.10.15 and let you know.

Ted

bbhatt1789 commented 5 years ago

Hello. I have the same problem using MSGFPlus_v20181015, the 2018 version. Is there any way to get around the duplicate IDs yet? I'm also trying to run the msgf2pin converter.

Thanks, Bhoomi

FarmGeek4Life commented 5 years ago

Can you send us the mzid file? We can only fix what we can see and test; we're still trying to find all of the edge cases that are causing problems.

bbhatt1789 commented 5 years ago

It's in the Box folder: https://bcm.app.box.com/folder/65633763815

bbhatt1789 commented 5 years ago

https://bcm.app.box.com/folder/65633763815

FarmGeek4Life commented 5 years ago

I don't think I can access that; it's expecting a login (permissions, or a public sharing link?)

alchemistmatt commented 5 years ago

@bbhatt1789 send your e-mail address to proteomics@pnnl.gov and we can send you a link to http://fx.pnnl.gov which you can use to transfer the file to us.

bbhatt1789 commented 5 years ago

bbhatt@bcm.edu

FarmGeek4Life commented 5 years ago

Does your mods.txt file have 2 entries for oxidation on residue M, by chance?

alchemistmatt commented 5 years ago

The question regarding the double entry for oxidation is because the .mzid file you provided has two entries for oxidation of Methionine. At present, MS-GF+ does not check for duplicated dynamic (or static) mods.

    <ModificationParams>
      <SearchModification fixedMod="true" massDelta="57.021465" residues="C">
        <cvParam cvRef="UNIMOD" accession="UNIMOD:4" name="Carbamidomethyl"/>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="15.994915" residues="M">
        <cvParam cvRef="UNIMOD" accession="UNIMOD:35" name="Oxidation"/>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="15.994915" residues="M">
        <cvParam cvRef="UNIMOD" accession="UNIMOD:35" name="Oxidation"/>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="0.9840156" residues="N">
        <cvParam cvRef="UNIMOD" accession="UNIMOD:7" name="Deamidated"/>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="0.9840156" residues="Q">
        <cvParam cvRef="UNIMOD" accession="UNIMOD:7" name="Deamidated"/>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="-17.026548" residues="Q">
        <SpecificityRules>
          <cvParam cvRef="PSI-MS" accession="MS:1001189" name="modification specificity N-term"/>
        </SpecificityRules>
        <cvParam cvRef="UNIMOD" accession="UNIMOD:28" name="Gln-&gt;pyro-Glu"/>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="42.010563" residues=".">
        <SpecificityRules>
          <cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"/>
        </SpecificityRules>
        <cvParam cvRef="UNIMOD" accession="UNIMOD:1" name="Acetyl"/>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="79.96633" residues="S">
        <cvParam cvRef="UNIMOD" accession="UNIMOD:21" name="Phospho"/>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="79.96633" residues="T">
        <cvParam cvRef="UNIMOD" accession="UNIMOD:21" name="Phospho"/>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="79.96633" residues="Y">
        <cvParam cvRef="UNIMOD" accession="UNIMOD:21" name="Phospho"/>
      </SearchModification>
    </ModificationParams>

In fact, you have specified a large number of dynamic mods, which greatly increases the search space and slows the search. For a fairly small FASTA file you might be able to get away with this many dynamic mods, but for mammalian samples you should limit yourself to tryptic peptides and/or search your data multiple times: once looking for phosphorylated STY along with Carbamidomethyl Cys and possibly oxidized Methionine, and then, in a separate search, for deamidation and acetylation.

alchemistmatt commented 5 years ago

Release 2019.02.01 adds several validation checks that should catch mistakenly defining the same dynamic modification on the same residue.
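
For results produced by older releases, a similar check can be approximated against the ModificationParams block of an existing .mzid file. The sketch below is an illustration in Python, not the validation code that MS-GF+ itself uses; the function name is hypothetical. It treats two SearchModification entries as duplicates when they share fixedMod, massDelta, residues, and the same set of specificity-rule accessions.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def _local(tag):
    """Return the tag name without any "{namespace}" prefix."""
    return tag.rsplit("}", 1)[-1]


def duplicate_search_mods(modification_params_xml):
    """Return the keys of SearchModification entries defined more than once.

    Each key is (fixedMod, massDelta, residues, specificity accessions).
    """
    root = ET.fromstring(modification_params_xml)
    keys = []
    for mod in root.iter():
        if _local(mod.tag) != "SearchModification":
            continue
        # Collect accessions of cvParams nested under SpecificityRules, if any.
        rules = tuple(sorted(
            cv.get("accession")
            for child in mod if _local(child.tag) == "SpecificityRules"
            for cv in child
        ))
        keys.append((mod.get("fixedMod"), mod.get("massDelta"),
                     mod.get("residues"), rules))
    return [key for key, count in Counter(keys).items() if count > 1]
```

Run on the ModificationParams block shown earlier in this thread, this would flag the repeated dynamic oxidation on M, while the Deamidated entries on N and Q are not flagged because their residues differ.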

bbhatt1789 commented 5 years ago

It works now. Thank you for pointing out the mistake.

verheytb commented 5 years ago

I continue to have this problem with v2018.10.15.

alchemistmatt commented 5 years ago

@verheytb Please try the latest release (https://github.com/MSGFPlus/msgfplus/releases/tag/v2019.02.14). It includes checks for duplicate modification definitions, which might be your issue. Otherwise, you'll need to send us your input file, FASTA file, and command line arguments. Since these files are typically too large for e-mail, if you send your e-mail address to proteomics@pnnl.gov I can send you a https://fx.pnnl.gov/ file transfer request, which will allow you to send us the large files via a website.

verheytb commented 5 years ago

I have checked the mods file, and it does not have duplicate entries, so I suspect the problem lies elsewhere. I can send you my files. Thanks for your help,

Ted

FarmGeek4Life commented 5 years ago

Release 2019.02.27 includes a fix for this; when trying to fix this previously for #24, I failed to make one line of code conditional. It also includes a change that fixes a different output error, related to how MSGF+ internally tries to track protein N-term methionine cleavage, which led to incorrect start/end locations and pre/post residues on certain PeptideEvidences. Thank you @verheytb for the test files that helped me test and fix this problem.

alchemistmatt commented 5 years ago

We discovered an issue with the 2019.02.27 release, fixed with f3a67e446890690. Please instead use Release 2019.02.28.

verheytb commented 5 years ago

Excellent, thank you!

MSGFPlus / msgfplus

duplicate entry #49