Closed cdlawless closed 5 years ago
This is a fight we've had for a while; originally each entry got a forced unique value (i.e., an incrementing number), but a change was made because somebody wanted the IDs to match across files, for mzid merging purposes. Even now I don't know how that duplicate occurred, because there is a check specifically to prevent this, and it somehow failed or was ignored.
If you could send me the protein sequence for >sp|Q8BMD7|MORC4_MOUSE, it would help me figure out what happened; in the meantime, you can delete one of the PeptideEvidence lines that specify PepEv_7658488_LSTM+16SPR_32, since they are exact duplicates.
This is becoming a huge issue for me as I am getting multiple duplicate entries, and doing this manually for multiple files is not an option. Is there a suitable workaround that can fix these files in an automated way?
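For the exact-duplicate case, the manual deletion suggested above can be scripted. What follows is a minimal Python sketch, not something shipped with MS-GF+. It assumes each PeptideEvidence element is written on a single line of the .mzid file; the function name `dedupe_peptide_evidence` and the command-line usage are hypothetical. Entries that share an id but differ in any attribute are kept and reported rather than dropped, since removing either one could discard real information.

```python
import re
import sys

# Matches a <PeptideEvidence .../> start tag and captures its id attribute.
# The \b after the tag name keeps <PeptideEvidenceRef .../> lines from matching.
PEPEV_ID = re.compile(r'<PeptideEvidence\b[^>]*?\bid="([^"]+)"')


def dedupe_peptide_evidence(lines, conflicts):
    """Yield lines, skipping exact repeats of an earlier PeptideEvidence line.

    Assumption: one PeptideEvidence element per line.
    Ids reused with different attributes are added to `conflicts` and kept.
    """
    seen = {}  # id -> stripped text of the first line carrying that id
    for line in lines:
        match = PEPEV_ID.search(line)
        if match is None:
            yield line
            continue
        pep_id, text = match.group(1), line.strip()
        if pep_id not in seen:
            seen[pep_id] = text
            yield line
        elif seen[pep_id] == text:
            continue  # exact duplicate: drop it
        else:
            conflicts.add(pep_id)  # same id, different content: keep and report
            yield line


if __name__ == "__main__" and len(sys.argv) == 3:
    src, dst = sys.argv[1], sys.argv[2]
    conflicts = set()
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(dedupe_peptide_evidence(fin, conflicts))
    for pep_id in sorted(conflicts):
        print(f"id reused with different attributes (left in place): {pep_id}")
```

Saved under a name such as `dedupe_pepev.py` (hypothetical), it would be run as `python dedupe_pepev.py input.mzid output.mzid`. It streams the file line by line, so it should cope with large results, and it leaves all other lines byte-for-byte unchanged.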
We run MS-GF+ on hundreds of datasets per week and we don't encounter this problem, so we're unable to fix it. To diagnose further, we'll need one of your buggy .mzid files along with the FASTA file. Please send your e-mail address to proteomics@pnnl.gov and then I can send you a request to transfer files using https://fx.pnnl.gov/
We'll also need the modification file you're using, and the command line you use to start MSGFPlus.jar
Thank you for providing additional info. Looking at the console output, you used
MS-GF+ Release (v2017.07.21) (21 July 2017)
We made a change in May 2018 to fix a peptide evidence problem similar to what you're experiencing; see c9394669e7667072941612f34cd131ca9646cc27
If possible, please re-run your search using the latest version, https://github.com/MSGFPlus/msgfplus/releases/tag/v2018.10.15
I can't guarantee that this will fix things, but hopefully it will.
So, looking at the duplicates that exist in that mzid file, all of the duplicate peptide evidences are exact duplicates, which is what commit c939466 resolved. The issue that @cdlawless had is slightly different; the peptide evidence IDs were duplicated, but there were some differences in start/end/pre/post, which we still haven't resolved.
Thanks for pointing that out - I was using the latest version from Bioconda. I will try v2018.10.15 and let you know.
Ted
Hello. I have the same problem using MSGFPlus_v20181015, the 2018 version. Is there any way to get around the duplicate ids yet? Also trying to run msgf2pin converter.
Thanks, Bhoomi
Can you send us the mzid file? We can only fix what we can see and test; we're still trying to find all of the edge cases that are causing problems.
It's in the Box folder. https://bcm.app.box.com/folder/65633763815
I don't think I can access that; it's expecting a login (permissions, or a public sharing link?)
@bbhatt1789 send your e-mail address to proteomics@pnnl.gov and we can send you a link to http://fx.pnnl.gov which you can use to transfer the file to us.
bbhatt@bcm.edu
Does your mods.txt file have 2 entries for oxidation on residue M, by chance?
The question regarding the double entry for oxidation is because the .mzid file you provided has two entries for oxidation of Methionine. At present, MS-GF+ does not check for duplicated dynamic (or static) mods.
<ModificationParams>
  <SearchModification fixedMod="true" massDelta="57.021465" residues="C">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:4" name="Carbamidomethyl"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="15.994915" residues="M">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:35" name="Oxidation"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="15.994915" residues="M">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:35" name="Oxidation"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="0.9840156" residues="N">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:7" name="Deamidated"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="0.9840156" residues="Q">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:7" name="Deamidated"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="-17.026548" residues="Q">
    <SpecificityRules>
      <cvParam cvRef="PSI-MS" accession="MS:1001189" name="modification specificity N-term"/>
    </SpecificityRules>
    <cvParam cvRef="UNIMOD" accession="UNIMOD:28" name="Gln->pyro-Glu"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="42.010563" residues=".">
    <SpecificityRules>
      <cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"/>
    </SpecificityRules>
    <cvParam cvRef="UNIMOD" accession="UNIMOD:1" name="Acetyl"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="79.96633" residues="S">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:21" name="Phospho"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="79.96633" residues="T">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:21" name="Phospho"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="79.96633" residues="Y">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:21" name="Phospho"/>
  </SearchModification>
</ModificationParams>
In fact, you have specified a large number of dynamic mods, which greatly increases the search space and slows the search. For a fairly small FASTA file you might be able to get away with this many dynamic mods, but for mammalian samples you should limit yourself to tryptic peptides and/or search your data multiple times: once looking for phosphorylated STY along with carbamidomethyl Cys and possibly oxidized methionine, and then, in a separate search, for deamidation and acetylation.
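To give a rough feel for how dynamic mods inflate the search space, here is a small Python sketch that counts the modified forms of a single peptide. It is a back-of-the-envelope illustration, not how MS-GF+ enumerates candidates: terminal-specific mods are ignored, and the function name, the `dynamic_mods` mapping, and the cap of 3 modified residues per peptide are assumptions chosen only for the example.

```python
def count_peptide_forms(peptide, dynamic_mods, max_mods):
    """Count distinct modified forms of one peptide, the unmodified one included.

    dynamic_mods maps a residue letter to the number of alternative dynamic
    mods defined for it (illustrative input, not an MS-GF+ structure).
    max_mods caps how many residues may be modified at once.
    """
    # ways[j] = number of forms with exactly j modified residues so far
    ways = [1] + [0] * max_mods
    for residue in peptide:
        options = dynamic_mods.get(residue, 0)
        if options == 0:
            continue
        # Walk j downward so each residue contributes at most one mod.
        for j in range(max_mods, 0, -1):
            ways[j] += ways[j - 1] * options
    return sum(ways)


peptide = "LSTMSPR"
print(count_peptide_forms(peptide, {"M": 1}, max_mods=3))                          # 2
print(count_peptide_forms(peptide, {"M": 1, "S": 1, "T": 1, "Y": 1}, max_mods=3))  # 15
```

Even for this seven-residue peptide, adding phosphorylation on S/T/Y to oxidation on M raises the count from 2 to 15 forms; across a mammalian proteome that growth compounds, which is the motivation for splitting the mods across separate searches.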
Release 2019.02.01 adds several validation checks that should catch mistakenly defining the same dynamic modification on the same residue.
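For results produced by an older release, the same kind of duplicate can be spotted after the fact by scanning the SearchModification entries in the .mzid. The following is a minimal Python sketch under stated assumptions: the helper names `local_name` and `duplicate_search_mods` are hypothetical, two entries count as duplicates when fixedMod, massDelta, residues, and all nested cvParam accessions match, and the whole document is parsed into memory, which may be impractical for very large files.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def duplicate_search_mods(root):
    """Return the keys of SearchModification entries that occur more than once.

    Key: (fixedMod, massDelta, residues, sorted cvParam accessions), where the
    accessions include any found inside SpecificityRules.
    """
    counts = Counter()
    for elem in root.iter():
        if local_name(elem.tag) != "SearchModification":
            continue
        accessions = tuple(sorted(
            child.get("accession", "")
            for child in elem.iter()
            if local_name(child.tag) == "cvParam"
        ))
        key = (elem.get("fixedMod"), elem.get("massDelta"),
               elem.get("residues"), accessions)
        counts[key] += 1
    return [key for key, n in counts.items() if n > 1]


snippet = """
<ModificationParams>
  <SearchModification fixedMod="false" massDelta="15.994915" residues="M">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:35" name="Oxidation"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="15.994915" residues="M">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:35" name="Oxidation"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="0.9840156" residues="N">
    <cvParam cvRef="UNIMOD" accession="UNIMOD:7" name="Deamidated"/>
  </SearchModification>
</ModificationParams>
"""
for key in duplicate_search_mods(ET.fromstring(snippet)):
    print("defined more than once:", key)
```

For a file on disk, `ET.parse(path).getroot()` can be passed in place of the parsed snippet. Including the cvParam accessions in the key keeps the two distinct mods on Q (deamidation and Gln->pyro-Glu) from being flagged.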
It works now. Thank you for pointing out the mistake.
I continue to have this problem with v2018.10.15.
@verheytb Please try the latest release, https://github.com/MSGFPlus/msgfplus/releases/tag/v2019.02.14 It includes checks for duplicate modification definitions, which might be your issue. Otherwise, you'll need to send us your input file, FASTA file, and command line arguments. Since these files are typically too large for e-mail, if you send your e-mail address to proteomics@pnnl.gov I can send you a https://fx.pnnl.gov/ file transfer request, which will allow you to send us the large files using a website.
I have checked the mods file, and it does not have duplicate entries, so I suspect the problem lies elsewhere. I can send you my files. Thanks for your help,
Ted
Release 2019.02.27 includes a fix for this; when trying to fix this previously for #24, I failed to make one line of code conditional. Another change fixes a different output error relating to how MSGF+ internally tries to track protein N-term methionine cleavage, which led to incorrect start/end locations and pre/post residues on certain PeptideEvidences. Thank you @verheytb for the test files that helped me to test and fix this problem.
We discovered an issue with the 2019.02.27 release, fixed with f3a67e446890690. Please instead use Release 2019.02.28.
Excellent, thank you!
Hi,
I've found an issue with an mzid xml output from v2018.09.12 when searching my samples with a mouse proteome (UP000000589_10090.fasta, Uniprot rel:March2018).
The issue was highlighted when I went on to run msgf2pin converter.
It highlighted a duplicated entry: XML parser error at C57_F_1_target.mzid:2026098:166
Upon inspection this refers to PepEv_7658488_LSTM+16SPR_32, which violates the XSD schema. (https://www.dropbox.com/s/a6751sdwkxmrill/C57_F_1_target.mzid?dl=0)
Any help/advice appreciated.
Regards
Craig