Closed lars20070 closed 5 years ago
adapter problem or MSGF problem?
When running MSGF+ from the command line, I don't get this message...
In the temp folder one can only intercept msgfplus_mod.txt, which specifies the mods correctly. The remaining parameters are passed as options directly on the command line.
The MSGF+ output reports
Search Parameters: Instrument: LowRes
So I guess the adapter is passing on the setting correctly. But MSGF+ also reports
Loading build-in param file: HCD_QExactive_Tryp.param (screenshot above)
Are LowRes or QExactive params used? @hendrikweisser do we need to worry about the QExactive message?
With 200 ppm precursor mass tolerance and instrument = LowRes, the MSGF+ search of 3899 MS2 spectra results in just three peptide identifications.
I think the parameter should be passed on correctly by the adapter:
I can't comment on what MS-GF+ does internally and whether it might have a bug there.
@hendrikweisser We did run MSGFplus a couple of times from the command line. The message
Loading build-in param file: HCD_QExactive_Tryp.param
never turns up. The message clearly comes from MSGFplus.
Can you add a debug output for the parameters used for the MS-GF+ call in the adapter (variable "process_params", https://github.com/OpenMS/OpenMS/blob/develop/src/topp/MSGFPlusAdapter.cpp#L501), and see how that differs from your manual call?
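Once both command lines are at hand, the comparison could be done with a few lines of Python. This is only a minimal sketch: the helper names `looks_like_flag`, `parse_options` and `diff_options` are made up for this illustration, and the parser assumes that every `-flag` is followed by at most one value, which holds for the MS-GF+ call quoted in this thread.

```python
import shlex


def looks_like_flag(token):
    """'-inst' is a flag; '-1,2' (e.g. a negative isotope error range) is a value."""
    return token.startswith("-") and len(token) > 1 and not token[1].isdigit()


def parse_options(command_line):
    """Map each flag in a command line to the value that follows it (or None)."""
    tokens = shlex.split(command_line)
    options = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if looks_like_flag(token):
            # Assumption: a flag takes at most one value.
            has_value = i + 1 < len(tokens) and not looks_like_flag(tokens[i + 1])
            options[token] = tokens[i + 1] if has_value else None
            i += 2 if has_value else 1
        else:
            i += 1
    return options


def diff_options(adapter_cmd, manual_cmd):
    """Return {flag: (adapter_value, manual_value)} for every flag that differs."""
    adapter = parse_options(adapter_cmd)
    manual = parse_options(manual_cmd)
    absent = "<absent>"
    return {
        flag: (adapter.get(flag, absent), manual.get(flag, absent))
        for flag in sorted(set(adapter) | set(manual))
        if adapter.get(flag, absent) != manual.get(flag, absent)
    }
```

Calling `diff_options(adapter_call, manual_call)` with the two command lines as strings returns only the options whose values differ, so an unexpected `-inst` or `-t` value would stand out immediately.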
@hendrikweisser I did run the same TOPPAS workflow on Linux (in order to check the proposed debug output). The MSGF+ message does not appear.
Loading build-in param file: HCD_QExactive_Tryp.param
Under Windows the above message always appears, regardless of whether the QExactive or LowRes setting is chosen. In the case of LowRes data the settings from the adapter seem to be overridden and we get only a handful of hits.
@hendrikweisser I did run MSGF+ on the low-res Q Star data (-t 200ppm -inst 0) [1]. Looks fine, plenty of peptide hits. IDFileConverter produces idXML with 2,671 peptide hits. But it's not possible to open the idXML in TOPPView.
2445 peptide identification(s) without sequence and/or retention time information were removed. 0 peptide identification(s) remaining.
Guess this is a mzid to idXML file conversion bug. That might be the reason why we see no peptide IDs from MSGFplusAdapter on Q Star data.
[1] https://www.dropbox.com/sh/p5iejarad5812ki/AABiT1CdfzeUjTSx8RIZc8wRa
> I did run MSGF+ on the low-res Q Star data (-t 200ppm -inst 0) [1].
Did you run MS-GF+ by itself or via MSGFPlusAdapter?
> But it's not possible to open the idXML in TOPPView.
The reason is that the retention times are missing. In mzid, retention times are stored as "cvParam" elements in "SpectrumIdentificationResult":
<cvParam accession="MS:1000894" cvRef="PSI-MS" name="retention time" value="5432.1"/>
This information is missing in your file. I don't know why that's the case.
> Guess this is a mzid to idXML file conversion bug.
No, the information isn't there in the first place.
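To check quickly whether a given mzid file carries retention times at all, something like the following could be used. It is a rough sketch, not part of OpenMS: it relies only on the element name `SpectrumIdentificationResult` and the accession `MS:1000894` quoted above, and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

RT_ACCESSION = "MS:1000894"  # PSI-MS accession for "retention time", as quoted above


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def count_results_with_rt(mzid_source):
    """Return (results_with_rt, total_results) for an mzid path or file object."""
    with_rt = 0
    total = 0
    for _, elem in ET.iterparse(mzid_source, events=("end",)):
        if local_name(elem.tag) != "SpectrumIdentificationResult":
            continue
        total += 1
        # Only direct cvParam children are checked, matching the description above.
        if any(
            local_name(child.tag) == "cvParam"
            and child.get("accession") == RT_ACCESSION
            for child in elem
        ):
            with_rt += 1
        elem.clear()  # free the processed subtree to keep memory use low
    return with_rt, total
```

A result such as `(0, N)` would mean that none of the N spectrum results has an RT entry, i.e. the information is already missing in the mzid and not lost during conversion.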
@hendrikweisser I did run MSGF+ by itself from the command line. Thanks for the info. Guess either MSGF+ or msconvert is the problem. The Q Star machine is pretty old.
@hendrikweisser I have the same problem with Orbitrap XL data. Here, too, the retention times are already missing in the mzid files. I ran MSGF+ from the command line.
@lars20070, @Stortebecker: Did you use mzML as input for your MS-GF+ runs? Do the input files contain retention time information, e.g.:
<scan>
<cvParam cvRef="MS" accession="MS:1000016" name="scan start time" value="2655.095703125" unitAccession="UO:0000010" unitName="second" unitCvRef="UO" />
...
@hendrikweisser Yes. We used mzML input with RT. Everything fine when opening in TOPPView.
Could it be that the retention time is optional in the mzIdentML format, and MSGF+ simply never writes it?
> Could it be that the retention time is optional in the mzIdentML format, and MSGF+ simply never writes it?
Oh, yes, you are right. I thought RT must be included since we get it from the adapter, but there we actually have to look it up in the mzML file (which I forgot about).
But mzid + mzML -> idXML in IDFileConverter is not supported yet. Right? We just tested it.
> But mzid + mzML -> idXML in IDFileConverter is not supported yet.
Nope. It would be easy to add, because the code is already in MSGFPlusAdapter (should really be factored out into a library function somewhere, though).
Just as reference, the command line options used by the adapter during the Q Star search.
java -Xmx3500m -jar /home/lars/Desktop/MSGFplus_Lars/MSGFPlus.jar -s example.mzML -o /tmp/2015-05-21_164657_LinuxSchilling_20055_1/msgfplus_output.mzid -d uniprot-koli-k12-nov24-2011-plus-shuffled.fasta -t 200ppm -ti 0,1 -tda 0 -m 0 -inst 0 -e 5 -protocol 0 -ntt 1 -minLength 7 -maxLength 50 -minCharge 2 -maxCharge 4 -n 1 -addFeatures 0 -thread 1 -mod /tmp/2015-05-21_164657_LinuxSchilling_20055_1/msgfplus_mods.txt
@hendrikweisser Straight out of the search engine the number of PeptideIdentifications (spectra) and PeptideHits should be identical. Right? For example, FileInfo on a normal search result delivers
spectra: 2567 peptide hits: 2567
But the idXML straight out of the MSGFPlusAdapter has
spectra: 3 peptide hits: 3512
How is this possible? - In conclusion, mzid contains many peptide hits (although without RT). But they are written into merely three PeptideIdentifications.
@hendrikweisser I opened the mzid from the command line in ProteoIDViewer. Together with the source mzML. 3512 peptide hits. Looks fine. The idXML from the MSGFPlusAdapter shows only 3 peptides when opened in TOPPView.
In fixed-mod MSGF+ searches of Q Exactive data, we see that PSMs overlap. One and the same spectrum gets a hit from the light, medium and heavy search. Without #1437 it is not possible to compare the search results from the adapter with the command-line results in TOPPView.
> Without #1437 it is not possible to compare the search results from the adapter with the command-line results in TOPPView.
Why not just compare the output files (e.g. by opening them side by side)? You don't need TOPPView for that. You can also export them to CSV with TextExporter and run some statistics on them.
I did. 3 < 3512. The results are not the same.
What happens when you compare the mzid from the adapter to the mzid from the command line?
Dear all, this topic seems to be developing into a heated debate... MSGF+ is a very powerful search engine. It would be a shame if some glitches prevented its full usage within OpenMS. @hendrikweisser: if I understand correctly, you are the expert for the MSGF+ adapter. Could you please take a look into the discrepancy between spectra and peptide hits? Thanks! Oliver
@oliverschilling:
> Could you please take a look into the discrepancy between spectra and peptide hits?
I don't want to say "no", but I'm currently busy with other things. If you don't want to wait until I have more time for this, then you/Lars will have to do some of the digging. As a first step, please compare the mzid files, to find whether the problem is with how MS-GF+ is run in the adapter or with the conversion of mzid to idXML.
The mzid are identical since both are generated by the same command line, see above. It seems the bug is in the conversion of the search result.
> The mzid are identical since both are generated by the same command line, see above.
Okay. When you convert the mzid from MS-GF+ (command line) with IDFileConverter, do you get the "correct" idXML (just without RT information)?
Lars, could you also run the MSGFPlusAdapter with "-debug 2"? We'd need to check the intermediate file "msgfplus_converted.tsv" that gets created in the temporary directory.
@hendrikweisser As suggested, I intercepted the java command line (looks good) from the adapter, executed MSGFPlus from the command line, followed by IDFileConverter and FileInfo.
-- General information --
File name: OS_CP10_P290115_260315_centroided_fromCommanLine.idXML
File type: idXML
Number of:
  runs: 1
  protein hits: 2609
  non-redundant protein hits: 2609 (only hits that differ in the accession)
  spectra: 3300
  peptide hits: 3536
  modified top-hits: 3300/3300 (100%)
  non-redundant peptide hits: 3448 (only hits that differ in sequence and/ or modifications)
  Modifications (top-hits only): Carbamidomethyl(628), Dimethyl(2919)
On the other hand, MSGFPlusAdapter followed by FileInfo results in the output below. The first idXML file (command line plus IDFileConverter) is larger by a factor of 4.7. It does not contain RT information, but it has much more metadata.
-- General information --
File name: OS_CP10_P290115_260315_centroided_fromMSGFPlusAdapter.idXML
File type: idXML
Number of:
  runs: 1
  protein hits: 2609
  non-redundant protein hits: 2609 (only hits that differ in the accession)
  spectra: 3
  peptide hits: 3500
  modified top-hits: 3/3 (100%)
  non-redundant peptide hits: 3448 (only hits that differ in sequence and/ or modifications)
  Modifications (top-hits only): Dimethyl(1)
Protein hits and non-redundant peptide hits are correct. Spectra (PeptideIdentifications) and peptide hits are not correct.
Both numbers (spectra = 3300, peptide hits = 3536) are confirmed by opening the mzid in ProteoIDViewer. Checking the tsv and going through your 'create idXML output' code will be the next step.
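To see directly how the peptide hits are distributed over the PeptideIdentifications, independent of FileInfo, a small script along these lines might help. It is a sketch under the assumption that the idXML uses the element names `PeptideIdentification` and `PeptideHit`, with hits as direct children of their identification; `summarize_idxml` is a name invented for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def summarize_idxml(idxml_source):
    """Count PeptideIdentification elements and the PeptideHit children of each."""
    root = ET.parse(idxml_source).getroot()
    hits_per_id = [
        sum(1 for child in pep_id if local_name(child.tag) == "PeptideHit")
        for pep_id in root.iter()
        if local_name(pep_id.tag) == "PeptideIdentification"
    ]
    return {
        "spectra": len(hits_per_id),
        "peptide_hits": sum(hits_per_id),
        "max_hits_per_spectrum": max(hits_per_id, default=0),
    }
```

If a handful of identifications hold thousands of hits each, `max_hits_per_spectrum` will be very large, which would point at the grouping of hits into PeptideIdentifications rather than at the search itself.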
MSGF+ returns mzid. It seems IDFileConverter and MSGFPlusAdapter are not using the same (refactored) code for the mzid -> idXML conversion. Why not?
Brief answers:
mzid RT attribute: optional -> use RTAnnotator (UTILS).
MSGFPlusAdapter is actually (only a little bit) older than that, so I'd consider the TSV-reading part of MSGFPlusAdapter deprecated.
Options: remove the TSV part of MSGFPlusAdapter and leave the output as the natively written mzid.
Thanks @mwalzer! I did not know about the RTAnnotator util.
> I did not know about the RTAnnotator util.
Me neither. I would like to have this functionality (also?) in IDFileConverter, since that's where it is for the other search engine formats. I'd be in favour of reopening #1437.
> remove MSGFPlusAdapter tsv part and leave the output with the natively written mzid.
I agree that the TSV part should be removed now that the mzid conversion is more mature. However, MSGFPlusAdapter should still produce an idXML (if required, and including RT information) to be in line with the other adapters.
Note that there are still some issues with the mzid -> idXML conversion https://github.com/OpenMS/OpenMS/issues/1446. The rank does not appear to be correct. That's all.
so what we need to do:
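// mzid -> idXML conversion
MzIdentMLFile().load(in, protein_identifications, peptide_identifications);
IdXMLFile().store(out, protein_identifications, peptide_identifications);

// RTAnnotate: copy the RT of the matching spectrum into each PeptideIdentification.
// "experiment" is assumed to hold the spectra loaded from the input mzML file.
int c = 0; // number of identifications that received an RT
for (vector<PeptideIdentification>::iterator id_it = peptide_identifications.begin(); id_it != peptide_identifications.end(); ++id_it)
{
  String scannumber = String(id_it->getMetaValue("spectrum_reference"));
  for (MSExperiment<>::Iterator exp_it = experiment.begin(); exp_it != experiment.end(); ++exp_it)
  {
    if (exp_it->getNativeID() == scannumber) // spectrum_reference is compared with the native ID
    {
      id_it->setRT(exp_it->getRT());
      ++c;
      break; // the first matching spectrum wins
    }
  }
}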
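The nested loop above scans all spectra for every identification. The same lookup can be done with an index built once. The following is only an illustration of that idea, written in Python with plain dictionaries and tuples rather than OpenMS types; `annotate_rt` and the `spectrum_reference` / `rt` keys are hypothetical stand-ins for the real classes.

```python
def annotate_rt(identifications, spectra):
    """Set 'rt' on each identification from the spectrum with the same native ID.

    identifications: list of dicts with a 'spectrum_reference' key
    spectra: iterable of (native_id, rt) pairs, e.g. read from the mzML
    Returns the number of identifications that received an RT.
    """
    rt_by_native_id = {}
    for native_id, rt in spectra:
        # Keep the first spectrum per native ID, like the 'break' in the loop above.
        rt_by_native_id.setdefault(native_id, rt)

    annotated = 0
    for ident in identifications:
        rt = rt_by_native_id.get(ident["spectrum_reference"])
        if rt is not None:
            ident["rt"] = rt
            annotated += 1
    return annotated
```

With a few thousand spectra the difference hardly matters, but the returned count is useful in either version: if it is smaller than the number of identifications, some spectrum references did not match any native ID in the mzML.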
what is not clear to me:
@timosachsenberg Here are two idXML files from the same MS-GF+ search: [1] from MS-GF+ run on the command line, then converted using IDFileConverter (i.e. mzid -> idXML), and [2] from the MS-GF+ adapter (i.e. mzid -> tsv -> idXML).
[1] https://gist.github.com/lars20070/6b56821e0d73788fa35d [2] https://gist.github.com/lars20070/65c7e760a04be4c99f4f
Here is a peptide sequence from the idXML from MS-GF+ command line plus IDFileConverter.
<PeptideIdentification score_type="MS-GF:EValue" higher_score_better="true" significance_threshold="0" MZ="414.754455566406" spectrum_reference="sample=1 period=1 cycle=2006 experiment=3" >
<PeptideHit score="6.184501" sequence="(Dimethyl)VSAEK(Dimethyl)EK(Dimethyl)ALSLLAGR" charge="4" aa_before="E" aa_after="I" protein_refs="PH_530" >
<UserParam type="string" name="MS:1002049" value="-14"/>
<UserParam type="string" name="MS:1002050" value="40"/>
<UserParam type="string" name="MS:1002052" value="2.3052996E-6"/>
<UserParam type="string" name="MS:1002053" value="6.184501"/>
<UserParam type="string" name="AssumedDissociationMethod" value="HCD"/>
<UserParam type="string" name="IsotopeError" value="0"/>
<UserParam type="float" name="calcMZ" value="414.756805419922"/>
<UserParam type="int" name="start" value="52"/>
<UserParam type="int" name="end" value="66"/>
<UserParam type="string" name="target_decoy" value="target"/>
</PeptideHit>
</PeptideIdentification>
And here from the MSGFPlusAdapter. The PeptideHits are not wrapped in PeptideIdentifications and are incomplete.
<PeptideHit score="2.3052996e-06" sequence="(Dimethyl)VSAEK(Dimethyl)EK(Dimethyl)ALSLLAGR" charge="4" aa_before="E" aa_after="I" protein_refs="PH_2078" >
Ok there seem to be some differences: e.g.
search_engine="MSGFPlus" vs. "MSGF+"
search_engine_version="" vs. "Beta (v10089)"
missed_cleavages="0" precursor_peak_tolerance="0.133333333333333" peak_mass_tolerance="0" vs. missed_cleavages="1000" precursor_peak_tolerance="200" peak_mass_tolerance="0"
Have we actually found out what exactly causes the failure in MSGFPlusAdapter?
I guess it is the same issue that I have on my Mac. @hendrikweisser
@ypriverol:
> I guess it is the same issue that I have on my Mac.
This issue is about the adapter working incorrectly for some datasets, not about it not working at all (like #1764 and #1771). So I don't think this is the issue that you have (at least not as far as I know).
Any updates on this issue? Has it been resolved in the meantime?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Despite choosing the setting instrument = low_res the Q Exactive settings are used by MSGF+, see screenshot.