instrument settings in MSGFplusAdapter

lars20070 commented 9 years ago

Despite choosing the setting instrument = low_res, MSGF+ uses the Q Exactive settings (see screenshot).

[Screenshot: msgfplusadapter]

cbielow commented 9 years ago

adapter problem or MSGF problem?

oliverschilling commented 9 years ago

When running MSGF+ from the command line, I don't get this message...

lars20070 commented 9 years ago

In the temp folder one can only intercept msgfplus_mods.txt, which specifies the mods correctly. The remaining parameters are passed as options directly on the command line.

lars20070 commented 9 years ago

The MSGF+ output reports

Search Parameters: Instrument: LowRes

So I guess the adapter is passing on the setting correctly. But MSGF+ also reports

Loading build-in param file: HCD_QExactive_Tryp.param (screenshot above)

Are LowRes or QExactive params used? @hendrikweisser do we need to worry about the QExactive message?

lars20070 commented 9 years ago

With 200 ppm precursor mass tolerance and instrument = LowRes, the MSGF+ search of 3899 MS2 spectra results in just three peptide identifications.

hendrikweisser commented 9 years ago

I think the parameter should be passed on correctly by the adapter:

I can't comment on what MS-GF+ does internally and whether it might have a bug there.

lars20070 commented 9 years ago

@hendrikweisser We did run MSGFplus a couple of times from the command line. The message

Loading build-in param file: HCD_QExactive_Tryp.param

never turns up. The message clearly comes from MSGFplus.

hendrikweisser commented 9 years ago

Can you add a debug output for the parameters used for the MS-GF+ call in the adapter (variable "process_params", https://github.com/OpenMS/OpenMS/blob/develop/src/topp/MSGFPlusAdapter.cpp#L501), and see how that differs from your manual call?

lars20070 commented 9 years ago

@hendrikweisser I ran the same TOPPAS workflow on Linux (in order to check the proposed debug output). The following MSGF+ message does not appear:

Loading build-in param file: HCD_QExactive_Tryp.param

Under Windows, the above message always appears, whether the setting is QExactive or LowRes. For LowRes data, the settings from the adapter seem to be overridden, and we get only a handful of hits.

lars20070 commented 9 years ago

@hendrikweisser I did run MSGF+ on the low-res Q Star data (-t 200ppm -inst 0) [1]. Looks fine, plenty of peptide hits. IDFileConverter produces idXML with 2,671 peptide hits. But it's not possible to open the idXML in TOPPView.

2445 peptide identification(s) without sequence and/or retention time information were removed. 0 peptide identification(s) remaining.

Guess this is a mzid to idXML file conversion bug. That might be the reason why we see no peptide IDs from MSGFplusAdapter on Q Star data.

[1] https://www.dropbox.com/sh/p5iejarad5812ki/AABiT1CdfzeUjTSx8RIZc8wRa

hendrikweisser commented 9 years ago

> I did run MSGF+ on the low-res Q Star data (-t 200ppm -inst 0) [1].

Did you run MS-GF+ by itself or via MSGFPlusAdapter?

> But it's not possible to open the idXML in TOPPView.

The reason is that the retention times are missing. In mzid, retention times are stored as "cvParam" elements in "SpectrumIdentificationResult": <cvParam accession="MS:1000894" cvRef="PSI-MS" name="retention time" value="5432.1"/>

This information is missing in your file. I don't know why that's the case.

> Guess this is a mzid to idXML file conversion bug.

No, the information isn't there in the first place.
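
Whether the retention time is present in an mzid file can be checked without opening it in a viewer. The following is a minimal sketch, not OpenMS code: it counts the SpectrumIdentificationResult elements and how many of them carry the "retention time" cvParam (MS:1000894) quoted above. The function names are hypothetical, and it assumes the cvParam sits directly under SpectrumIdentificationResult, as described in this comment.

```python
import xml.etree.ElementTree as ET

RT_ACCESSION = "MS:1000894"  # PSI-MS accession for "retention time", as quoted above


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def count_results_with_rt(mzid_source):
    """Return (results_total, results_with_rt) for an mzid path or file object."""
    total = with_rt = 0
    for _, elem in ET.iterparse(mzid_source, events=("end",)):
        if local_name(elem.tag) != "SpectrumIdentificationResult":
            continue
        total += 1
        # Only direct children are checked; cvParams nested inside the
        # SpectrumIdentificationItem elements are scores, not the RT.
        if any(local_name(child.tag) == "cvParam"
               and child.get("accession") == RT_ACCESSION
               for child in elem):
            with_rt += 1
        elem.clear()  # release the finished element to keep memory flat
    return total, with_rt
```

A result where the second number is 0 would match the situation described here: the mzid simply has no RT to convert.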

lars20070 commented 9 years ago

@hendrikweisser I did run MSGF+ by itself from the command line. Thanks for the info. Guess either MSGF+ or msconvert is the problem. The Q Star machine is pretty old.

Stortebecker commented 9 years ago

@hendrikweisser I have the same problem with Orbitrap XL data. Here, too, the retention times are already missing in the mzid files. I ran MSGF+ from the command line.

hendrikweisser commented 9 years ago

@lars20070, @Stortebecker: Did you use mzML as input for your MS-GF+ runs? Do the input files contain retention time information, e.g.:

<scan>
    <cvParam cvRef="MS" accession="MS:1000016" name="scan start time" value="2655.095703125" unitAccession="UO:0000010" unitName="second" unitCvRef="UO" />
...
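
To answer this question for a whole file rather than by eye, the scan start times can be collected per spectrum. This is a rough sketch under stated assumptions, not OpenMS code: the function name is hypothetical, the spectrum's `id` attribute is taken as the native ID, and values given in minutes (unit accession UO:0000031) are converted to seconds.

```python
import xml.etree.ElementTree as ET

SCAN_START_TIME = "MS:1000016"  # accession shown in the snippet above
UNIT_MINUTE = "UO:0000031"      # unit ontology: minute (UO:0000010 is second)


def rt_by_native_id(mzml_source):
    """Map each spectrum's native ID to its scan start time in seconds."""
    rts = {}
    for _, elem in ET.iterparse(mzml_source, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] != "spectrum":
            continue
        # Search the whole spectrum subtree for the scan start time.
        for cv in elem.iter():
            if cv.get("accession") == SCAN_START_TIME:
                seconds = float(cv.get("value"))
                if cv.get("unitAccession") == UNIT_MINUTE:
                    seconds *= 60.0
                rts[elem.get("id")] = seconds
                break
        elem.clear()  # release the finished spectrum
    return rts
```

Spectra without a scan start time are left out of the map, so comparing its size with the number of spectra shows how many lack RT information.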

lars20070 commented 9 years ago

@hendrikweisser Yes. We used mzML input with RT. Everything fine when opening in TOPPView.

Could it be that the retention time is optional in the mzIdentML format, and MSGF+ simply never writes it?

hendrikweisser commented 9 years ago

> Could it be that the retention time is optional in the mzIdentML format, and MSGF+ simply never writes it?

Oh, yes, you are right. I thought RT must be included since we get it from the adapter, but there we actually have to look it up in the mzML file (which I forgot about).

lars20070 commented 9 years ago

But mzid + mzML -> idXML in IDFileConverter is not supported yet. Right? We just tested it.

hendrikweisser commented 9 years ago

> But mzid + mzML -> idXML in IDFileConverter is not supported yet.

Nope. It would be easy to add, because the code is already in MSGFPlusAdapter (should really be factored out into a library function somewhere, though).

lars20070 commented 9 years ago

Just for reference, here are the command line options used by the adapter during the Q Star search.

java -Xmx3500m -jar /home/lars/Desktop/MSGFplus_Lars/MSGFPlus.jar -s example.mzML -o /tmp/2015-05-21_164657_LinuxSchilling_20055_1/msgfplus_output.mzid -d uniprot-koli-k12-nov24-2011-plus-shuffled.fasta -t 200ppm -ti 0,1 -tda 0 -m 0 -inst 0 -e 5 -protocol 0 -ntt 1 -minLength 7 -maxLength 50 -minCharge 2 -maxCharge 4 -n 1 -addFeatures 0 -thread 1 -mod /tmp/2015-05-21_164657_LinuxSchilling_20055_1/msgfplus_mods.txt
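
To compare a command line like the one above with a manual call, as suggested earlier in the thread, the two can be diffed option by option. The sketch below is only an illustration: the function names are hypothetical, it assumes every MS-GF+ option after the jar path is a "-flag value" pair (true for the call shown here), and it ignores -o and -mod by default because they point to temporary paths.

```python
import shlex


def msgf_options(command_line):
    """Return the MS-GF+ options (everything after '-jar <path>') as a dict."""
    tokens = shlex.split(command_line)
    start = tokens.index("-jar") + 2  # skip '-jar' and the jar path
    args = tokens[start:]
    if len(args) % 2:
        raise ValueError("expected '-flag value' pairs after the jar path")
    return dict(zip(args[0::2], args[1::2]))


def differing_options(cmd_a, cmd_b, ignore=("-o", "-mod")):
    """List (flag, value_a, value_b) for options that differ between two calls."""
    a, b = msgf_options(cmd_a), msgf_options(cmd_b)
    return [(flag, a.get(flag), b.get(flag))
            for flag in sorted(set(a) | set(b))
            if flag not in ignore and a.get(flag) != b.get(flag)]
```

An option present in only one call shows up with None on the other side, which makes defaults that the adapter sets explicitly easy to spot.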

lars20070 commented 9 years ago

@hendrikweisser Straight out of the search engine the number of PeptideIdentifications (spectra) and PeptideHits should be identical. Right? For example, FileInfo on a normal search result delivers

spectra: 2567 peptide hits: 2567

But the idXML straight out of the MSGFPlusAdapter has

spectra: 3 peptide hits: 3512

How is this possible? - In conclusion, mzid contains many peptide hits (although without RT). But they are written into merely three PeptideIdentifications.
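
One way to see where the hits end up is to count the PeptideHit elements inside each PeptideIdentification of the idXML. This is a minimal sketch, not FileInfo's implementation: the function names are hypothetical, and it assumes the idXML elements carry no XML namespace and that PeptideHit is a direct child of PeptideIdentification.

```python
import xml.etree.ElementTree as ET


def hits_per_identification(idxml_source):
    """Return the number of PeptideHit children for each PeptideIdentification."""
    root = ET.parse(idxml_source).getroot()
    return [len(pep_id.findall("PeptideHit"))
            for pep_id in root.iter("PeptideIdentification")]


def summarize(idxml_source):
    """Summarize how peptide hits are distributed over identifications."""
    counts = hits_per_identification(idxml_source)
    return {"spectra": len(counts),
            "peptide hits": sum(counts),
            "max hits in one identification": max(counts, default=0)}
```

If thousands of hits sit in a handful of identifications, the largest count will be far above the one or two hits per spectrum that a normal search result shows.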

lars20070 commented 9 years ago

@hendrikweisser I opened the mzid from the command line in ProteoIDViewer, together with the source mzML: 3512 peptide hits. Looks fine. The idXML from the MSGFPlusAdapter shows only 3 peptides when opened in TOPPView.

lars20070 commented 9 years ago

In fixed-mod MSGF+ searches of Q Exactive data, we see that PSMs overlap. One and the same spectrum gets a hit from the light, medium and heavy search. Without #1437 it is not possible to compare the search results from the adapter with the command-line results in TOPPView.

[Screenshot: overlaps]

hendrikweisser commented 9 years ago

> Without #1437 it is not possible to compare the search results from the adapter with the command-line results in TOPPView.

Why not just compare the output files (e.g. by opening them side by side)? You don't need TOPPView for that. You can also export them to CSV with TextExporter and run some statistics on them.

lars20070 commented 9 years ago

I did. 3 < 3512. The results are not the same.

hendrikweisser commented 9 years ago

What happens when you compare the mzid from the adapter to the mzid from the command line?

oliverschilling commented 9 years ago

Dear all, this topic seems to be developing into a heated debate... MSGF+ is a very powerful search engine. It would be a shame if some glitches prevented its full use within OpenMS. @hendrikweisser: if I understand correctly, you are the expert for the MSGF+ adapter. Could you please take a look into the discrepancy between spectra and peptide hits? Thanks! Oliver

hendrikweisser commented 9 years ago

@oliverschilling:

> Could you please take a look into the discrepancy between spectra and peptide hits?

I don't want to say "no", but I'm currently busy with other things. If you don't want to wait until I have more time for this, then you/Lars will have to do some of the digging. As a first step, please compare the mzid files, to find whether the problem is with how MS-GF+ is run in the adapter or with the conversion of mzid to idXML.
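
The mzid comparison asked for here could be scripted roughly as follows. This is a sketch under assumptions, not an OpenMS tool: the function names are hypothetical, each PSM is reduced to (spectrumID, peptide sequence, rank), and modifications are ignored. The peptide_ref is resolved to its sequence because such IDs are internal to a file and may not match between two runs.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip an XML namespace prefix from a tag."""
    return tag.rsplit("}", 1)[-1]


def psm_keys(mzid_source):
    """Collect (spectrumID, peptide sequence, rank) for every identification item."""
    root = ET.parse(mzid_source).getroot()
    # Resolve Peptide ids to sequences first.
    sequences = {}
    for pep in root.iter():
        if _local(pep.tag) == "Peptide":
            for child in pep:
                if _local(child.tag) == "PeptideSequence":
                    sequences[pep.get("id")] = (child.text or "").strip()
    keys = set()
    for result in root.iter():
        if _local(result.tag) != "SpectrumIdentificationResult":
            continue
        for item in result:
            if _local(item.tag) == "SpectrumIdentificationItem":
                ref = item.get("peptide_ref", "")
                keys.add((result.get("spectrumID", ""),
                          sequences.get(ref, ref),
                          item.get("rank", "")))
    return keys


def compare_mzid(source_a, source_b):
    """Return PSM keys found only in the first file and only in the second."""
    a, b = psm_keys(source_a), psm_keys(source_b)
    return sorted(a - b), sorted(b - a)
```

Two empty lists would point away from the MS-GF+ call and towards the conversion step.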

lars20070 commented 9 years ago

The mzid are identical since both are generated by the same command line, see above. It seems the bug is in the conversion of the search result.

hendrikweisser commented 9 years ago

> The mzid are identical since both are generated by the same command line, see above.

Okay. When you convert the mzid from MS-GF+ (command line) with IDFileConverter, do you get the "correct" idXML (just without RT information)?

hendrikweisser commented 9 years ago

Lars, could you also run the MSGFPlusAdapter with "-debug 2"? We'd need to check the intermediate file "msgfplus_converted.tsv" that gets created in the temporary directory.

lars20070 commented 9 years ago

@hendrikweisser As suggested, I intercepted the java command line (looks good) from the adapter, executed MSGFPlus from the command line, followed by IDFileConverter and FileInfo.

    -- General information --

    File name: OS_CP10_P290115_260315_centroided_fromCommanLine.idXML
    File type: idXML
    Number of:
      runs: 1
      protein hits: 2609
      non-redundant protein hits: 2609
      (only hits that differ in the accession)

      spectra: 3300
      peptide hits: 3536
      modified top-hits: 3300/3300 (100%)
      non-redundant peptide hits: 3448
      (only hits that differ in sequence and/ or modifications)
    Modifications (top-hits only): Carbamidomethyl(628), Dimethyl(2919)

In contrast, MSGFPlusAdapter followed by FileInfo gives the output below. The first idXML file is larger by a factor of 4.7. It does not contain RT information, but much more metadata.

    -- General information --

    File name: OS_CP10_P290115_260315_centroided_fromMSGFPlusAdapter.idXML
    File type: idXML
    Number of:
      runs: 1
      protein hits: 2609
      non-redundant protein hits: 2609
      (only hits that differ in the accession)

      spectra: 3
      peptide hits: 3500
      modified top-hits: 3/3 (100%)
      non-redundant peptide hits: 3448
      (only hits that differ in sequence and/ or modifications)
    Modifications (top-hits only): Dimethyl(1)

lars20070 commented 9 years ago

Protein hits and non-redundant peptide hits are correct. Spectra (PeptideIdentifications) and peptide hits are not correct.

Both numbers (spectra = 3300, peptide hits = 3536) are confirmed by opening the mzid in ProteoIDViewer. Checking the tsv and going through your 'create idXML output' code will be the next step.

lars20070 commented 9 years ago

MSGF+ returns mzid. It seems IDFileConverter and MSGFPlusAdapter are not using the same (refactored) code for the mzid -> idXML conversion. Why not?

mwalzer commented 9 years ago

Brief answers:

mzid RT attribute: optional -> use RTAnnotator (UTILS).

The MSGFPlus adapter is actually (only a little bit) older than that; I'd consider the TSV-reading part of MSGFPlusAdapter deprecated.

Options: remove the MSGFPlusAdapter TSV part and leave the output as the natively written mzid.

lars20070 commented 9 years ago

Thanks @mwalzer! I did not know about the RTAnnotator util.

hendrikweisser commented 9 years ago

> I did not know about the RTAnnotator util.

Me neither. I would like to have this functionality (also?) in IDFileConverter, since that's where it is for the other search engine formats. I'd be in favour of reopening #1437.

> remove MSGFPlusAdapter tsv part and leave the output with the natively written mzid.

I agree that the TSV part should be removed now that the mzid conversion is more mature. However, MSGFPlusAdapter should still produce an idXML (if required, and including RT information) to be in line with the other adapters.

lars20070 commented 9 years ago

Note that there are still some issues with the mzid -> idXML conversion, see https://github.com/OpenMS/OpenMS/issues/1446. The rank does not appear to be correct. That's all.

timosachsenberg commented 9 years ago

so what we need to do:

use IDFileConverter code in MSGFPlusAdapter to convert mzid to idXML

        MzIdentMLFile().load(in, protein_identifications, peptide_identifications);
        IdXMLFile().store(out, protein_identifications, peptide_identifications);

use RTAnnotator code to annotate RT and check if this code can be factored out of the tools and reused

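    // RTAnnotate: copy the retention time from the matching spectrum
    // onto each peptide identification.
    int c = 0; // number of identifications that received an RT
    for (vector<PeptideIdentification>::iterator id_it = peptides.begin(); id_it != peptides.end(); ++id_it)
    {
      // native ID of the spectrum this identification refers to
      String scannumber = String(id_it->getMetaValue("spectrum_reference"));
      for (MSExperiment<>::Iterator exp_it = experiment.begin();
           exp_it != experiment.end(); ++exp_it)
      {
        if (exp_it->getNativeID() == scannumber)
        {
          id_it->setRT(exp_it->getRT());
          ++c;
          break; // first match wins
        }
      }
    }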
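
The loop above scans all spectra once per identification. If this code is factored out, one option is to build a lookup from native ID to RT once and then annotate by dictionary access. The sketch below illustrates that idea only. It does not use the OpenMS API: plain dicts stand in for PeptideIdentification objects, and the function name is hypothetical.

```python
def annotate_rt(peptide_ids, rt_lookup):
    """Set 'rt' on each identification whose spectrum reference is known.

    peptide_ids: list of dicts with a 'spectrum_reference' key
                 (stand-ins for PeptideIdentification objects).
    rt_lookup:   dict mapping spectrum native ID -> retention time.
    Returns the number of identifications that were annotated.
    """
    annotated = 0
    for pep_id in peptide_ids:
        rt = rt_lookup.get(pep_id.get("spectrum_reference"))
        if rt is not None:
            pep_id["rt"] = rt
            annotated += 1
    return annotated
```

The returned count plays the role of the counter `c` in the snippet above; comparing it with the number of identifications shows how many could not be matched to a spectrum.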

timosachsenberg commented 9 years ago

what is not clear to me:

why does this occur for QStar data?

lars20070 commented 9 years ago

@timosachsenberg Here are two idXML files from the same MS-GF+ search: [1] MS-GF+ run from the command line, then converted using IDFileConverter (i.e. mzid -> idXML), and [2] from the MS-GF+ adapter (i.e. mzid -> tsv -> idXML).

[1] https://gist.github.com/lars20070/6b56821e0d73788fa35d
[2] https://gist.github.com/lars20070/65c7e760a04be4c99f4f

lars20070 commented 9 years ago

Here is a peptide hit from the idXML produced by MS-GF+ on the command line plus IDFileConverter.

        <PeptideIdentification score_type="MS-GF:EValue" higher_score_better="true" significance_threshold="0" MZ="414.754455566406" spectrum_reference="sample=1 period=1 cycle=2006 experiment=3" >
            <PeptideHit score="6.184501" sequence="(Dimethyl)VSAEK(Dimethyl)EK(Dimethyl)ALSLLAGR" charge="4" aa_before="E" aa_after="I" protein_refs="PH_530" >
                <UserParam type="string" name="MS:1002049" value="-14"/>
                <UserParam type="string" name="MS:1002050" value="40"/>
                <UserParam type="string" name="MS:1002052" value="2.3052996E-6"/>
                <UserParam type="string" name="MS:1002053" value="6.184501"/>
                <UserParam type="string" name="AssumedDissociationMethod" value="HCD"/>
                <UserParam type="string" name="IsotopeError" value="0"/>
                <UserParam type="float" name="calcMZ" value="414.756805419922"/>
                <UserParam type="int" name="start" value="52"/>
                <UserParam type="int" name="end" value="66"/>
                <UserParam type="string" name="target_decoy" value="target"/>
            </PeptideHit>
        </PeptideIdentification>

And here is the same peptide from the MSGFPlusAdapter. The PeptideHits are not wrapped in PeptideIdentifications and are incomplete.

            <PeptideHit score="2.3052996e-06" sequence="(Dimethyl)VSAEK(Dimethyl)EK(Dimethyl)ALSLLAGR" charge="4" aa_before="E" aa_after="I" protein_refs="PH_2078" >

timosachsenberg commented 9 years ago

OK, there seem to be some differences, e.g.:

search_engine="MSGFPlus" vs. "MSGF+"
search_engine_version="" vs. "Beta (v10089)"
missed_cleavages="0" precursor_peak_tolerance="0.133333333333333" peak_mass_tolerance="0" vs. missed_cleavages="1000" precursor_peak_tolerance="200" peak_mass_tolerance="0"

hendrikweisser commented 9 years ago

Have we actually found out what exactly causes the failure in MSGFPlusAdapter?

ypriverol commented 8 years ago

I guess this is the same issue that I have on my Mac. @hendrikweisser

hendrikweisser commented 8 years ago

@ypriverol:

> I guess this is the same issue that I have on my Mac.

This issue is about the adapter working incorrectly for some datasets, not about it not working at all (like #1764 and #1771). So I don't think this is the issue that you have (at least not as far as I know).

hendrikweisser commented 8 years ago

Any updates on this issue? Has it been resolved in the meantime?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


instrument settings in MSGFplusAdapter #1409