MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.
Other
73 stars 36 forks source link

Wrong rank in the result #86

Open KaiLiCn opened 4 years ago

KaiLiCn commented 4 years ago

Hi,

I ran latest MS-GF+ to test my unspecific cleavage data, and I found there might be a wrong rank in the mzid result file. image AFNPEPLVL has 6 ion matches and DDIPSALRI only has 1 match for the same spectrum. But AFNPEPLVL is rank2. And I generated a mirror spectrum by PDV for these two PSMs: image The rank2 PSM has much better ion matches.

Can you take a look? This is my parameters: image For modifications, I only set Variable (dynamic): Oxidation on M (+15.9949).

This is test data: test.zip

Best regards, Kai

alchemistmatt commented 4 years ago

The rank 1 result has SpecEValue = 1.1E-7 and EValue = 8.86E-5 The rank 2 result has SpecEValue = 2.1E-4 and EValue = 0.156

These are both poor scores, and the NumMatchedMainIons count is thus unreliable. Looking at your mirror spectrum, the rank 2 result might be correct; manually inspecting results and making judgement calls is an option, albeit a very time consuming option.

MS-GF+ works well (for us) most of the time, but there will always be false negatives and false positives. Your analysis parameters look valid to me (though I'd use -t 10 ppm instead of -t 10.0ppm). Also, why are you limiting the maxcharge to 3? That seems too small (though, admittedly, the majority of results are usually 2+ or 3+). Oh, and I suggest -maxlength 50

Note that MS-GF+ now supports parameter files (aka config files). I suggest you start with https://github.com/MSGFPlus/msgfplus/blob/master/docs/ParameterFiles/MSGFPlus_PartTryp_MetOx_20ppmParTol.txt and update it to have InstrumentID=3 and PrecursorMassTolerance=10ppm In most cases you should be using NTT=1 and not NTT=0

Finally, I suggest you use TDA=1 so that you get QValues. If you repeat the search with TDA=1 you'll find that these results have horrible QValues (essentially FDR); a QValue of 0.05 means a 5% FDR.

https://msgfplus.github.io/msgfplus/MSGFPlus.html explains how to reference a parameter file using -conf

alchemistmatt commented 4 years ago

What program did you use to make the mirror spectrum? It is nicely annotated.

wenbostar commented 4 years ago

The rank 1 result has SpecEValue = 1.1E-7 and EValue = 8.86E-5 The rank 2 result has SpecEValue = 2.1E-4 and EValue = 0.156

These are both poor scores, and the NumMatchedMainIons count is thus unreliable. Looking at your mirror spectrum, the rank 2 result might be correct; manually inspecting results and making judgement calls is an option, albeit a very time consuming option.

MS-GF+ works well (for us) most of the time, but there will always be false negatives and false positives. Your analysis parameters look valid to me (though I'd use -t 10 ppm instead of -t 10.0ppm). Also, why are you limiting the maxcharge to 3? That seems too small (though, admittedly, the majority of results are usually 2+ or 3+). Oh, and I suggest -maxlength 50

For this specific example, based on the annotated spectra, rank 2 is obviously better than rank 1. The scores from the PSMs of the same spectrum usually should highly correlate with the numbers of matched ions for those PSMs. It's weird to me MS-GF+ got much lower EValue for the rank1 PSM compared with the rank2 PSM.

KaiLiCn commented 4 years ago

Thank you for prompt reply and comments. Actually, this is an immunopeptidome analysis, so the peptide length usually is 8-25 and e=1, ntt=0 are for nonspecific enzyme cleavage. And for maxcharge parameter, I think it isn't charge limitation, right? image

And I generated the mirror spectrum by PDV. https://github.com/wenbostar/PDV

FarmGeek4Life commented 4 years ago

For this specific example, based on the annotated spectra, rank 2 is obviously better than rank 1. The scores from the PSMs of the same spectrum usually should highly correlate with the numbers of matched ions for those PSMs. It's weird to me MS-GF+ got much lower EValue for the rank1 PSM compared with the rank2 PSM.

What appears to have happened in this case is the combined statistics for the matched ions in the Rank 1 result are better than the combined statistics for the matched ions in the Rank 2 result (see the 'MeanError...' and 'StdevError...' userParams). I don't know what all of the thresholds are, but the number of matched ions appears to not be significantly weighted when calculating those statistics or the "MS-GF:RawScore".

This may be an edge case for MS-GF+, in terms of having 2 short peptides with the same length and nearly identical mass (only 4ppm difference).

Anyway, if you wanted to re-run this search with -t 5ppm, the rank 1 result will disappear; its precursor mass error is -6.59ppm vs. -2.56ppm for the rank 2 result. However, this would not significantly change the low-confidence score from MS-GF+.

gsaxena888 commented 4 years ago

@KaiLiCn I'm not sure if this helps and I may be wrong, but I noticed some odd scenarios where the mzid output file was not displaying the correctly chosen "best" peptide for a spectra. After lots of trial-and-error and running the code through the debugger, I decided to comment out the logic that created a mzid file and instead just dump out the results to tsv, and that seemed to correct the problem. Here's the code that I inserted in the "MSGFPlus.java" file in the method entitled private static String runMSGFPlus(int ioIndex, SpecFileFormat specFormat, File outputFile, SearchParams params) {

the snippet of code to dump to tsv -->

        //gs
        try {
            final String SAMPLE_CSV_FILE = "c:/temp/gsmsgfplusout.tsv";
            BufferedWriter writer = Files.newBufferedWriter(Paths.get(SAMPLE_CSV_FILE));
            CSVPrinter csvPrinter = new CSVPrinter(writer, CSVFormat.TDF.withHeader("SpecIndex", "charge", "prec m/z", "PepMass", "PepSeq", "Accession(s)", "SpecEVal", "DeNovoScore", "Score", "FinalEVal"));
            final CompactFastaSequence seq = sa.getSequence();
            resultList
                .stream()
                .forEach(msgfPlusMatch -> {
                    msgfPlusMatch.getMatchList()
                        .stream()
                        .forEach(match -> {
                          //  System.out.println("match idx : " + match.getIndex() + " and pepseq " + match.getPepSeq() + " and eval " + match.getSpecEValue() + " and denovo " + match.getDeNovoScore() + " and score " + match.getScore());
                            try {
                                //DBSequence dbSeq = getDBSequence(protStartIndex);
                                final String stringOfAccessions = match.getIndices()
                                        .stream()
                                        .map(index -> {
                                        final int protStartIndex = (int) seq.getStartPosition(index);
                                        final String annotation = seq.getAnnotation(protStartIndex);
                                        //String proteinSeq = seq.getMatchingEntry(protStartIndex);
                                        final String accession = annotation.split("\\s+")[0];
                                        return accession;
                                    })
                                    .collect(Collectors.toList())
                                    .toString();

                                final int length = match.getLength();        // Peptide length + 2
                                final int numPeptides = sa.getNumDistinctPeptides(params.getEnzyme() == null ? length - 2 : length - 1);
                                final double specEValue = match.getSpecEValue();
                                final double eValue = specEValue * numPeptides;

                                final int charge = match.getCharge();

                                final float peptideMass = match.getPeptideMass();
//              float theoMass = peptideMass + (float)Composition.H2O;
//              float theoMz = theoMass/charge;
                                final float theoMz = (peptideMass + (float) Composition.H2O) / charge + (float) Composition.ChargeCarrierMass();

                                csvPrinter.printRecord(msgfPlusMatch.getSpecIndex(), charge, theoMz, peptideMass, match.getPepSeq(), stringOfAccessions, specEValue, match.getDeNovoScore(), match.getScore(), eValue);
                            } catch (Exception ex) {
                                throw new RuntimeException(ex);
                            }

                        });
            });
            csvPrinter.flush();
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
        //end gs

I added the above code just after the following lines:

// Sort by spectral E-values then write to disk

        long saveResultsStartTime = System.currentTimeMillis();

        System.out.println("Writing results...");
        Collections.sort(resultList);

and I also then commented out the ouputting to mzid, as follows:


      //  MZIdentMLGen mzidGen = new MZIdentMLGen(params, aaSet, sa, specAcc, ioIndex);
      //  mzidGen.addSpectrumIdentificationResults(resultList);

        //mzidGen.writeResults(outputFile);

If you do try this hack, PLEASE LET ME KNOW if it solves/helps your problem, as it'll validate that my hunch was right. (I still didn't figure out the root cause of the problem, but my guess is that it's somewhere in the mzid generation; that said, I also hacked around the code a lot, so there's a possibility that it was my own hacking that caused problems, but I'm at least 50% sure that that is not the case....)

(Forgot to mention: You'll also need to add the following entry to your pom.xml file:

  <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-csv -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-csv</artifactId>
            <version>1.7</version>
        </dependency>