chhh / MSFTBX

MS File ToolBox - tools for parsing some mass-spectrometry related file formats (mzML, mzXML, pep.xml, prot.xml, etc.)
Apache License 2.0

ROC data when parsing pepXML #7

Open Owen-Duncan opened 6 years ago

Owen-Duncan commented 6 years ago

Hi, MSFTBX has been great; I've started using it extensively in an analysis pipeline. When parsing pepXML, I'd like to retrieve the roc_data_point entries to determine FDRs at given probabilities. When I parse pepXML to an msmsPipelineAnalysis type, the ROC data doesn't seem to be present, though RocErrorData types are in the library. I'm using interprophet analysis on TPP 5.0.

chhh commented 6 years ago

@Owen-Duncan I've looked into this, and here's what I've found. The pepXML schema doesn't specify where elements such as peptideprophet_summary should go, i.e. which elements can contain them. It does, however, describe what peptideprophet_summary is, which is why you see RocData... and friends in MSFTBX.

This means the automatic parser has no way of knowing where to expect peptideprophet_summary, so it never parses it on its own. You can, however, manually point a parser at that block of XML and parse it. Below is a code snippet for that; the complete example further down prints all ROC info from a file.

// prepare the input stream
final XMLStreamReader xsr = JaxbUtils.createXmlStreamReader(p, false);
// advance the input stream to the beginning of <peptideprophet_summary>
final boolean foundPepProphSummary = XmlUtils.advanceReaderToNext(xsr, "peptideprophet_summary");
if (!foundPepProphSummary)
    throw new IllegalStateException("Could not advance the reader to the beginning of a peptideprophet_summary tag.");

// unmarshal
final PeptideprophetSummary ps = JaxbUtils.unmarshal(PeptideprophetSummary.class, xsr);

Make sure you're using MSFTBX v1.6.1 (it's on Maven Central now), there were a few fixes introduced.
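
To illustrate what "advancing the reader" to a tag amounts to, here is a rough sketch that uses only the JDK's StAX API. The helper name `advanceTo` and the sample XML are made up for illustration; this is not the MSFTBX implementation, just the general idea of skipping events until a start tag with the wanted name shows up.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxAdvanceSketch {

    /**
     * Hypothetical helper: consumes events until the reader sits on a start
     * tag whose local name equals tagName. Returns false if the document ends
     * first. Note: it always moves forward, so a tag the reader is already
     * positioned on is not counted as a match.
     */
    static boolean advanceTo(XMLStreamReader xsr, String tagName) throws XMLStreamException {
        while (xsr.hasNext()) {
            if (xsr.next() == XMLStreamConstants.START_ELEMENT
                    && tagName.equals(xsr.getLocalName())) {
                return true;
            }
        }
        return false;
    }

    /** Convenience for the demo: builds a StAX reader over an in-memory string. */
    static XMLStreamReader readerFor(String xml) throws XMLStreamException {
        return XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
    }

    public static void main(String[] args) throws XMLStreamException {
        // Made-up miniature document, only loosely shaped like pepXML.
        String xml = "<msms_pipeline_analysis><analysis_summary>"
                + "<peptideprophet_summary/>"
                + "</analysis_summary></msms_pipeline_analysis>";

        System.out.println(advanceTo(readerFor(xml), "peptideprophet_summary")); // tag is present
        System.out.println(advanceTo(readerFor(xml), "interprophet_summary"));   // tag is absent
    }
}
```

Once the reader is positioned on the start tag, a JAXB unmarshaller can be handed that reader and will consume just that element, which is the pattern the snippet above relies on.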

I know this is far from optimal, but I never noticed the issue because nobody had needed to access that portion of the file. It's unfortunate that the pepXML XSD schema is flawed. Here's a complete example:

public static void main(String[] args) throws Exception {

        // input file
        String pathIn = args[0];
        Path p = Paths.get(pathIn).toAbsolutePath();
        if (!Files.exists(p))
            throw new IllegalArgumentException("File doesn't exist: " + p.toString());

        //////////////////////////////////
        //
        //      Relevant part start
        //
        //////////////////////////////////

        // prepare the input stream
        final XMLStreamReader xsr = JaxbUtils.createXmlStreamReader(p, false);
        // advance the input stream to the beginning of <peptideprophet_summary>
        final boolean foundPepProphSummary = XmlUtils.advanceReaderToNext(xsr, "peptideprophet_summary");
        if (!foundPepProphSummary)
            throw new IllegalStateException("Could not advance the reader to the beginning of a peptideprophet_summary tag.");

        // unmarshal
        final PeptideprophetSummary ps = JaxbUtils.unmarshal(PeptideprophetSummary.class, xsr);

        //////////////////////////////////
        //
        //      Relevant part end
        //
        //////////////////////////////////

        // use the unmarshalled object
        StringBuilder sb = new StringBuilder();
        sb.append("Input files:");
        for (InputFileType inputFile : ps.getInputfile()) {
            sb.append("\n\t").append(inputFile.getName());
            if (!StringUtils.isNullOrWhitespace(inputFile.getDirectory()))
                sb.append(" @ ").append(inputFile.getDirectory());
        }
        for (RocErrorDataType rocErrorData : ps.getRocErrorData()) {
            sb.append("\n");
            sb.append(String.format("ROC Error data (charge '%s'): \n", rocErrorData.getCharge()));
            // roc_data_points
            for (RocDataPoint rocDataPoint : rocErrorData.getRocDataPoint()) {
                sb.append(String.format("ROC min_prob=\"%.3f\" sensitivity=\"%.3f\" error=\"%.3f\" " +
                                "num_corr=\"%d\" num_incorr=\"%d\"\n",
                        rocDataPoint.getMinProb(), rocDataPoint.getSensitivity(), rocDataPoint.getError(),
                        rocDataPoint.getNumCorr(), rocDataPoint.getNumIncorr()));
            }
            // error_points
            for (ErrorPoint errorPoint : rocErrorData.getErrorPoint()) {
                sb.append(String.format("ERR error=\"%.3f\" min_prob=\"%.3f\" num_corr=\"%d\" num_incorr=\"%d\"\n",
                        errorPoint.getError(), errorPoint.getMinProb(), errorPoint.getNumCorr(), errorPoint.getNumIncorr()));
            }
        }

        System.out.println(sb.toString());
    }
Owen-Duncan commented 6 years ago

Thank you! that worked perfectly.

For anyone following along, I needed to make two modifications to the code:

XmlUtils.advanceReaderToNextRunSummary

and

JaxbUtils.unmarshall

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.stream.XMLStreamReader;

public class JAXBPEPXMLFDR {
    public static void main(String[] args) throws Exception{
        // input file
        String pathIn = args[0];
        Path p = Paths.get(pathIn).toAbsolutePath();
        if (!Files.exists(p))
            throw new IllegalArgumentException("File doesn't exist: " + p.toString());
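        // prepare the input stream
        final XMLStreamReader xsr = JaxbUtils.createXmlStreamReader(p, false);
        // advance the reader to the beginning of <interprophet_summary>
        final boolean foundIProphSummary = XmlUtils.advanceReaderToNextRunSummary(xsr, "interprophet_summary");
        if (!foundIProphSummary)
            throw new IllegalStateException("Could not advance the reader to the beginning of an interprophet_summary tag.");
        // unmarshal
        final InterprophetSummary ps = JaxbUtils.unmarshall(InterprophetSummary.class, xsr);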
        // use the unmarshalled object
        StringBuilder sb = new StringBuilder();
        sb.append("Input files:");
        for (InputFileType inputFile : ps.getInputfile()) {
            sb.append("\n\t").append(inputFile.getName());
            if (!StringUtils.isNullOrWhitespace(inputFile.getDirectory()))
                sb.append(" @ ").append(inputFile.getDirectory());
        }
        for (RocErrorDataType rocErrorData : ps.getRocErrorData()) {
            sb.append("\n");
            sb.append(String.format("ROC Error data (charge '%s'): \n", rocErrorData.getCharge()));
            // roc_data_points
            for (RocDataPoint rocDataPoint : rocErrorData.getRocDataPoint()) {
                sb.append(String.format("ROC min_prob=\"%.3f\" sensitivity=\"%.3f\" error=\"%.3f\" " +
                                "num_corr=\"%d\" num_incorr=\"%d\"\n",
                        rocDataPoint.getMinProb(), rocDataPoint.getSensitivity(), rocDataPoint.getError(),
                        rocDataPoint.getNumCorr(), rocDataPoint.getNumIncorr()));
            }
            // error_points
            for (ErrorPoint errorPoint : rocErrorData.getErrorPoint()) {
                sb.append(String.format("ERR error=\"%.3f\" min_prob=\"%.3f\" num_corr=\"%d\" num_incorr=\"%d\"\n",
                        errorPoint.getError(), errorPoint.getMinProb(), errorPoint.getNumCorr(), errorPoint.getNumIncorr()));
            }
        }
        System.out.println(sb.toString());
    }
}
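
For anyone who wants to go from the printed ROC points to a probability cutoff for a target FDR, here is a minimal sketch. It assumes that each roc_data_point's error value is the estimated error rate among identifications at or above its min_prob; please check that against your own TPP output. `RocPoint` is a hypothetical stand-in class, not an MSFTBX type, and the numbers in `main` are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

final class FdrCutoffSketch {

    /** Hypothetical holder for the two values of interest from a roc_data_point. */
    static final class RocPoint {
        final double minProb;
        final double error;

        RocPoint(double minProb, double error) {
            this.minProb = minProb;
            this.error = error;
        }
    }

    /**
     * Returns the lowest min_prob among the points whose error does not exceed
     * targetError, or an empty OptionalDouble if no point qualifies.
     */
    static OptionalDouble lowestMinProbFor(List<RocPoint> points, double targetError) {
        return points.stream()
                .filter(pt -> pt.error <= targetError)
                .mapToDouble(pt -> pt.minProb)
                .min();
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from a real pepXML file.
        List<RocPoint> points = Arrays.asList(
                new RocPoint(0.99, 0.002),
                new RocPoint(0.95, 0.008),
                new RocPoint(0.90, 0.015),
                new RocPoint(0.50, 0.060));

        // Lowest probability cutoff that keeps the estimated error at or below 1%.
        System.out.println(lowestMinProbFor(points, 0.01));
    }
}
```

With the made-up points above this prints `OptionalDouble[0.95]`. In a real pipeline the list would be filled from the unmarshalled roc_data_point entries of the charge state of interest, and an empty result means none of the tabulated cutoffs reaches the target.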
chhh commented 6 years ago

@Owen-Duncan In 1.6.1 I renamed those methods to better reflect what they do. Glad it's working for you.