chhh / MSFTBX

MS File ToolBox - tools for parsing some mass-spectrometry related file formats (mzML, mzXML, pep.xml, prot.xml, etc.)
Apache License 2.0

Memory issue when parsing big mzml file. #9

Open KenWifi opened 6 years ago

KenWifi commented 6 years ago

Hi,

Over the last few days I tried to parse a 2 GB mzML file with the following code:
`MZMLFile mzmlFile = new MZMLFile(spectrumFiles.get(0).getAbsolutePath());
ScanCollectionDefault scans = new ScanCollectionDefault();
scans.setDefaultStorageStrategy(StorageStrategy.SOFT);
scans.isAutoloadSpectra(true);
scans.setDataSource(mzmlFile);
mzmlFile.setNumThreadsForParsing(threads);
try {
    scans.loadData(LCMSDataSubset.MS1_WITH_SPECTRA);
    scans.loadData(LCMSDataSubset.MS2_WITH_SPECTRA);
} catch (FileParsingException e) {
    e.printStackTrace();
    System.exit(1);
}`

Memory usage kept growing to 4 GB and the run ended with a memory issue. BatMass had the same problem. Do you have any experience with parsing large files?

Kai

chhh commented 5 years ago

Didn't see the question here originally, but will still leave an answer.

You're trying to load the whole file into memory. The original file might be quite well compressed with gzip or MsNumpress, so the whole file might take up significantly more space in memory than it does on disk. First, of course, try loading only MS1 or only MS2. If that doesn't help, there's another way, which is slower than the standard mode but won't use much memory for any file size:

try (final MZMLFile mzml = new MZMLFile("path-to-mzml")) {
    // Create data source with auto-loading of spectra set
    IScanCollection scans = new ScanCollectionDefault(true);
    scans.setDataSource(mzml);
    // Only load the data structure (i.e. scan meta-data) without spectra.
    // Set StorageStrategy to SOFT - will allow garbage collector to reclaim spectra
    // that are dangling in memory but not being used.
    scans.loadData(LCMSDataSubset.STRUCTURE_ONLY, StorageStrategy.SOFT);

    TreeMap<Integer, IScan> index = scans.getMapNum2scan();
    for (Entry<Integer, IScan> e : index.entrySet()) {
        IScan scan = e.getValue();
        // You need to use `fetchSpectrum()`, because the spectrum might have been
        // garbage collected
        ISpectrum spectrum = scan.fetchSpectrum();
    }
}
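
To illustrate why the loop above goes through `fetchSpectrum()`, here is a minimal JDK-only sketch of the general idea behind soft storage with reloading on demand. It is not MSFTBX code and says nothing about how the library implements this internally: `SoftSpectrumDemo`, `SoftlyCached`, `peek()`, `fetch()`, `simulateReclaim()`, and `loadCount()` are hypothetical names made up for the example, and the loader lambda stands in for re-parsing a scan from the file.

```java
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from MSFTBX.
final class SoftSpectrumDemo {

    /** Holds a value through a SoftReference and re-creates it when it is gone. */
    static final class SoftlyCached<T> {
        private final Supplier<T> loader; // stands in for "re-parse this scan from disk"
        private SoftReference<T> ref = new SoftReference<>(null);
        private int loads = 0;

        SoftlyCached(Supplier<T> loader) {
            this.loader = loader;
        }

        /** Plain getter: returns null if nothing is loaded or the value was reclaimed. */
        T peek() {
            return ref.get();
        }

        /** Fetching getter: reloads through the loader when the value is missing. */
        T fetch() {
            T value = ref.get();
            if (value == null) {
                value = loader.get();
                ref = new SoftReference<>(value);
                loads++;
            }
            return value;
        }

        /** Clears the reference by hand so the demo does not depend on real GC timing. */
        void simulateReclaim() {
            ref.clear();
        }

        int loadCount() {
            return loads;
        }
    }

    public static void main(String[] args) {
        SoftlyCached<double[]> intensities =
                new SoftlyCached<>(() -> new double[] {100.1, 200.2, 300.3});

        System.out.println("peek() before any fetch is null: " + (intensities.peek() == null));

        double[] first = intensities.fetch();
        System.out.println("values loaded by fetch(): " + first.length);

        intensities.simulateReclaim();
        System.out.println("peek() after reclaim is null: " + (intensities.peek() == null));

        double[] again = intensities.fetch();
        System.out.println("values after reload: " + again.length
                + ", loader calls so far: " + intensities.loadCount());
    }
}
```

The trade-off is the one mentioned above: every reload costs another read, so this pattern trades speed for a bounded memory footprint.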