Calculate dataset statistics

Added two new features: calc_statistics and check_peaks_overlap.

calc_statistics

A function that calculates statistics on m/z and intensity values. By default, it returns only statistics on the number of peaks per spectrum, computed over all spectra (based on the analysis of the imzML file):

ds_peaks_stats = {'min': 13483, 'median': 40803, '95p': 44600, 'max': 47324}
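For illustration, a summary of this shape can be derived from a list of per-spectrum peak counts. This is a minimal sketch, not the code from this PR: `percentile` and `peak_count_stats` are hypothetical helpers, and the linear-interpolation percentile is an assumption, so the values may differ slightly from what calc_statistics reports.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def peak_count_stats(peak_counts):
    """Summarise the number of peaks per spectrum (hypothetical helper)."""
    return {
        'min': min(peak_counts),
        'median': percentile(peak_counts, 50),
        '95p': percentile(peak_counts, 95),
        'max': max(peak_counts),
    }


# Toy input: one peak count per spectrum.
stats = peak_count_stats([10, 20, 30, 40, 50])
```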

Also, this function accepts two optional parameters:

n_spectrum - the number of randomly selected spectra on which the analysis is performed. The default value is 100.
full - run the analysis on all spectra. For large datasets, this can take a long time. The default value is False.
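One way to picture how these two parameters could interact is the sketch below. It is an assumption, not the PR's implementation: `select_spectrum_indices` and its `seed` argument are hypothetical, and the description does not say which parameter takes precedence when both are given (here `full` wins).

```python
import random


def select_spectrum_indices(total, n_spectrum=100, full=False, seed=None):
    """Pick the indices of the spectra to analyse (hypothetical helper)."""
    # Use every spectrum when requested, or when the dataset is small.
    if full or n_spectrum >= total:
        return list(range(total))
    # Otherwise draw n_spectrum distinct indices at random.
    rng = random.Random(seed)
    return sorted(rng.sample(range(total), n_spectrum))


subset = select_spectrum_indices(total=1000, n_spectrum=100, seed=0)
everything = select_spectrum_indices(total=1000, full=True)
```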

If either of these parameters is set, the function returns additional data. Some of the values are per spectrum, others per dataset:

{
    'mz_min': 50.0002,  // minimum `m/z` value among all spectra
    'mz_max': 1199.976,  // maximum `m/z` value among all spectra
    'mzs_min': [50.00475, 50.003, ... 50.0045],  // minimum `m/z` values for each spectrum selected for analysis
    'mzs_max': [1199.922, 1199.928, ... 1199.91],  // maximum `m/z` values for each spectrum selected for analysis
    'mzs_digitized': [(53, 10895), (52, 10861), ... (1987, 1)],  // pairs of `m/z` value (integer value in Da) and the number of peaks within that range (m/z ± 0.5 Da), totalled over all analyzed spectra
    'ints_min': [4.9351587, 3.075072, ... 3.7348557],  // minimum intensity values for each spectrum selected for analysis
    'ints_50p': [19.123741, 11.147136, ... 13.071995],  // 50th percentile intensity value for each spectrum selected for analysis
    'ints_95p': [54.903645, 36.51648, ... 39.682842],  // 95th percentile intensity value for each spectrum selected for analysis
    'ints_max': [7110.9473, 8336.904, ... 8003.7954],  // maximum intensity values for each spectrum selected for analysis
    'ints_total': [1073460, 700179.3, ... 779535.3],  // total intensity value for each spectrum selected for analysis
    'nonzero_intensity_lengths': [40390, 39074, ... 39162],  // number of peaks that have non-zero intensity for each spectrum selected for analysis
    'nonzero_peaks_percentage': 89.34,  // percentage of peaks that have a non-zero value among all analyzed spectra
}
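To make the per-spectrum versus per-dataset distinction concrete, the sketch below shows how a few of these fields could be aggregated from (m/z, intensity) arrays. It is an illustrative assumption rather than the PR's code: `summarise` is a hypothetical helper, and binning with `int(mz + 0.5)` is just one reading of the "m/z ± 0.5 Da" range.

```python
from collections import Counter


def summarise(spectra):
    """Aggregate a few values over (mzs, ints) pairs (hypothetical helper)."""
    # Per-spectrum values: one entry per analyzed spectrum.
    mzs_min = [min(mzs) for mzs, _ in spectra]
    mzs_max = [max(mzs) for mzs, _ in spectra]
    nonzero_lengths = [sum(1 for v in ints if v != 0) for _, ints in spectra]
    total_peaks = sum(len(ints) for _, ints in spectra)

    # Per-dataset value: assign each peak to the nearest integer Da
    # and count the peaks per bin across all analyzed spectra.
    bins = Counter(int(mz + 0.5) for mzs, _ in spectra for mz in mzs)

    return {
        'mzs_min': mzs_min,
        'mzs_max': mzs_max,
        'mzs_digitized': bins.most_common(),  # most populated bins first
        'nonzero_intensity_lengths': nonzero_lengths,
        'nonzero_peaks_percentage': round(
            100.0 * sum(nonzero_lengths) / total_peaks, 2
        ),
    }


# Toy data: two spectra, each given as (mzs, ints).
spectra = [
    ([50.2, 50.4, 100.6], [1.0, 0.0, 3.0]),
    ([50.1, 101.3], [2.0, 5.0]),
]
result = summarise(spectra)
```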

check_peaks_overlap

This function implements an approach for detecting non-centroided datasets: it compares the distance from each peak to its neighboring peak with the shift obtained by moving the peak by N ppm. The algorithm is described in the "Exclusion of non-centroided datasets" section of the article "METASPACE-ML: Metabolite annotation for imaging mass spectrometry using machine learning". The function returns the percentage of peaks that overlap.
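The idea can be sketched roughly as follows. This is only one interpretation of the description above, not the algorithm from the article or the code in this PR: `overlap_percentage` is a hypothetical helper, and it assumes a peak "overlaps" when shifting it up by `ppm` reaches or passes the next peak.

```python
def overlap_percentage(mzs, ppm):
    """Percentage of peaks whose ppm-shifted position reaches the next peak.

    Illustrative interpretation only, not the pyimzML implementation.
    """
    ordered = sorted(mzs)
    if len(ordered) < 2:
        return 0.0
    overlapping = 0
    for current, following in zip(ordered, ordered[1:]):
        # Shift the current peak up by `ppm` parts per million.
        shifted = current * (1 + ppm * 1e-6)
        if shifted >= following:
            overlapping += 1
    return 100.0 * overlapping / len(ordered)


# Densely sampled (profile-like) peaks versus well-separated peaks.
profile_like = [100.0000, 100.0001, 100.0002, 100.0003]
centroided = [100.0, 100.5, 101.0, 200.0]

profile_overlap = overlap_percentage(profile_like, ppm=3.0)
centroided_overlap = overlap_percentage(centroided, ppm=3.0)
```

Under this reading, a high percentage suggests closely spaced profile-mode points, while a value near zero is what one would expect from centroided data.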

Steps

from pyimzml.ImzMLParser import ImzMLParser

file_path = '/home/ubuntu/dataset_01.imzML'
parser = ImzMLParser(file_path)

dataset_statistics = parser.calc_statistics()  # base statistics
dataset_statistics = parser.calc_statistics(n_spectrum=500)  # calculation of statistics based on 500 spectra
dataset_statistics = parser.calc_statistics(full=True)  # all spectra are used to calculate statistics

ppm = 3.0  # or another value, depending on the dataset
overlap = parser.check_peaks_overlap(ppm=ppm, n_spectrum=500)
