MS1_scans and MS2_scans in .ms_data.hdf don't retain scan structure

tommasomari commented 2 years ago

Describe the bug Looking at the structure of an .ms_data.hdf file, it seems the information on which peaks in MS1_scans or MS2_scans belong to which scan is lost.

To Reproduce Example from my analysis

import tables
hdf = tables.open_file(my_output_file, 'r')
for x in myFile.root.Raw.MS1_scans._f_walknodes():
    print(x)

Output:

/Raw/MS1_scans/indices_ms1 (Array(1174,)) ''
/Raw/MS1_scans/int_list_ms1 (Array(1546349,)) ''
/Raw/MS1_scans/mass_list_ms1 (Array(1546349,)) ''
/Raw/MS1_scans/ms_list_ms1 (Array(1173,)) ''
/Raw/MS1_scans/rt_list_ms1 (Array(1173,)) ''
/Raw/MS1_scans/scan_list_ms1 (Array(1173,)) ''

And similarly for MS2_scans:

for x in myFile.root.Raw.MS2_scans._f_walknodes():
    print(x)

Output:

/Raw/MS2_scans/charge2 (Array(10700,)) ''
/Raw/MS2_scans/indices_ms2 (Array(10701,)) ''
/Raw/MS2_scans/int_list_ms2 (Array(2973805,)) ''
/Raw/MS2_scans/mass_list_ms2 (Array(2973805,)) ''
/Raw/MS2_scans/mono_mzs2 (Array(10700,)) ''
/Raw/MS2_scans/ms_list_ms2 (Array(10700,)) ''
/Raw/MS2_scans/prec_mass_list2 (Array(10700,)) ''
/Raw/MS2_scans/rt_list_ms2 (Array(10700,)) ''
/Raw/MS2_scans/scan_list_ms2 (Array(10700,)) ''

Having int_list_ms and mass_list_ms as simple arrays of mzs and intensity values does not allow to recognise in which scan they belong.

Expected behavior int_list_ms and mass_list_ms could be reported as arrays of arrays, one for each scan.

Screenshots

Example of output when trying to build a dataframe from the HDF5 group to assign the peaks to each scan.

Version (please complete the following information):

OS: iOS 12.4
Version: 0.4.7
Installation Type: stable + gui-stable from command line

straussmaximilian commented 2 years ago

Hi, Thanks for reporting. Indeed the array shapes to not match as we have multiple fragments for each scan. We use the indices to keep track of what belongs to what. Often, we then use np.searchsorted to create a lookup index in the same size, like so: np.searchsorted(indices_, np.arange(len(mass_data)), side='right') - 1.

Below is a code snipped to get from ms_data to a dataframe with scans. Screen Shot 2022-07-26 at 3 56 02 PM

One could potentially have this as a method for the ms_data.hdf to directly read as dataframe, e.g. .read_DDA_ms1_df(). Would this be what you are looking for?

tommasomari commented 2 years ago

Thanks for the quick reply. I hadn't realised the indices were structured this way, now the structure of the hdf is more clear.

MannLabs / alphapept

MS1_scans and MS2_scans in .ms_data.hdf don't retain scan structure #485