MannLabs / alphapept

A modular, python-based framework for mass spectrometry. Powered by nbdev.
https://mannlabs.github.io/alphapept/
Apache License 2.0
168 stars 29 forks source link

MS1_scans and MS2_scans in .ms_data.hdf don't retain scan structure #485

Closed tommasomari closed 2 years ago

tommasomari commented 2 years ago

Describe the bug Looking at the structure of an .ms_data.hdf file, it seems the information on which peaks in MS1_scans or MS2_scans belong to which scan is lost.

To Reproduce Example from my analysis

import tables
hdf = tables.open_file(my_output_file, 'r')
for x in myFile.root.Raw.MS1_scans._f_walknodes():
    print(x)

Output:

/Raw/MS1_scans/indices_ms1 (Array(1174,)) ''
/Raw/MS1_scans/int_list_ms1 (Array(1546349,)) ''
/Raw/MS1_scans/mass_list_ms1 (Array(1546349,)) ''
/Raw/MS1_scans/ms_list_ms1 (Array(1173,)) ''
/Raw/MS1_scans/rt_list_ms1 (Array(1173,)) ''
/Raw/MS1_scans/scan_list_ms1 (Array(1173,)) ''

And similarly for MS2_scans:

for x in myFile.root.Raw.MS2_scans._f_walknodes():
    print(x)

Output:

/Raw/MS2_scans/charge2 (Array(10700,)) ''
/Raw/MS2_scans/indices_ms2 (Array(10701,)) ''
/Raw/MS2_scans/int_list_ms2 (Array(2973805,)) ''
/Raw/MS2_scans/mass_list_ms2 (Array(2973805,)) ''
/Raw/MS2_scans/mono_mzs2 (Array(10700,)) ''
/Raw/MS2_scans/ms_list_ms2 (Array(10700,)) ''
/Raw/MS2_scans/prec_mass_list2 (Array(10700,)) ''
/Raw/MS2_scans/rt_list_ms2 (Array(10700,)) ''
/Raw/MS2_scans/scan_list_ms2 (Array(10700,)) ''

Having int_list_ms and mass_list_ms as simple arrays of mzs and intensity values does not allow to recognise in which scan they belong.

Expected behavior int_list_ms and mass_list_ms could be reported as arrays of arrays, one for each scan.

Screenshots

Screenshot 2022-07-26 at 15 05 32

Example of output when trying to build a dataframe from the HDF5 group to assign the peaks to each scan.

Version (please complete the following information):

straussmaximilian commented 2 years ago

Hi, Thanks for reporting. Indeed the array shapes to not match as we have multiple fragments for each scan. We use the indices to keep track of what belongs to what. Often, we then use np.searchsorted to create a lookup index in the same size, like so: np.searchsorted(indices_, np.arange(len(mass_data)), side='right') - 1.

Below is a code snipped to get from ms_data to a dataframe with scans. Screen Shot 2022-07-26 at 3 56 02 PM

One could potentially have this as a method for the ms_data.hdf to directly read as dataframe, e.g. .read_DDA_ms1_df(). Would this be what you are looking for?

tommasomari commented 2 years ago

Thanks for the quick reply. I hadn't realised the indices were structured this way, now the structure of the hdf is more clear.