Exposing traits that allow mz conversion and added selective DIA reading

jspaezp commented 9 months ago

This PR attempts to improve what can be achieved with the library in the DIA space.

It adds the frame msms window struct, which is meant to represent a section of a frame (I envision it being used for all scans that share an isolation window, but there is no reason not to use it in other cases).

It also exposes some of the converters for external use (right now the frame has very little value, since one cannot convert the tof indices to mz values externally; it would be nice to have access to the implementation inside timsrust instead of having to re-implement it).

TODO:

Testing, I could add a dia acquisition file we use for testing, which comes from just recording idle flow for ~5 seconds using a DIA method.
Documentation, let me know if you would like more documentation comments to be added.

Questions:

I noticed that read_all_ms1_frames is defined like this:

    fn read_all_ms1_frames(&self) -> Vec<Frame> {
        (0..self.tdf_bin_reader.size())
            .into_par_iter()
            .map(|index| match self.frame_types[index] {
                FrameType::MS1 => self.read_single_frame(index),
                _ => Frame::default(),
            })
            .collect()
    }

Is there any reason why the frames that are not MS1 are kept as default frames (in contrast to first filtering the indices and returning a vec of only MS1 frames)?

LMK what you think!

sander-willems-bruker commented 8 months ago

Dear @jspaezp, I will need some time to process this PR, as it is slightly bigger and I had already started working on some of these ideas locally. Apologies, but I promise I will look into this sooner rather than later!

sander-willems-bruker commented 8 months ago

With regards to your questions/comments:

Yes, a DIA testing dataset would be good! If possible, I would like one that is as small as possible and that we can fully control. While a 5s idle flow is small, it does not allow us to manually include test cases. If possible, could you look into my (very poorly and quickly written) simulator to create a minimal example?
More documentation is always welcome, but I am very poor at doing this myself and will not be a hypocrite by asking you for more ;). That said, I am a firm believer in minimal documentation, i.e. well-written code should in itself serve as documentation. This is primarily because of my own lazy tendencies: I do not maintain documentation well, and it is often outdated (and thus incorrect and misleading) after I implement something new or refactor it.
The fact that frames are kept as default (i.e. Unknown or None) in that example means they are not being read from the binary. This speeds things up, since roughly half the data is MS1 and the other half is MS2. Notice that this is implemented comparably for read_all_ms2_frames. The option to read all frames is still available in case you definitely need both. To get to your actual question: by using empty frames, no search is needed anymore, because the vec is indexed directly by frame index and every frame is thus accessible in O(1). RAM concerns about empty frames are irrelevant given the amount of data per (non-empty) frame.
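
To make the trade-off concrete, here is a minimal sketch of the two strategies. The `Frame`, `FrameType` and `read_single_frame` definitions below are simplified, hypothetical stand-ins (not the actual timsrust types), and the iteration is sequential rather than parallel to keep the example dependency-free.

```rust
// Hypothetical, simplified stand-ins for the library types (illustration only).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FrameType {
    MS1,
    MS2,
    Unknown,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Frame {
    pub index: usize,
    pub frame_type: FrameType,
    pub intensities: Vec<u32>,
}

impl Default for Frame {
    fn default() -> Self {
        // An empty Vec does not allocate, so a placeholder frame is cheap.
        Frame { index: 0, frame_type: FrameType::Unknown, intensities: Vec::new() }
    }
}

// Stand-in for the expensive step of decoding a frame from the binary.
fn read_single_frame(index: usize, frame_type: FrameType) -> Frame {
    Frame { index, frame_type, intensities: vec![index as u32; 3] }
}

/// Placeholder strategy: one slot per frame, so `result[i]` is frame `i`.
/// Non-MS1 slots hold a default frame and are never decoded.
pub fn read_ms1_with_placeholders(frame_types: &[FrameType]) -> Vec<Frame> {
    frame_types
        .iter()
        .enumerate()
        .map(|(index, &ft)| match ft {
            FrameType::MS1 => read_single_frame(index, ft),
            _ => Frame::default(),
        })
        .collect()
}

/// Filtering strategy: only MS1 frames are returned, so the position in the
/// result no longer equals the frame index.
pub fn read_ms1_filtered(frame_types: &[FrameType]) -> Vec<Frame> {
    frame_types
        .iter()
        .enumerate()
        .filter_map(|(index, &ft)| {
            (ft == FrameType::MS1).then(|| read_single_frame(index, ft))
        })
        .collect()
}
```

With the placeholder version, `frames[i]` can be used directly for any frame index `i`; with the filtered version, the caller has to search for (or separately map) the frame whose `index` field matches.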

jspaezp commented 8 months ago

> Dear @jspaezp, I will need some time to process this PR, as it is slightly bigger and I had already started working on some of these ideas locally. Apologies, but I promise I will look into this sooner rather than later!

I see! Are you planning on having some form of "public roadmap" or "help wanted" issues? I would be glad to help implement some things, and it would be great to be on the same page.

> The fact that frames are kept as default (i.e. Unknown or None) in that example means they are not being read from the binary. This speeds things up, since roughly half the data is MS1 and the other half is MS2. Notice that this is implemented comparably for read_all_ms2_frames. The option to read all frames is still available in case you definitely need both. To get to your actual question: by using empty frames, no search is needed anymore, because the vec is indexed directly by frame index and every frame is thus accessible in O(1). RAM concerns about empty frames are irrelevant given the amount of data per (non-empty) frame.

That makes a lot of sense! (I wonder how it would compare performance-wise with a hashmap[index -> frame], since that would not contain all the empty frames.)
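
For reference, a rough sketch of what the hashmap alternative could look like next to the dense vec. The `Frame` struct here is a hypothetical, simplified stand-in, and no benchmark is implied: both lookups are O(1) on average, but the hashmap pays for hashing on every access, while the dense vec pays `size_of::<Frame>()` bytes per placeholder.

```rust
use std::collections::HashMap;
use std::mem::size_of;

// Hypothetical, simplified frame type (illustration only).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Frame {
    pub index: usize,
    pub intensities: Vec<u32>,
}

/// Dense layout: the slice position is the frame index.
pub fn dense_lookup(frames: &[Frame], index: usize) -> Option<&Frame> {
    frames.get(index)
}

/// Sparse layout: only the frames that were actually read, keyed by index.
pub fn build_sparse(frames: Vec<Frame>) -> HashMap<usize, Frame> {
    frames.into_iter().map(|frame| (frame.index, frame)).collect()
}

/// Memory spent on placeholders in the dense layout. A default frame holds an
/// empty Vec, which does not allocate, so the cost is just the struct itself.
pub fn placeholder_bytes(n_placeholders: usize) -> usize {
    n_placeholders * size_of::<Frame>()
}
```

Usage would be `build_sparse(frames).get(&index)` versus `dense_lookup(&frames, index)`; which one is faster in practice would need measuring on real data.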

I will work on generating the test data for DIA.

sander-willems-bruker commented 1 week ago

Took me long enough, but most of this should be accessible now in the dev branch. I still need to update the docs and do proper error propagation, but the essentials are there.

sander-willems-bruker commented 1 week ago

@jspaezp, timsrust 0.3.0 is now available.

The FrameReader should give you all the info you need. Note that it now has a rather convenient parallel_filter function (https://github.com/MannLabs/timsrust/blob/main/src/io/readers/frame_reader.rs#L77), which should provide very efficient access to frames without much wasted CPU/RAM usage.
Note that there are no Unknown frames coming out of the parallel filter anymore, so reindexing based on frame.index might be needed.
The MetadataReader should give quick access to all converters.
Also note that 0.3.0 is a transient version that needed to be published, even though it is still far from production ready. A new version will probably follow soon, which should have better error propagation and might change the readers.
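
One possible way to do the reindexing mentioned above is sketched below. It does not call the timsrust API: `Frame` is a hypothetical stand-in with only an `index` field and some data, the helper names are made up for illustration, and frame indices are assumed to be 0-based and smaller than `total_frames`.

```rust
// Hypothetical, simplified frame type (illustration only).
#[derive(Clone, Debug, PartialEq)]
pub struct Frame {
    pub index: usize,
    pub intensities: Vec<u32>,
}

/// Builds a table where `table[frame.index]` is the position of that frame
/// in the filtered `frames` slice, or `None` if the frame was filtered out.
pub fn build_position_table(frames: &[Frame], total_frames: usize) -> Vec<Option<usize>> {
    let mut table = vec![None; total_frames];
    for (position, frame) in frames.iter().enumerate() {
        // Skip indices outside the assumed range instead of panicking.
        if frame.index < total_frames {
            table[frame.index] = Some(position);
        }
    }
    table
}

/// O(1) lookup of a frame by its original index via the position table.
pub fn frame_by_index<'a>(
    frames: &'a [Frame],
    table: &[Option<usize>],
    index: usize,
) -> Option<&'a Frame> {
    table
        .get(index)
        .copied()
        .flatten()
        .map(|position| &frames[position])
}
```

The table costs one `Option<usize>` per original frame, which restores direct indexing without keeping placeholder frames around.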

MannLabs / timsrust

Exposing traits that allow mz conversion and added selective DIA reading #7