MannLabs / timsrust

Timsrust sage error #15

Open tomthun opened 4 months ago

tomthun commented 4 months ago

TimsTOF .d directories are now natively supported by Sage (https://github.com/lazear/sage/issues/117#issuecomment-1928514516), using timsrust to load them. I just ran some quick tests on some test data and got the following:

(base) PS D:\Data\tools\SAGE> sage .\current_config.json
[2024-02-09T11:03:06Z INFO sage] generated 111120583 fragments, 5992882 peptides in 7285ms
[2024-02-09T11:03:06Z INFO sage] processing files 0 .. 1
thread 'main' panicked at C:\Users\runneradmin\.cargo\registry\src\index.crates.io-6f17d22bba15001f\timsrust-0.2.0\src\file_readers\common\sql_reader.rs:30:62:
called `Result::unwrap()` on an `Err` value: SqliteFailure(Error { code: Unknown, extended_code: 1 }, Some("incomplete input"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I use the current_config.json for Sage. The developer of Sage suggests that this is an issue in the timsrust library, hence I wanted to ask here whether there is a solution.

Edit: The raw data can be found here. Sorry, I had the wrong settings for the file, but it should now work.

sander-willems-bruker commented 4 months ago

Dear @tomthun. This seems to be diaPASEF data rather than ddaPASEF. You are correct that this is indeed not yet supported. We might look into it in the future, but please note that this is not a trivial task, as the data format is vastly different.

tomthun commented 4 months ago

@sander-willems-bruker Thank you very much for looking into this. Support for diaPASEF would be much appreciated!

Edit: Do you have a rough estimate of when it could be available?

jspaezp commented 4 months ago

@tomthun I am not 100% sure what you mean by that. Reading diaPASEF is supported in timsrust (albeit with some limitations and at a reasonably low level). If, on the other hand, you are asking for diaPASEF support in Sage, I am unsure whether there are any plans in the near future to support DIA data in general in Sage.

Would you mind elaborating?

tomthun commented 4 months ago

I actually thought Sage would already support DIA (only roughly, but with the chimeric deconvolution and dynamic tolerance features I thought it would be possible). It is also listed among the features.

I am currently a bit confused about what is actually supported and what is not. To me it seems that DIA is generally supported. Maybe @lazear could elaborate.

I would be very happy if DIA got full support, if possible! :)

lazear commented 4 months ago

There is nothing preventing you from theoretically converting your diaPASEF files to mzML/MGF and searching them with Sage - Sage has no concept of PASEF internally though, so I don't know how the results will look. You can also go and search Thermo/Agilent/etc data in DIA/WWA mode... Bruker is just a bit special šŸ˜‰

Currently, only reading ddaPASEF .d files is supported, since they can be easily converted to the same internal representation of a spectrum as data from the other vendors.

jspaezp commented 4 months ago

Thanks for clarifying @tomthun!

You are right @lazear, when I mentioned supporting DIA I meant more the DIA-Umpire or peptide-centric search side of things (sorry for disregarding your contribution to open search MS šŸ˜‰).

Going into why Bruker is special and why converting ddaPASEF is trivial but diaPASEF is not (correct me if I am wrong; this is more my intuition of the problem than a hard answer): ddaPASEF, by virtue of being targeted in the ion mobility space, can be converted to a regular "spectrum" by just collapsing the mobility dimension, thus generating a "profile" scan that can be centroided. On the other hand, if the ion mobility is not targeted, collapsing the ion mobility dimension leads to a lot of information loss, which makes the conversion less practical (and, in my experience, noisy enough not to be usable in practice).
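For illustration, here is a minimal Rust sketch of the "collapse the mobility dimension" idea (hypothetical code, not timsrust's actual API; the `Peak` struct and the m/z bin width are assumptions): intensities are summed into fixed-width m/z bins while the mobility coordinate is simply ignored, yielding a profile-like trace that could then be centroided.

```rust
// Hypothetical sketch: collapse raw (m/z, mobility, intensity) points into a
// 1D m/z profile by summing intensities into fixed-width m/z bins.
use std::collections::BTreeMap;

struct Peak {
    mz: f64,
    mobility: f64, // 1/K0; deliberately ignored by the collapse below
    intensity: f64,
}

fn collapse_mobility(peaks: &[Peak], mz_bin_width: f64) -> Vec<(f64, f64)> {
    let mut bins: BTreeMap<i64, f64> = BTreeMap::new();
    for p in peaks {
        let bin = (p.mz / mz_bin_width).round() as i64;
        *bins.entry(bin).or_insert(0.0) += p.intensity;
    }
    // Return (bin-center m/z, summed intensity), sorted by m/z.
    bins.into_iter()
        .map(|(bin, intensity)| (bin as f64 * mz_bin_width, intensity))
        .collect()
}
```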

I would love to hear your thoughts on "centroiding approaches" viable for diaPASEF!

[Schematic of "reading diaPASEF data"]

GeorgWa commented 4 months ago

It might come as some surprise, but there are actually packages for this: https://github.com/MannLabs/timspeak for clustering and https://github.com/MannLabs/alphasynchro for pseudo-MS/MS generation. The packages work both for synchroPASEF and diaPASEF. The workflow was never published as part of a paper but is based on very rigorous work from @swillems.

KlemensFroehlich commented 4 months ago

Could you collapse an entire PASEF window into a 2D centroided information table? Basically just create one giant mass spectrum that ultimately adds up hundreds of individual mass scans?

I have no idea what the data structure looks like or whether this would make sense, but from a technological point of view:

I think the information on the ion mobility axis can definitely be compressed (i.e. some information can be lost), as ion mobility just inherently has a very low separation power for different ions compared to the m/z axis...

So I think it would be totally okay to collapse all ions of a PASEF window into just one single MS scan with centroided m/z info and centroided ion mobility info.

This should also drastically reduce file size, correct?
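For concreteness, here is a hypothetical sketch of what one row of such a "2D centroided information table" could be (not how timspeak or timsrust actually do it; the clustering step is assumed to have already happened): each pre-clustered group of raw points is reduced to an intensity-weighted m/z centroid, an intensity-weighted mobility centroid, and a summed intensity.

```rust
// Sketch only: reduce a pre-clustered group of raw points to one table row.
// How the points are clustered in the first place is the hard,
// parameter-heavy part that is not shown here.
struct RawPoint {
    mz: f64,
    mobility: f64,
    intensity: f64,
}

struct CentroidRow {
    mz: f64,        // intensity-weighted mean m/z
    mobility: f64,  // intensity-weighted mean 1/K0
    intensity: f64, // summed intensity
}

fn centroid_cluster(points: &[RawPoint]) -> Option<CentroidRow> {
    let total: f64 = points.iter().map(|p| p.intensity).sum();
    if total <= 0.0 {
        return None;
    }
    let mz = points.iter().map(|p| p.mz * p.intensity).sum::<f64>() / total;
    let mobility = points.iter().map(|p| p.mobility * p.intensity).sum::<f64>() / total;
    Some(CentroidRow { mz, mobility, intensity: total })
}
```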

I am probably misunderstanding something here, so please feel free to ignore me ;) Or is this already happening with the clustering in timspeak? Does timspeak actually support writing output to mzML format?

@lazear

> There is nothing preventing you from theoretically converting your diaPASEF files to mzML/MGF and searching them with Sage - Sage has no concept of PASEF internally though, so I don't know how the results will look.

A 5 minute DIA run is 1.4 GB in size as a .d folder. Converted with standard MSConvert settings it is 25 GB... I could of course activate gzip on top next time, but for a 30 min gradient it would still be a huge file. I am currently running Sage on the 25 GB mzML. Will update later how that looks.

There is also the timsconvert package https://github.com/gtluu/timsconvert which somehow generates smaller mzML files, but when I push those through Sage, it cannot find any precursors with q_val < 0.01. It does find around 25,000 precursors in the 5 min gradient, but all of them have a q_val > 0.11... I guess they also use a custom annotation of ion mobility, which is probably not supported by Sage?

I know this is not a priority of yours, but I would still love to hear your thoughts! I can also share the data if you are interested, of course!

Best, Klemens

Edit: somewhat clearer illustration

jspaezp commented 4 months ago

@KlemensFroehlich I believe we are saying the same thing, same as @GeorgWa. But the point I was trying to make (and apparently didn't convey correctly) was that the exact way in which this centroiding needs to happen is not a trivial decision (there are 19 hyperparameters in the timspeak implementation). I would love to know if @sander-willems-bruker has any guidelines on which method and parameters would be "good enough" for most purposes, or more specifically why the centroiding has been implemented only for ddaPASEF MS2 scans.

best!

GeorgWa commented 4 months ago

Yes @KlemensFroehlich, what you get is in principle a table, which is represented as a list of spectra with defined mobility and RT.

The workflow itself is very robust, but as you note @jspaezp, it's not a full end-to-end search engine. Ideally, there would be a minimal required parameter set, and all other parameters would be optimized in a feedback loop with the search.

Based on my experience, centroiding is quite hard to get right and there are no universal parameters. It usually works best if done as a well-controlled optimization task.
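As an illustration of that "optimization task" idea, here is a toy Rust sketch (purely hypothetical; `centroid` and `search` are stand-ins, not existing APIs) that sweeps a single centroiding tolerance and keeps whichever value yields the most confident identifications:

```rust
// Purely illustrative sketch of the "optimize centroiding parameters in a
// feedback loop with the search" idea. `centroid` and `search` are
// placeholders for real implementations, not existing APIs.
fn tune_tolerance<C, S>(candidates: &[f64], centroid: C, search: S) -> Option<f64>
where
    C: Fn(f64) -> Vec<(f64, f64)>, // m/z tolerance -> centroided spectrum
    S: Fn(&[(f64, f64)]) -> usize, // spectrum -> number of confident IDs
{
    // Keep the tolerance that yields the most confident identifications.
    candidates.iter().copied().max_by_key(|&tol| {
        let spectrum = centroid(tol);
        search(&spectrum)
    })
}
```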

KlemensFroehlich commented 4 months ago

Hi everyone,

thanks for clearing things up for me!

@GeorgWa if you aim for an iterative approach to centroiding based on the search results, this means it is not your intention to "just" centroid (and provide an mzML after doing so). It would rather be a long-term goal to provide an end-to-end solution from .d folder to search / quant results? Can I ask again whether timspeak can export mzML? I only see other formats as output.

Going on a rant here, please again ignore me if I ask very stupid questions: again, I am not an expert, but I can hardly believe that high accuracy for the ion mobility centroiding is needed... I would actually bet that low accuracy in centroiding the ion mobility dimension would only lead to minimal loss of identifications (if any). Ion mobility has a REALLY low resolution / peak capacity.

Just looking at a background signal here of a highly abundant precursor, which roughly spans from 0.66 to 0.76: most relevant peptide ions are located between 0.7 and 1.3 1/K0. Even assuming that a regular peptide only spans 0.05 1/K0, this effectively gives us a peak capacity in the dozens. Chromatography gives us a peak capacity in the hundreds, and mass resolution is in the thousands.
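Back-of-the-envelope, using the numbers above, that estimate works out to roughly:

$$\text{peak capacity} \approx \frac{1.3 - 0.7}{0.05} = 12$$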

So please forgive my naivety when I ask: do you really need 19 hyperparameters, or even an iterative approach, to perfect the centroiding when one dimension has a really low resolution?

Has anyone tried simple binning of the ion mobility dimension? Maybe with overlap of the binning boundaries? This would also provide a potentially super quick approach, would reduce file size tremendously, and would still preserve the ion mobility info.
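A minimal hypothetical sketch of such binning (not MSConvert's or timsrust's implementation; the bin width, overlap, and `Point` struct are assumptions): each mobility bin collects the points that fall inside it, optionally with an overlap margin on both sides, and could then be collapsed into its own spectrum.

```rust
// Hypothetical sketch: split points into (optionally overlapping) ion
// mobility bins, so each bin can later be collapsed into one spectrum.
struct Point {
    mz: f64,
    mobility: f64,
    intensity: f64,
}

fn bin_by_mobility(
    points: &[Point],
    min_mobility: f64,
    bin_width: f64,
    overlap: f64, // extra margin added on both sides of each bin
    n_bins: usize,
) -> Vec<Vec<&Point>> {
    let mut bins: Vec<Vec<&Point>> = vec![Vec::new(); n_bins];
    for (i, bin) in bins.iter_mut().enumerate() {
        let lo = min_mobility + i as f64 * bin_width - overlap;
        let hi = min_mobility + (i as f64 + 1.0) * bin_width + overlap;
        bin.extend(points.iter().filter(|p| p.mobility >= lo && p.mobility < hi));
    }
    bins
}
```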

This seems to be implemented in MSConvert. Is that not sufficient here, or what are your thoughts on this?

Best Klemens

GeorgWa commented 4 months ago

Hi Klemens,

Yes, one would fine-tune the hyperparameters based on confident identifications. I'm speaking from a purely theoretical perspective, though. We are not planning this at the moment.

To my knowledge, timspeak only supports MGF.

I share your intuition on the resolution of the ion mobility dimension. I think the main benefit manifests on the fragment level, where the strong correlation between precursor m/z and mobility is not an issue.

Let's do two things: I will ping Sander and ask him whether there is a way to perform centroiding only on the ion mobility level, not on RT, and without connecting clusters. I'm also happy to discuss the general matter in more detail on Zoom if you like. @jspaezp, @tomthun, or whoever is interested is of course invited to join.

Best, Georg