SOI detection on one file

jorainer commented 3 years ago

Is there a (base) function that does the SOI detection for one file @RogerGinBer ? If so, what input does it require and what does it return?

It would be ideal if this function took a list of m/z values as input (along with the intensities? or the retention times?) plus whatever other data you need, and returned the SOIs, similar to the peak detection functions in here (although they are not that well implemented...).

RogerGinBer commented 3 years ago

I don't have it implemented as a single function at the moment; it uses two instead. First, you'd process the mzML file(s) with processMS1(), which generates a table of annotated data points (we call it PL or Peak List). This table basically has the following columns: mz, rt, intensity (rtiv), ionic formula and isotope.

Then, you'd run findSOI(), which takes the PL and detects the SOIs for each annotation in the PL (it doesn't use the m/z, but instead it matches the formv column). We went with this design because we usually generate multiple sets of SOIs from the same PL objects (for instance, with and without blank subtraction) and thought a one-step process would be too rigid to set up.
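To make the two-step idea concrete, here is a minimal sketch, not the RHermes implementation. It builds a toy PL using the column names mentioned in this thread (mz, rt, rtiv, formv) and groups the points of each annotation into retention-time runs. The helper `toy_soi`, its `max_gap` parameter, the output column names and all values are assumptions made up for illustration.

```r
# Toy Peak List (PL): one row per annotated data point.
# Column names follow the thread; the values are invented.
pl <- data.frame(
    mz    = c(181.0707, 181.0708, 181.0706, 181.0707, 203.0526, 203.0527),
    rt    = c(100, 101, 102, 150, 100, 101),
    rtiv  = c(5e4, 8e4, 6e4, 2e4, 1e4, 1.5e4),
    formv = c("C6H13O6+", "C6H13O6+", "C6H13O6+", "C6H13O6+",
              "C6H12O6Na+", "C6H12O6Na+"),
    stringsAsFactors = FALSE
)

# Hypothetical helper (NOT RHermes::findSOI): for each annotation (formv),
# split the points into runs whose consecutive rt gap is at most `max_gap`.
toy_soi <- function(pl, max_gap = 5) {
    per_formula <- lapply(split(pl, pl$formv), function(x) {
        x <- x[order(x$rt), ]
        # A new run starts wherever the gap to the previous point is too large
        run <- cumsum(c(TRUE, diff(x$rt) > max_gap))
        do.call(rbind, lapply(split(x, run), function(r) {
            data.frame(formula = r$formv[1], start = min(r$rt),
                       end = max(r$rt), npoints = nrow(r),
                       maxint = max(r$rtiv), stringsAsFactors = FALSE)
        }))
    })
    out <- do.call(rbind, per_formula)
    rownames(out) <- NULL
    out
}

sois <- toy_soi(pl)
```

Because the grouping only looks at formv and rt, the same PL can be reused with different settings (or after blank subtraction) without re-annotating the raw data, which is the flexibility the two-function design is after.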

I'm thinking I could encapsulate the whole SOI generation into one function: mzML files + formula DB + adduct list go in ----> SOI list for each file comes out (without blank subtraction, which we could perform later (?)). Or, to match the abstraction level of do_findChromPeaks_centWave, I could take the m/z, rt and intensity of each individual file as input, no problem.

So conceptually we'd have something like:

bigFunction <- function(files, formulaDB, adductList) {
    lapply(files, function(f) {
        pl <- calculatePL(f, formulaDB, adductList)  # Annotate the data points
        soi <- findSOI(pl, 1)                        # SOI detection
        return(soi)
    })
}

Btw, you can find a pre-processed RHermesExp object in the RHermes package, so you can see what the structures look like:

test <- readRDS(system.file("extdata/exampleObject.rds", package = "RHermes"))

It would be particularly interesting if you could look into the SOI list, to see which column names we'd have to adapt, etc. (also, sorry for the lack of an S4 accessor for SOIList within a SOI object):

sois <- RHermes::SOI(test, 1)@SOIList

jorainer commented 3 years ago

Thanks for the detailed explanation!

IMHO it was a good approach to keep the functionality in two separate functions - easier to maintain etc. So, that looks nice. I think it would be even better if you could split calculatePL into two functions: one that takes a file as input and reads the m/z values from it, and a second one that receives that data and actually calculates the PL.

The reason: in xcms we use the OnDiskMSnExp to represent the MS data. So, you first read your full experiment and can then subset and filter that object before running the peak detection on it. That's basically how we do it: first read the data, do some exploratory analyses, restrict to a retention time range in which we have signal and then call the peak detection on it. If we now passed the file name to calculatePL, the PL would be calculated on the full data, even though we had subsetted the data set before.

So, what I would suggest:

calculatePL <- function(file, ...) {
  ## read the input file and pass the required imported data to the next function
  calculatePLmz(...)
}

calculatePLmz <- function(mz, ...) {
  ## calculate the PL on the m/z values from the file
}

What data do you exactly need to calculate the PL? It would not be a problem to extract just the data you need from the OnDiskMSnExp and then pass that to your function. Also, I think it would be good to do this on a per-file basis, as this would allow efficient parallel processing of the data.
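As a rough sketch of what "extract just the data you need" could look like, the helper below flattens per-spectrum m/z and intensity lists plus one retention time per spectrum into a long table of data points. The assumption here is that, for one file of an OnDiskMSnExp, these inputs would come from the MSnbase accessors mz(), intensity() and rtime() after filterFile(); the helper name `spectra_to_points` and the toy values are hypothetical.

```r
# Hypothetical helper: flatten per-spectrum m/z and intensity lists plus one
# retention time per spectrum into a long table of raw data points.
spectra_to_points <- function(mz_list, int_list, rt) {
    stopifnot(length(mz_list) == length(int_list),
              length(mz_list) == length(rt),
              all(lengths(mz_list) == lengths(int_list)))
    data.frame(
        mz   = unlist(mz_list, use.names = FALSE),
        # Repeat each spectrum's rt once per data point in that spectrum
        rt   = rep(unname(rt), lengths(mz_list)),
        rtiv = unlist(int_list, use.names = FALSE)
    )
}

# Toy input standing in for one file with two spectra (invented values).
mz_list  <- list(c(100.1, 200.2), c(100.1, 150.5, 200.2))
int_list <- list(c(1500, 800), c(2000, 300, 1200))
rt       <- c(10.0, 10.5)

pts <- spectra_to_points(mz_list, int_list, rt)
```

Since the function only sees plain vectors and lists, any subsetting done on the experiment beforehand (retention time range, MS level) carries through automatically.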

I'll have a look at your SOI, thanks for the link.

RogerGinBer commented 3 years ago

So processMS1, which is the user-level function that calculates the PLs (sorry for mixing up names, but calculatePL was more representative of what it does), first does some preprocessing steps on the formula and adduct databases, calculates distinguishable isotopologues, and then generates the PL for each file. Here's what this preprocessing looks like:

This preprocessing is necessary to calculate a PL for each file, since we need:

A list of target M0 m/z to search for in the data (one for each possible annotation)
The raw MS1 data (mz, rt and intensity)
The allowed ppm error
A list of isotopes to detect specifically for each annotation
A BiocParallel object used to parallelize the inner function
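To illustrate how the first three ingredients fit together, here is a minimal sketch, not the RHermes code: each raw data point is labelled with every target whose M0 m/z lies within the allowed ppm error. The helper `annotate_points`, the `m0` column name, the default ppm and all values are assumptions; the `min_intensity` default mirrors the 1000-intensity prefilter mentioned in this comment. Isotope detection and parallelization are left out.

```r
# Hypothetical helper (not the RHermes implementation): label each raw data
# point with every target whose M0 m/z lies within `ppm` of the measured m/z.
annotate_points <- function(points, targets, ppm = 5, min_intensity = 1000) {
    # Drop low-intensity signals before matching
    points <- points[points$rtiv >= min_intensity, , drop = FALSE]
    hits <- lapply(seq_len(nrow(targets)), function(i) {
        tol <- targets$m0[i] * ppm / 1e6           # absolute m/z tolerance
        sel <- abs(points$mz - targets$m0[i]) <= tol
        if (!any(sel)) return(NULL)
        x <- points[sel, , drop = FALSE]
        x$formv <- targets$formv[i]
        x
    })
    out <- do.call(rbind, hits)                    # NULL if nothing matched
    rownames(out) <- NULL
    out
}

# Toy targets and raw data points (illustrative values only).
targets <- data.frame(m0 = c(181.0707, 203.0526),
                      formv = c("C6H13O6+", "C6H12O6Na+"),
                      stringsAsFactors = FALSE)
raw <- data.frame(mz   = c(181.0708, 181.0800, 203.0527, 203.0525),
                  rt   = c(100, 100, 101, 102),
                  rtiv = c(5000, 4000, 500, 2000))

annotated <- annotate_points(raw, targets)
```

In this toy input the second point is outside the 5 ppm window and the third is below the intensity cutoff, so only two annotated rows remain.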

Here's what this part looks like (notice that we do it on a per-file-basis and also perform some prefiltering on the raw data by removing <1000 intensity signals):

So I think the adaptation is pretty easy on the RHermes side:

Encapsulate all the preprocessing steps into one function -> Easier to export to XCHermes
Move to an OnDiskMSnExp data structure from which the data would be extracted on a per-file-basis (up until now we were running mzR::openMSfile(), peaks() and header() for each individual file)

This way the XCHermes function would (i) receive an OnDiskMSnExp object (plus formula and adduct DB), (ii) perform the RHermes preprocessing, (iii) extract the required info from each file in the OnDiskMSnExp object and (iv) calculate the corresponding PLs.

Does that sound good? 👍

RogerGinBer commented 3 years ago

@jorainer, I've noticed (as you said) that taking a list of file paths is very restrictive for users who want to prefilter their data, so I've implemented the function findSOIpeaks in the style of findChromPeaks: (i) pass an OnDiskMSnExp object along with some parameters, (ii) extract and annotate the data points of each file within the OnDiskMSnExp, (iii) detect SOIs for each file and (iv) return a SOI dataframe with all the SOIs.
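Steps (iii) and (iv) could be sketched roughly as follows. This is only an illustration of the per-file loop and the stacking of results, with a `sample` column that records the file index (similar in spirit to the sample column of the xcms chromPeaks matrix). `detect_per_file`, `toy_detector` and the toy data are hypothetical and do not reflect the actual findSOIpeaks code.

```r
# Hypothetical driver: run a per-file detector and stack the results into one
# data frame, tagging each row with the index of the file it came from.
detect_per_file <- function(points_by_file, detect_fun, ...) {
    res <- lapply(seq_along(points_by_file), function(i) {
        soi <- detect_fun(points_by_file[[i]], ...)
        if (is.null(soi) || nrow(soi) == 0) return(NULL)
        soi$sample <- i
        soi
    })
    out <- do.call(rbind, res)
    rownames(out) <- NULL
    out
}

# Placeholder detector: one row summarising the rt range of a file's points.
toy_detector <- function(pts) {
    data.frame(start = min(pts$rt), end = max(pts$rt), npoints = nrow(pts))
}

# Toy per-file data point tables (invented values).
points_by_file <- list(
    data.frame(mz = c(100.1, 100.1), rt = c(10, 11), rtiv = c(2000, 3000)),
    data.frame(mz = c(100.1, 100.1, 100.1), rt = c(20, 21, 22),
               rtiv = c(1500, 2500, 1800))
)

soi_table <- detect_per_file(points_by_file, toy_detector)
```

Each file is processed independently, so the lapply() call is the natural place to swap in a parallel equivalent such as BiocParallel::bplapply().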

jorainer commented 3 years ago

This sounds very good 👍 ! Let me know if you need any help/information/internals or if I should have a look at the code or try it out.

RogerGinBer commented 3 years ago

So now that detectSOIpeaks is mostly functional, could you have a quick look at it and try processing some test files (like faahKO, the sacurine dataset, or any other you like)? Just to make sure I haven't made any blunders in the process and that it works smoothly 😅
