RECETOX / MFAssignR

The MFAssignR package was designed for multi-element molecular formula (MF) assignment of ultrahigh resolution mass spectrometry measurements. A number of tools for internal mass recalibration, MF assignment, signal-to-noise evaluation, and unambiguous formula selections are provided.
GNU General Public License v3.0

Investigate the calibration functions for which series to use #22

Closed hechth closed 2 weeks ago

hechth commented 9 months ago

RecalList produces a table of outputs with different scores, series identifiers, etc. Those are not that well documented in the source code, so it would be great to get some more information. What are the scores, and what do they represent? What does the "series" identifier actually mean? The user should be able to decide which series to choose based on the data provided in the table. Maybe we could then even build another tool which chooses the best recalibration series based on some higher-level parameters, like the instrumental method or platform?

KristinaGomoryova commented 1 month ago

So, here it's a bit complicated and I think we will have to re-implement the current solution.

RecalList outputs a dataframe containing the CH2 homologous series that have more than 3 members, along with the following metrics:

The Recal function itself can now take up to 10 series.

To choose the optimal ones, the criteria should be:

KristinaGomoryova commented 1 month ago

The main problem is probably the computational capacity. We get 225 series using the Raw_Neg_ML data (the model data of MFAssignR). We want the combination of 10 series out of all of them which best fulfills the criteria - the problem is that picking 10 series out of 225 already gives about 7.48 × 10^16 possible combinations, which is way too many.

I tried to pre-filter the dataset in the following manner:

This reduces the size to 94 series. If we also restrict the Abundance score to > 100, we get 33. This suddenly reduces the number of combinations to 92,561,040, which is much better I would say :)
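The combination counts above can be checked directly with base R's `choose()`:

```r
# Ways to pick 10 series out of the unfiltered 225
choose(225, 10)   # approx. 7.48e16

# Ways to pick 10 out of the 33 series left after filtering
choose(33, 10)    # 92561040
```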

KristinaGomoryova commented 1 month ago

How it could be done eventually...

# Libraries required
library(dplyr)
library(gtools)
library(tidyr)

# Data input
data <- read.delim("recalList.tabular")

# Arrange the data
data.subset <- data %>%
  separate(col = Mass.Range, into = c('Min.Mass.Range', 'Max.Mass.Range'), sep = "-") %>%
  mutate(Min.Mass.Range = as.numeric(Min.Mass.Range), 
         Max.Mass.Range = as.numeric(Max.Mass.Range)) %>%
  mutate(Series.Length = Max.Mass.Range - Min.Mass.Range) %>%
  filter(Abundance.Score > 100) %>%
  filter(Peak.Distance < 2) 

global_min <- min(data.subset$Min.Mass.Range) + 100  # 100 Da tolerance
global_max <- max(data.subset$Max.Mass.Range) - 100  # 100 Da tolerance

# Create all combinations of 5 series (10 would be computationally infeasible here)
iter <- combinations(nrow(data.subset), 5, v = 1:nrow(data.subset))

coversRange <- data.frame(iter, coversRange = 0)

# Check if each combination covers the whole data range
for (i in 1:nrow(iter)) {
  comb <- iter[i, ]
  subset <- data.subset[comb, ]
  if (min(subset$Min.Mass.Range) <= global_min &&
      max(subset$Max.Mass.Range) >= global_max) {
    coversRange$coversRange[i] <- 1
  }
}

# Subset only those combinations which cover the whole range
coversRangeTrue <- coversRange[coversRange$coversRange == 1, ]

# Compute the scores for one combination of series
score_combination <- function(combination) {
  # Sort by the series' minimum mass so the lag-based overlap
  # correction in the coverage computation below works correctly
  combination <- combination[order(combination$Min.Mass.Range), ]
  series <- paste(combination$Series, collapse = "; ")
  total_abundance <- sum(combination$Abundance.Score)
  total_series_length <- sum(combination$Series.Length)
  peak_proximity <- sum(1/(combination$Peak.Score))  # smaller values are better
  peak_distance_proximity <- sum(1/(combination$Peak.Distance - 1))  # closer to 1 is better
  # Mass range covered, with overlaps between consecutive series removed
  coverage <- sum(combination$Max.Mass.Range -
                  pmax(combination$Min.Mass.Range,
                       lag(combination$Max.Mass.Range, default = 0)))
  coverage_percent <- coverage/((global_max + 100) - (global_min - 100))*100
  return(list(
    total_abundance = total_abundance,
    total_series_length = total_series_length,
    peak_proximity = peak_proximity,
    peak_distance_proximity = peak_distance_proximity,
    series = series,
    coverage_percent = coverage_percent
  ))
}

scores <- list()

# Score only the combinations which cover the whole range; the first
# 5 columns of coversRangeTrue hold the series indices of each combination
for (i in 1:nrow(coversRangeTrue)) {
  comb <- as.numeric(coversRangeTrue[i, 1:5])
  subset <- data.subset[comb, ]
  comb_score <- score_combination(subset)
  scores <- append(scores, list(comb_score))
}

scores_df <- do.call(rbind, lapply(scores, as.data.frame))

# Keep combinations with coverage > 90%, compute an overall score,
# arrange from highest to lowest, and take the top 10 combinations
scores_df %>%
  filter(coverage_percent > 90) %>%
  rowwise() %>%
  mutate(sum_score = sum(total_abundance, total_series_length,
                         peak_proximity, peak_distance_proximity,
                         coverage_percent)) %>%
  ungroup() %>%  # drop rowwise grouping so the duplicate filter sees all rows
  arrange(desc(sum_score)) %>%
  distinct(series, .keep_all = TRUE) %>%
  head(10)
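As a sanity check of the coverage logic, here is a self-contained base-R sketch (toy intervals, independent of the recal data) of the same idea: after sorting intervals by their minimum, clipping each interval's start to the previous interval's maximum and summing the remaining lengths gives the covered length without double counting overlaps. It assumes the maxima are non-decreasing after sorting, as with typical homologous-series mass ranges.

```r
# Toy mass-range intervals (Min, Max); [100,200] and [150,250] overlap
mins <- c(100, 300, 150)
maxs <- c(200, 400, 250)

# Sort both vectors by the interval minimum
o <- order(mins)
mins <- mins[o]
maxs <- maxs[o]

# Previous interval's maximum (0 for the first), mirroring lag(..., default = 0)
prev_max <- c(0, head(maxs, -1))

# Clip each start to the previous maximum, then sum the lengths
coverage <- sum(maxs - pmax(mins, prev_max))
coverage  # union is [100,250] plus [300,400] -> 150 + 100 = 250
```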