RECETOX / MFAssignR

The MFAssignR package was designed for multi-element molecular formula (MF) assignment of ultrahigh resolution mass spectrometry measurements. A number of tools for internal mass recalibration, MF assignment, signal-to-noise evaluation, and unambiguous formula selections are provided.
GNU General Public License v3.0

Investigate the calibration functions for which series to use #22

Closed hechth closed 2 weeks ago

hechth commented 9 months ago

RecalList produces a table of outputs with different scores, series identifiers, etc. Those are not that well documented in the source code, so it would be great to get some more information. What are the scores, and what do they represent? What does the "series" identifier actually mean? The user should be able to decide which series to choose based on the data provided in the table. Maybe we could then even build another tool which chooses the best recalibration series based on some higher-level parameters, like the instrumental method or platform?

KristinaGomoryova commented 1 month ago

So, here it's a bit complicated and I think we will have to re-implement the current solution.

RecalList outputs a dataframe containing the CH2 homologous series that have more than 3 members, along with the following metrics:

The Recal function itself can now take up to 10 series.

To choose the optimal ones, the criteria should be:

KristinaGomoryova commented 1 month ago

The main problem is probably the computational capacity. We get 225 series using the Raw_Neg_ML data (the model data of MFAssignR). We want the combination of 10 series out of all of them which best fulfills the criteria - the problem is that picking 10 series out of 225 already gives about 7.48 × 10^16 possible combinations, which is way too many.

I tried to pre-filter the dataset in the following manner:

This reduces the size to 94 series. If we also restrict the Abundance score to > 100, we get 33. This suddenly reduces the number of combinations to 92,561,040, which is much better I would say :)
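The combination counts above can be checked directly with base R's `choose()`:

```r
# Ways to pick 10 series out of the unfiltered 225
choose(225, 10)   # approx. 7.48e16

# Ways to pick 10 out of the 33 series left after filtering
choose(33, 10)    # 92561040
```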

KristinaGomoryova commented 1 month ago

How it could be done eventually...

# Libraries required
library(dplyr)
library(gtools)
library(tidyr)

# Data input
data <- read.delim("recalList.tabular")

# Arrange the data
data.subset <- data %>%
  separate(col = Mass.Range, into = c('Min.Mass.Range', 'Max.Mass.Range'), sep = "-") %>%
  mutate(Min.Mass.Range = as.numeric(Min.Mass.Range), 
         Max.Mass.Range = as.numeric(Max.Mass.Range)) %>%
  mutate(Series.Length = Max.Mass.Range - Min.Mass.Range) %>%
  filter(Abundance.Score > 100) %>%
  filter(Peak.Distance < 2) 

global_min <- min(data.subset$Min.Mass.Range) + 100  # 100 Da tolerance
global_max <- max(data.subset$Max.Mass.Range) - 100  # 100 Da tolerance

# Create all combinations of 5 series (10 would be computationally infeasible here)
iter <- combinations(nrow(data.subset), 5, v = 1:nrow(data.subset))

coversRange <- data.frame(iter, coversRange = 0)

# Check if each combination covers the whole data range
for (i in 1:nrow(iter)) {
  comb <- iter[i, ]
  subset <- data.subset[comb, ]
  if (min(subset$Min.Mass.Range) <= global_min &&
      max(subset$Max.Mass.Range) >= global_max) {
    coversRange$coversRange[i] <- 1
  }
}

# Subset only those combinations which cover the whole range
coversRangeTrue <- coversRange[coversRange$coversRange == 1, ]

# Compute the scores for one combination of series
score_combination <- function(combination) {
  # Sort by the series' minimum mass so the lag-based overlap
  # correction in the coverage computation below works correctly
  combination <- combination[order(combination$Min.Mass.Range), ]
  series <- paste(combination$Series, collapse = "; ")
  total_abundance <- sum(combination$Abundance.Score)
  total_series_length <- sum(combination$Series.Length)
  peak_proximity <- sum(1/(combination$Peak.Score))  # smaller values are better
  peak_distance_proximity <- sum(1/(combination$Peak.Distance - 1))  # closer to 1 is better
  # Mass range covered, with overlaps between consecutive series removed
  coverage <- sum(combination$Max.Mass.Range -
                  pmax(combination$Min.Mass.Range,
                       lag(combination$Max.Mass.Range, default = 0)))
  coverage_percent <- coverage/((global_max + 100) - (global_min - 100))*100
  return(list(
    total_abundance = total_abundance,
    total_series_length = total_series_length,
    peak_proximity = peak_proximity,
    peak_distance_proximity = peak_distance_proximity,
    series = series,
    coverage_percent = coverage_percent
  ))
}

scores <- list()

# Score only the combinations which cover the whole range; the first
# 5 columns of coversRangeTrue hold the series indices of each combination
for (i in 1:nrow(coversRangeTrue)) {
  comb <- as.numeric(coversRangeTrue[i, 1:5])
  subset <- data.subset[comb, ]
  comb_score <- score_combination(subset)
  scores <- append(scores, list(comb_score))
}

scores_df <- do.call(rbind, lapply(scores, as.data.frame))

# Keep combinations with coverage > 90%, compute an overall score,
# arrange from highest to lowest, and take the top 10 combinations
scores_df %>%
  filter(coverage_percent > 90) %>%
  rowwise() %>%
  mutate(sum_score = sum(total_abundance, total_series_length,
                         peak_proximity, peak_distance_proximity,
                         coverage_percent)) %>%
  ungroup() %>%  # drop rowwise grouping so the duplicate filter sees all rows
  arrange(desc(sum_score)) %>%
  distinct(series, .keep_all = TRUE) %>%
  head(10)
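As a sanity check of the coverage logic, here is a self-contained base-R sketch (toy intervals, independent of the recal data) of the same idea: after sorting intervals by their minimum, clipping each interval's start to the previous interval's maximum and summing the remaining lengths gives the covered length without double counting overlaps. It assumes the maxima are non-decreasing after sorting, as with typical homologous-series mass ranges.

```r
# Toy mass-range intervals (Min, Max); [100,200] and [150,250] overlap
mins <- c(100, 300, 150)
maxs <- c(200, 400, 250)

# Sort both vectors by the interval minimum
o <- order(mins)
mins <- mins[o]
maxs <- maxs[o]

# Previous interval's maximum (0 for the first), mirroring lag(..., default = 0)
prev_max <- c(0, head(maxs, -1))

# Clip each start to the previous maximum, then sum the lengths
coverage <- sum(maxs - pmax(mins, prev_max))
coverage  # union is [100,250] plus [300,400] -> 150 + 100 = 250
```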