MatthiasPucher / staRdom

staRdom is a package for R to analyse fluorescence and absorbance data of dissolved organic matter (DOM).
21 stars 2 forks source link

white lines when using "extend2largest" #19

Closed crolland1998 closed 2 years ago

crolland1998 commented 3 years ago

Hi!

I am currently using staRdom package to analyse CDOM in mountain streams. I have a set of 94 eem data and I am currently processing the pre-treatment part to get the peaks, and then do a PARAFAC analysis. My problem is that somedays, when I load my eems and apply eem_list <- eem_extend2largest(eem_list, interpolate = TRUE, extend = TRUE), white horizontal lines appear on my graphs (see pictures under). I don't understand where it comes from... This problem does not happen everyday, and at first I though that some of my raw eems were to large, so I deleted from my files the ones that where the heavier, it worked for one day, but know whith all files of more or less the same size, the problem is still here.

pastedImage

I hope that you can help, Best regards, Camille

MatthiasPucher commented 3 years ago

Hi Camille, Thanks for bringing this up. I guess, it might be related to the interpolation, but I would have to check more in detail. Would it be possible, that you send me some files, which produce the problem, so I can have a look at it? Matthias

crolland1998 commented 3 years ago

Hi Matthias,

Sorry for answering so late! But here I got the time to explain better two issues that I have/had with stardom.

1st ISSUE: extend2largest

It took me sometime to understand what was going on. My problem was that after applying extend2largest, I had blank lines, and the residual function after building a PARAFAC model was not working. When trying red2smallest function, the emission vector was set null for all samples.

My data set was split in two in fact, all samples having 202 elements in the emission vector, but the values where slightly shifted between the two sets:

# > eem_getextreme(eem_list[1])
# $em
# [1] 250.334 580.359
#
# $ex
# [1] 240 500
#
# > eem_getextreme(eem_list[100])
# $em
# [1] 249.830 579.864
#
# $ex
# [1] 240 500

Finally, I just changed the em vector in all samples to make sure it was the same, as the shift between similar wavelengths was very small it did not transform my eem. Moreover, it allowed me to avoid interpolating, and keep a smaller set of emission values (see new function below). But indeed, as you mentioned in your mail, the problem was probably interpolation, I did not set the interpolation to TRUE when using extend2largest. I checked that interpolating correctly while applying extend2largest seems to solve the white lines problem, however, I did not try to rebuild a PARAFAC model with the extended2laregest data set.


# Setting the same emission vector for the whole data set :
equal_emission <- function(eem_list, em){
  for (i in 1:length(eem_list)){
    eem_list[[i]]$em <- em
  }
  return(eem_list)
}

# new vector for data set: contain as much values that previous ones, and the increment is the same that previous ones
em = c(250)
for (i in 2:202){
  em[i] <- em[i-1]+(580-250)/201
}

2nd ISSUE: values of B and C loads in my PARAFAC model

I built a PARAFAC model with 223 eem, and 4 components, using nonneg and normalised data set. Previous, I removed Raman/Rayleigh scatters, and outliers, did blank correction etc (mainly following PARAFAC stardom tutorial).

I am using open fluor to compare my results with other models. I am studying Alpine glacial/non-glacial streams, thus CDOM concentration is quite low. When downloading my results for open fluor, I must normalise loads in B and C modes as max in B mode is 2.889 and max in C mode is 1.563, and open fluor does not accept values above 1. It is way above models run with similar samples (alpine, glacial systems), and it is surprising that B mode values are much above C mode values; in most studies C values are much lower. Most studies used DrEEM, and the only other one using Stardom is yours, with also the loads maximum set to 1. I don't know if it comes from stardom PARAFAC calculation, or my set of data. I can share you an extract of it if you want to check (EEM obtained with aqualog, cuve of 1cm, and absorbance data with separated absorbance spectro, cuve of 10cm.

Hope I am clear and thanks for this great package!

Best,

Camille

MatthiasPucher commented 3 years ago

Hi Camille, the issue with the uneven emission wavelengths stems from the measuring in some instruments. I have never worked with one of those. Since you slightly change the wavelengths, I would be careful. I would have some ideas: 1) do not calculate peaks or indices from the samples with adjusted em wavelengths, it is not necessary. 2) an interesting way would be to define a wavelength vector for all samples and generate new data by interpolating the new wavelengths from the original data. 3) I would expect PARAFAC to be robust enough not to cause any problems with what you did. But this is just a guess, that needs to be tested. 4) Since this problem is not related to staRdom only, there might be ways how to approach this in other software. If someone has an idea, I am happy to include this in staRdom.

2nd issue: Since the original data is represented by a factor of vectors, you can multiply any vector with a scalar and as long as you divide another vector by the same scalar, the model does not change as a whole. This means, I think, there is no meaning of the actual maxima in B or C modes. It might stem from the data, the specific solving algorithm and the random initialization. The PARAFAC approach itself does not distinguish between models that are differently rescaled. The function eempf_rescaleBC() does exactly that. My personal way so far is to rescale B and C to a maximum of 1. This means, the A modes represent the fluorescence at the peak of a certain component. In my understanding, is this a meaningful number, since it has a representation in the measurements. My overall suggestion to your second issue is that you are fine, if you normalize B and C to a maximum of 1. If you, or anyone has another opinion, I would be happy to discuss. best wishes Matthias

crolland1998 commented 3 years ago

Hi,

Thank you for your answer. It seems to me that slightly changing the wavelengths did not affect the peak picking nor the PARAFAC model.

Otherwise, I would like to share a new function that I created because I needed to remove more precisely the noise in scattering bands. This function allows to remove above and under part of a scattering band unsymmetrically, which was usefull because in my data set noise under the middle line of scattering bands was more important than above. Hope it might help someone !

# source functions taken from github, required in the code to change other functions

.find_raman_peaks <- function(ex) {
  # For water, the Raman peak appears at a wavenumber 3600 cm lower than the
  # incident wavenumber. For excitation at 280 nm, the Raman peak from water
  # occurs at 311 nm. Source : Principles of Fluorescence Spectroscopy (2006) -
  # Third Edition.pdf

  ## Convert wavenumber from nm to cm
  ex_wave_number <- 1 / ex

  ## For water. 3600 nm = 0.00036 cm
  raman_peaks <- ex_wave_number - 0.00036 # I think Stedmon use 3400 TODO

  ## Bring back to nm
  raman_peaks <- 1 / raman_peaks

  # raman_peaks <- -(ex / (0.00036 * ex - 1))

  return(raman_peaks)
}

.is_eemlist <- function(eem) {
  ifelse(class(eem) == "eemlist", TRUE, FALSE)
}

.is_eem <- function(eem) {
  ifelse(class(eem) == "eem", TRUE, FALSE)
}

# function which allows to remove scattering non symmetrically

eem_remove_scattering_nonsym <- function (eem, type, order = 1, width_under = 5, width_above = 5) {
  stopifnot(.is_eemlist(eem) | .is_eem(eem), all(type %in%
                                                   c("raman", "rayleigh")), is.numeric(order),
            is.numeric(width_under), is.numeric(width_above), length(order) == 1, length(type) ==
              1, length(width_above) == 1, length(width_under) == 1)
  if (.is_eemlist(eem)) {
    res <- lapply(eem, eem_remove_scattering_nonsym, type = type,
                  order = order, width_under = width_under, width_above = width_above)
    class(res) <- class(eem)
    return(res)
  }
  x <- eem$x
  em <- eem$em
  ex <- eem$ex
  if (type == "raman") {
    ex <- .find_raman_peaks(eem$ex)
  }
  ind1 <- mapply(function(x) em <= x, order * ex - width_under)
  ind2 <- mapply(function(x) em <= x, order * ex + width_above)
  ind3 <- ifelse(ind1 == TRUE | ind2 == FALSE, 1, NA)
  x <- x * ind3
  res <- eem
  res$x <- x
  attributes(res) <- attributes(eem)
  attr(res, "is_scatter_corrected") <- TRUE
  class(res) <- class(eem)
  return(res)
}

Best, Camille