bryanhanson / ChemoSpec

R functions for the chemometric analysis of spectra
https://bryanhanson.github.io/ChemoSpec/

Create Spectra Object from DataFrame #12

Closed rvernica closed 4 years ago

rvernica commented 7 years ago

This is a feature request. From the docs, it seems that the only way to create a Spectra object is to have the data stored in files. If the data does not originate from files, how can one create a Spectra object? For example, the data might come from a database. More generally, a way to create Spectra objects from data frames would be useful.

bryanhanson commented 7 years ago

Thanks for the suggestion. Take a look at ?matrix2SpectraObject. It requires the matrix to be in a file which in the short run you could create by writing your data.frame to a file (one extra step). If this function seems like the missing function you wish was there, I can update it to accept a data.frame from the local environment, assuming that samples were in rows and the colnames were in fact the sample names.
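A minimal sketch of the "one extra step" (writing a data frame to a file), using base R only. The toy data frame and its layout (samples in rows, one column per frequency) are assumptions made for illustration; the layout that matrix2SpectraObject actually expects should be checked in ?matrix2SpectraObject before pointing it at the file.

```r
# Hypothetical toy data: 3 samples in rows, 2 frequencies in columns
df <- data.frame(
  sample = c("s1", "s2", "s3"),
  `400` = c(0.10, 0.20, 0.30),
  `401` = c(0.11, 0.21, 0.31),
  check.names = FALSE  # keep the numeric-looking column names as they are
)

# Write the data frame to a temporary CSV file
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)

# Read it back to confirm that the file holds what we expect
back <- read.csv(out_file, check.names = FALSE)
```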

rvernica commented 7 years ago

Right, matrix2SpectraObject might work.

So far, I have found it useful to store the spectrum data in a data frame like this:

df
  inte: num ...
  freq: num ...
  cls: Factor ...
  ...

So, if I have 5 spectra and 100 frequencies, the Data Frame will contain 500 observations. This came in handy when fetching the data from the database and when plotting it with ggplot.

bryanhanson commented 7 years ago

Can you send str(df) for one of these data frames that has more than one spectrum in it? Thx.

rvernica commented 7 years ago

Here is an example:

> str(df)
'data.frame':   10809 obs. of  6 variables:
 $ box         : Factor w/ 9 levels "2551","2552",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ id          : Factor w/ 9 levels "2017-08-02_buffer",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ wave        : num  400 401 402 403 404 405 406 407 408 409 ...
 $ inte.raw    : num  910 910 898 879 872 872 878 885 873 854 ...
 $ cls         : Factor w/ 3 levels "buffer",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ inte.raw.nor: num  0.274 0.274 0.258 0.233 0.224 ...

There are 9 spectra and 1201 frequencies. The frequency is in wave and the intensity is in inte.raw. There are 3 classes and the class is stored in cls.

bryanhanson commented 7 years ago

Just to double-check: df$wave has the wavelength repeated 9x and concatenated, and df$inte.raw is 9 concatenated spectra? And the wavelength values (each set of 9) are identical? Sounds like to reconstruct one takes the first 1201 values of wave and separates inte.raw into 9 groups of 1201 values to give 9 separate spectra plus the wavelengths. If that sounds correct, with this data set, how does one know there are 1201 values? Perhaps length(unique(df$wave))?
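The reconstruction described above can be sketched in base R as follows. This is only an illustration on made-up data: it assumes the rows are ordered spectrum by spectrum and that every spectrum lists the same wavelengths in the same order. The names long, freq and spectra are hypothetical.

```r
# Toy long-format data frame mimicking the structure in this thread
# (3 spectra x 5 wavelengths; all values are made up)
wave <- 400:404
ids <- c("a", "b", "c")
long <- data.frame(
  id = factor(rep(ids, each = length(wave))),
  wave = rep(wave, times = length(ids)),
  inte.raw = seq_len(length(ids) * length(wave))
)

no.pts <- length(unique(long$wave))  # points per spectrum
no.spec <- nrow(long) / no.pts       # number of spectra
stopifnot(no.spec == round(no.spec))

# The first block of wavelengths serves as the common frequency axis
freq <- long$wave[seq_len(no.pts)]

# byrow = TRUE: each consecutive run of no.pts values becomes one row
spectra <- matrix(
  long$inte.raw, nrow = no.spec, byrow = TRUE,
  dimnames = list(as.character(unique(long$id)), as.character(freq))
)
```

Each row of spectra is then one sample and each column one wavelength.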

rvernica commented 7 years ago

Yes to all of your questions. There has to be a unique identifier attribute for each spectrum; in this example, box and id are both unique identifiers (notice Factor w/ 9 levels). So, you can do:

> length(df[df$box=="2551","wave"])
[1] 1201
> nrow(df[df$box=="2551",])
[1] 1201
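
A quick way to run the same check for every spectrum at once is sketched below on made-up data, in base R. The data frame long and the flags same_length and same_axis are hypothetical names.

```r
# Toy long-format data: 3 spectra, 4 wavelengths each (made-up values)
long <- data.frame(
  id = rep(c("a", "b", "c"), each = 4),
  wave = rep(400:403, times = 3)
)

# Number of rows per spectrum; all counts should be equal
counts <- table(long$id)
same_length <- length(unique(as.vector(counts))) == 1

# Check that each spectrum uses exactly the same wavelength axis
by_id <- split(long$wave, long$id)
same_axis <- all(vapply(by_id, function(w) identical(w, by_id[[1]]), logical(1)))
```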

bryanhanson commented 7 years ago

I think I would write a function to convert that particular format to something more "tidy" in terms of one row per sample. Then just convert the resulting structure into a Spectra object for direct use in ChemoSpec. Do you want me to take a stab at it? If so, can you save df as an Rdata object and attach? If you want to do it yourself, be sure to call chkSpectra on the final object, and see ?Spectra for the necessary data types.

rvernica commented 7 years ago

df.zip

For the end-user function, I would envision something along the lines of:

> ssp <- as.Spectra(df, name = id, freq = wave, intensity = inte.raw, group = cls)

bryanhanson commented 7 years ago

Try this function and let me know. You'll need to change the extension to .R, as GitHub doesn't accept .R files: asSpectra.txt

rvernica commented 7 years ago

Looks good. I tried it like this:

as.Spectra(df, freq = "wave", intensities = "inte.raw", names = "id", gr.crit = df$cls,
           units = c("", ""), desc = "")

For user-friendliness, you might not require the quotes around Data Frame variables. For example, in ggplot2, you can do:

> ggplot(df, aes(x = wave, y = inte.raw)) + geom_point()

bryanhanson commented 7 years ago

I would not use gr.crit = df$cls, for two reasons. First, gr.crit needs only the unique values; you could possibly use unique(df$cls), but you still have to be careful because these are factors, plus it is evaluating >10K values rather than 3. Second, check your groups, as they may not be right. The unique values there, IIRC, are buffer, buffer_2 and buffer_3. gr.crit is used in a grep process, and hence grepping for "buffer" catches all the others. [update: just did this and yes, there is only one group and it is an integer due to taking the underlying encoded levels].
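
Both pitfalls can be reproduced in a few lines of base R. The sample names below are hypothetical, and the anchored patterns are just one possible way to avoid the overlap, not something the function does for you.

```r
# Hypothetical sample names that embed the group label
sample_names <- c("buffer_A", "buffer_2_A", "buffer_3_A")

# An unanchored pattern matches every name that contains "buffer"
all_hits <- grep("buffer", sample_names)

# Anchoring the pattern restricts each criterion to its own group
only_buffer <- grep("^buffer_[A-Z]$", sample_names)
only_buffer_2 <- grep("^buffer_2_[A-Z]$", sample_names)

# Factors store integer codes; convert to character to get the labels
cls <- factor(c("buffer", "buffer_2", "buffer_3"))
codes <- as.integer(cls)
labels <- unique(as.character(cls))
```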

On not-quoting arguments: that would be the NSE world, like much of the tidyverse. To me, the time to program that is much greater than the time to type the quotes, so I'm going to leave that as "an exercise for the reader" as they used to say.
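
For readers who do want to try that exercise, here is a minimal base R sketch of the idea, using substitute() and deparse(). The helper pull_col is hypothetical and not part of ChemoSpec; it accepts a column either as a bare name or as a quoted string.

```r
# Hypothetical helper: accept a column as a bare name or as a string
pull_col <- function(df, col) {
  expr <- substitute(col)  # capture the unevaluated argument
  nm <- if (is.character(expr)) expr else deparse(expr)
  if (!nm %in% names(df)) stop("no column named '", nm, "'")
  df[[nm]]
}

toy <- data.frame(wave = c(400, 401, 402), inte.raw = c(910, 910, 898))
bare <- pull_col(toy, wave)      # unquoted, ggplot2-style
quoted <- pull_col(toy, "wave")  # quoted, as in the function above
```

This only handles the simplest case; it does not work when the column name is itself stored in a variable, which is part of why NSE takes real effort to program well.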

zeehio commented 6 years ago

If this were to be implemented, could you use an S3 method? Something like:

as.Spectra <- function(x, ...) {
  UseMethod("as.Spectra")
}

as.Spectra.data.frame <- function(x, freq, intensities, names, gr.crit,
                                  units = c("", ""), desc = "", ...) {

    # x: long-format data frame (spectra concatenated one after another)
    # freq, intensities, names: column names in x, given as character strings
    # gr.crit: character patterns that are grepped against the sample names

    # Helper function
    isWholeNo <- function(x, tol = .Machine$double.eps^0.5) {abs(x - round(x)) < tol}

    # A few checks
    if (length(units) != 2) stop("units should have length 2")

    # Determine dimensions
    no.pts <- length(unique(x[,freq]))
    no.spec <- nrow(x)/no.pts
    if (!isWholeNo(no.spec)) stop("no.spec was not an integer")

    # Now build the Spectra object
    Spectra <- vector("list", 9)
    Spectra[[1]] <- unique(x[,freq]) # frequency
    Spectra[[2]] <- matrix(x[,intensities], nrow = no.spec, byrow = TRUE)
    Spectra[[3]] <- as.character(unique(x[,names])) # names
    Spectra[[4]] <- rep(NA_character_, no.spec) # groups
    Spectra[[5]] <- rep("black", no.spec) # colors
    Spectra[[6]] <- rep(1L, no.spec) # sym
    Spectra[[7]] <- rep("a", no.spec) # alt.sym
    Spectra[[8]] <- units # units
    Spectra[[9]] <- desc # desc

    # Update groups
    for (i in seq_along(gr.crit)) {
        hits <- grep(gr.crit[i], Spectra[[3]])
        if (length(hits) == 0) warning("There was no match for gr.crit value ", gr.crit[i], " among the sample names.")
        Spectra[[4]][hits] <- gr.crit[i]
        }
    Spectra[[4]] <- as.factor(Spectra[[4]])

    # Clean up and verify

    class(Spectra) <- "Spectra"
    names(Spectra) <- c("freq", "data", "names", "groups", "colors", "sym", "alt.sym", "units", "desc")
    chkSpectra(Spectra)
    return(Spectra)
}

With this approach other packages could implement conversion methods to your Spectra class, making it easier to exchange NMR data between packages.
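
To illustrate how the dispatch would work, here is a self-contained toy sketch. The generic as_spectra_demo and the class spectra_demo are hypothetical stand-ins, so nothing here touches the real Spectra class. Another package would only need to supply its own method for its own class.

```r
# Hypothetical generic, standing in for as.Spectra
as_spectra_demo <- function(x, ...) UseMethod("as_spectra_demo")

# Fallback: fail with a clear message for unsupported classes
as_spectra_demo.default <- function(x, ...) {
  stop("no as_spectra_demo method for class '", class(x)[1], "'")
}

# Matrix method: samples in rows, one column per frequency
as_spectra_demo.matrix <- function(x, freq, ...) {
  stopifnot(length(freq) == ncol(x))
  structure(list(freq = freq, data = unname(x)), class = "spectra_demo")
}

# Data frame method: convert to a matrix and reuse the matrix method
as_spectra_demo.data.frame <- function(x, freq, ...) {
  as_spectra_demo(as.matrix(x), freq = freq, ...)
}

wide <- data.frame(a = c(1, 2), b = c(3, 4))
obj <- as_spectra_demo(wide, freq = c(400, 401))

# An unsupported input falls through to the default method
err <- tryCatch(as_spectra_demo("not supported"),
                error = function(e) conditionMessage(e))
```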

bryanhanson commented 6 years ago

Hi Sergio… I’m traveling today; I’ll get back to you tonight. Bryan

zeehio commented 6 years ago

No hurry! It was just a suggestion. Thanks a lot for your work, and have a nice and safe trip!

bryanhanson commented 6 years ago

Sergio, I think something like this is a good idea. Over the years, I've written a lot of scripts for users who have data in all sorts of formats. Many of them need a totally custom approach, but a lot of them have data frames, so the function you suggest would likely get many users most of the way. I think I would let the user disable the checks if desired -- I do that on matrix2SpectraObject because often the names come very mangled and not R-suitable, so you have to run the function, see what you have, and then make a few final adjustments.

I'm currently working on a significant re-working of the ChemoSpec internals, and your suggestion fits in well. It will likely take me about a month but it's on the to-do list. Thank you!

bryanhanson commented 6 years ago

Sergio, do you have an example of another package whose data you want to convert to a Spectra object? I need to test the version of the function I am writing. Thanks.

zeehio commented 6 years ago

Honestly, I just have a custom package I am developing for a company. I hope to release it eventually, but it's not in my hands.

If you want feedback or a code review I'll be happy to help :smiley:

bryanhanson commented 6 years ago

I think a key question is whether the incoming data frames will have samples in rows or samples in columns. I plan to try to write something that would handle either, but there are a lot of possibilities and I have to think it through a bit. First however, I have to get a fresh version of ChemoSpec out to CRAN.
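
A rough sketch of what handling either orientation might look like, in base R. The function name to_samples_in_rows and its samples_in argument are hypothetical, and the sketch assumes a purely numeric data frame whose row and column names carry the sample names and frequencies.

```r
# Hypothetical helper: return a matrix with samples in rows,
# whichever orientation the incoming data frame uses
to_samples_in_rows <- function(df, samples_in = c("rows", "cols")) {
  samples_in <- match.arg(samples_in)
  m <- as.matrix(df)
  if (samples_in == "cols") m <- t(m)  # transpose if samples are in columns
  m
}

# Toy data: 2 samples, 2 frequencies, in both orientations
by_row <- data.frame(`400` = c(1, 2), `401` = c(3, 4),
                     check.names = FALSE, row.names = c("s1", "s2"))
by_col <- as.data.frame(t(by_row))

m1 <- to_samples_in_rows(by_row, "rows")
m2 <- to_samples_in_rows(by_col, "cols")
```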

Leprechault commented 5 years ago

First, simulate some data:

set.seed(123)
bands <- 20
data <- data.frame(matrix(runif(60 * bands), ncol = bands))
colnames(data) <- paste0(1:bands)

Structure:

str(data)

Convert to a Spectra object using zeehio's function:

test <- as.Spectra(data)

This doesn't work for a data frame object. I need some help, please. Thanks.

bryanhanson commented 4 years ago

Closing: the wide variety of possible input formats is probably too hard to handle in a universal way. Better to use the options for importing and add to them as needed.