RGLab / flowStats

flowStats: algorithms for flow cytometry data analysis using BioConductor tools
15 stars 10 forks source link

gaussNorm: I don't understand the resulting transformation #54

Open SamGG opened 1 month ago

SamGG commented 1 month ago

I see problems with this code:

https://github.com/RGLab/flowStats/blob/fabb13191a2567e41a89691cab31507d4cbddc7a/R/gaussNorm.R#L497-L500

Code to reproduce my trials

# aim to untouch peak at 0 and shift at 1.24 to 1.74, ie add 0.5
landm = c(0, 0, 1.24, 1.74)
# white noise to visualize the transform
xx = runif(N, -1, 3)
# s from the data = sd(data) ~ 0.6

# problem: ends are not equal to the shift at the nearest landmark
# the negative peak ends at -0.5 in real data
# the positive peak ends at +2.5 in real data
yy = flowStats:::register.function(
    xx, c(.5, .5), landm[c(1,3)], landm[c(1,3)+1]-landm[c(1,3)])
plot(xx, yy); abline(c(0,1)); abline(c(0.5,1))
plot(xx, yy-xx, asp = 1); abline(h=c(0,0.5))

# user should be cautious when setting too small values for s
yy = flowStats:::register.function(
    xx, c(.25, .5), landm[c(1,3)], landm[c(1,3)+1]-landm[c(1,3)])
plot(xx, yy); abline(c(0,1)); abline(c(0.5,1))
plot(xx, yy-xx, asp = 1); abline(h=c(0,0.5))

Here are the 2nd and 4th graphs showing the shift applied to the data as a function of the intensity (from asinh(FluoIntensity/cofactor)).

image

image

@mikejiang @sofieVG @cciccole

gfinak commented 1 month ago

@SamGG, just a heads-up that the authors of the gaussNorm approach are from this paper https://onlinelibrary.wiley.com/doi/10.1002/cyto.a.20823 (14 years ago). I'm not sure if any of them are still involved in computational flow. gaussNorm, as far as I know, is long-since orphaned and unsupported. I'm not sure who can provide insight into what the code is doing or why it does so, or any possible bugs, known or unknown.

SamGG commented 1 month ago

Hi Greg. Thanks for your quick feedback.

Except Ryan, I don't see these names nowdays, and he seems not to be on github. May @tslumley or @fhahne answer?

My main point was to warn potential users and to keep a track if someone wants to investigate.

What I coded for 2 peaks is:

  if(length(m)==2){
    sh=(shift[1]-shift[2])
    w1 = flowStats:::gau(data, s[1], m[1])
    w2 = flowStats:::gau(data, s[2], m[2])
    ws = w1 + w2
    data=data+w1/ws*(sh/2)
    data=data-w2/ws*(sh/2)
    return(data+shift[1]-(sh/2))
  }

This leads to correct results IMHO when s is the same for the 2 peaks.

s = c(0.5, 0.5)

image

s = c(0.25, 0.25)

image

s = c(0.25, 0.5)

image

The latter is still slightly weird (although comprehensive) as s is not similar. This should not arise as the API does not provide to adjust it. BTW, the estimation of s is the sd of all the intensity of the channel (code below). This relies on all the data, not of each peak, which is a coarse estimation in my view.

https://github.com/RGLab/flowStats/blob/fabb13191a2567e41a89691cab31507d4cbddc7a/R/gaussNorm.R#L472-L475

cciccole commented 1 month ago

@SamGG I would need to see more practical examples that show the data -- not just the diff -- in order to understand more clearly, though I don't doubt you've found some sensible improvement.

However, separately, I also think this algorithm has other noteworthy problems so I'm not sure how much thought I can give this particular one 😁 If I recall correctly:

The core of the warping, when it gets it right, seems pretty good. Unfortunately there are many situations where it fails. If the core warping strategy were wrapped with a more robust surrounding layer, that could be valuable.