RGLab / flowIncubator

Routines that don't belong to the core flow packages yet.
1 stars 3 forks source link

impute gates that failed the QA check #10

Open mikejiang opened 11 years ago

mikejiang commented 11 years ago

a new API is going to take a list of failed samples as input and return a list of reference samples

refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples)

Once the reference samples are selected, it should be fairly straightforward to do the the gate imputation with the existing APIs ( getGate and setGate )

#extract reference gates
refGates <- sapply(refSamples,function(i)getGate(gs[[i]],"MTG_gate"))

#impute the gates
setGate(gs[failedSamples],"MTG_gate",refGates)
raphg commented 11 years ago

Really nice Mike. I can't wait to see some real tests. Perhaps on the Newell data.

On Wed, Jun 26, 2013 at 5:10 PM, Mike Jiang notifications@github.comwrote:

a new API is going to take a list of failed samples as input and return a list of reference samples

refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples)

Once the reference samples are selected, it should be fairly straightforward to do the the gate imputation with the existing APIs ( getGate and setGate )

extract reference gatesrefGates <- sapply(refSamples,function(i)getGate(gs[[i]],"MTG_gate"))

impute the gatessetGate(gs[failedSamples],"MTG_gate",refGates)

— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/flowIncubator/issues/10 .

mikejiang commented 11 years ago

By using flowCore::expressionFilter

expression1 <- paste0("`",params,"`>0")
ef <- char2ExpressionFilter(expression1)
> ef
expression filter '`<Blue F 525/50-A>`>0' evaluating the expression:
`<Blue F 525/50-A>`>0

I am able to gate out the outlier events (below zero)

tData <- Subset(tData,ef)

And here is the new emd distance, which seems to be more reasonable. rplot001

raphg commented 11 years ago

Yes it does look more resonable. I am surprised that outliers make such a big difference, because the density is so low. I just did a quick test comparing emd and ks.test, and the latter seems more robust and it's faster. Can you try it (without) filtering outliers to see what it gives.

On Fri, Jun 28, 2013 at 9:56 AM, Mike Jiang notifications@github.comwrote:

By using flowCore::expressionFilter

expression1 <- paste0("",params,">0")ef <- char2ExpressionFilter(expression1)> efexpression filter '<Blue F 525/50-A>>0' evaluating the expression:<Blue F 525/50-A>>0

I am able to gate out the outlier events

tData <- Subset(tData,ef)

And here is the new emd distance, which seems more reasonable. [image: rplot001]https://f.cloud.github.com/assets/1385649/723264/ac65f00c-e013-11e2-9883-0cc2e00a4411.png

— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/flowIncubator/issues/10#issuecomment-20200514 .

mikejiang commented 11 years ago

EM with outlier filter

> system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples , method = "em")))
   user  system elapsed 
 30.002   0.000  29.466

and ks.test without outlier filter is indeed faster and more robust

> system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples, method = "ks.test"))
   user  system elapsed 
  4.096   0.000   4.313

rplot001

raphg commented 11 years ago

Great.

On Fri, Jun 28, 2013 at 10:44 AM, Mike Jiang notifications@github.comwrote:

EM with outlier filter

system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples , method = "em"))) user system elapsed 30.002 0.000 29.466

and ks.test without outlier filter is indeed faster and more robust

system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples, method = "ks.test")) user system elapsed 4.096 0.000 4.313

[image: rplot001]https://f.cloud.github.com/assets/1385649/723492/8458c868-e019-11e2-9739-732f6923196a.png

— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/flowIncubator/issues/10#issuecomment-20203117 .

gfinak commented 11 years ago

Nice!

On 06/28/2013 01:44 PM, Mike Jiang wrote:

|EM| with |outlier filter|

system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples , method = "em"))) user system elapsed 30.002 0.000 29.466

and |ks.test| without |outlier filter| is indeed faster and more robust

system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples, method = "ks.test")) user system elapsed 4.096 0.000 4.313

rplot001 https://f.cloud.github.com/assets/1385649/723492/8458c868-e019-11e2-9739-732f6923196a.png

— Reply to this email directly or view it on GitHub https://github.com/RGLab/flowIncubator/issues/10#issuecomment-20203117.

ramhiser commented 11 years ago

@mikejiang This is nice.

While I think the K-S approach deserves some credit here, it will give some peculiar results in other cases. Consider for example the following contrived example.

In this example, we are trying to determine if x1 is nearer to x2 or x3 using the K-S criterion. Notice that there is no overlap in the figure of the empirical CDFs, but in the density plot, there is some overlap. Because the K-S statistic simply finds the largest difference in empirical CDFs, samples x2 and x3 are equidistant from x1 because there is no overlap of the samples. Even if there were, it would be relatively straightforward to contrive an example where we would choose x3 as often as x2.

With this in mind, I am in favor of computing some divergence (e.g., Kullback-Leibler) between the estimated densities. I will concoct something soon.

raphg commented 11 years ago

In this case, 1 is the largest distance possible, and it makes sense given that there is no overlap between any of the distributions. So it's doing the right thing, that is all three distributions are as far as they can be.

This being said, I am in favor of doing more comparison. John, thanks for looking at it.

On Fri, Jun 28, 2013 at 3:35 PM, John Ramey notifications@github.comwrote:

@mikejiang https://github.com/mikejiang This is nice.

While I think the K-S approach deserves some credit here, it will give some peculiar results in other cases. Consider for example the following contrived example https://gist.github.com/ramey/5888640.

In this example, we are trying to determine if x1 is nearer to x2 or x3using the K-S criterion. Notice that there is no overlap in the figure of the empirical CDFs, but in the density plot, there is some overlap. Because the K-S statistic simply finds the largest difference in empirical CDFs, samples x2 and x3 are equidistant from x1 because there is no overlap of the samples. Even if there were, it would be relatively straightforward to contrive an example where we would choose x3 as often as x2.

With this in mind, I am in favor of computing some divergence (e.g., Kullback-Leibler) between the estimated densities. I will concoct something soon.

— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/flowIncubator/issues/10#issuecomment-20218235 .

mikejiang commented 8 years ago

Add an options passed to specify the samples passes QA and can be served as references. By default, all samples other than failed are used. But sometime it is helpful to narrow it down to a few of really good samples.

mikejiang commented 7 years ago

@gfinak , 2d gate imputing is added through emd2d as you suggested, it is very slow though ,which is one of the reason @raphg asked me to switch to ks.test for 1d gate. (See the previous discussion of this thread). I've added parallel through mc.cores and hopefully it can finish the job in the reasonable time for you.

Here is the example for matching one bad sample against 8 good samples

> system.time(res <- nearestSamples(gs, node = "CD4", failed = "1349_3_Tcell_A06.fcs", gridsize = c(70, 70), mc.cores = 8))
Finding reference sample for: 1349_3_Tcell_A06.fcs
    user   system  elapsed 
1123.817    1.582  160.287 

We can fiddle with gridsize for bkde2D to find the optimal default settings (trade off between speed and accuracy)

mikejiang commented 7 years ago

image

> pData(gs)[, 3:4]
                      Sample Replicate
12828_1_Tcell_A01.fcs  12828         1
12828_2_Tcell_A02.fcs  12828         2
12828_3_Tcell_A03.fcs  12828         3
1349_1_Tcell_A04.fcs    1349         1
1349_2_Tcell_A05.fcs    1349         2
1349_3_Tcell_A06.fcs    1349         3
1369_1_Tcell_A07.fcs    1369         1
1369_2_Tcell_A08.fcs    1369         2
1369_3_Tcell_A09.fcs    1369         3

earth mover on 2d data is pretty expensive, which costs more than 2min with parallel computing in 6 cores.

> system.time(res <- nearestSamples(gs, node = node
+                                   , failed = failed
+                                   , passed = passed
+                                   , gridsize = c(50, 50)
+                                    , method = "em"
+                                   , mc.cores = 6
+                                   )
+             )
 user  system elapsed 
657.522  10.993 156.177
> res
  12828_1_Tcell_A01.fcs    1349_1_Tcell_A04.fcs    1369_1_Tcell_A07.fcs 
"12828_2_Tcell_A02.fcs"  "1349_3_Tcell_A06.fcs"  "1369_2_Tcell_A08.fcs" 

As @raphg suggested, we could use ks.test on each dimension and sum the test statistics, here is the result, which finished the job within seconds and matches the correct replicates!

> system.time(res <- nearestSamples(gs, node = node
+                                   , failed = failed
+                                   , passed = passed
+                                   , method = "ks.test"
+                                   , mc.cores = 6
+                                   )
+             )
   user  system elapsed 
  2.391   1.529   1.179 
> res
  12828_1_Tcell_A01.fcs    1349_1_Tcell_A04.fcs    1369_1_Tcell_A07.fcs 
"12828_3_Tcell_A03.fcs"  "1349_2_Tcell_A05.fcs"  "1369_2_Tcell_A08.fcs"