Open mikejiang opened 11 years ago
Really nice Mike. I can't wait to see some real tests. Perhaps on the Newell data.
On Wed, Jun 26, 2013 at 5:10 PM, Mike Jiang notifications@github.comwrote:
a new API is going to take a list of failed samples as input and return a list of reference samples
refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples)
Once the reference samples are selected, it should be fairly straightforward to do the the gate imputation with the existing APIs ( getGate and setGate )
extract reference gatesrefGates <- sapply(refSamples,function(i)getGate(gs[[i]],"MTG_gate"))
impute the gatessetGate(gs[failedSamples],"MTG_gate",refGates)
— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/flowIncubator/issues/10 .
By using flowCore::expressionFilter
expression1 <- paste0("`",params,"`>0")
ef <- char2ExpressionFilter(expression1)
> ef
expression filter '`<Blue F 525/50-A>`>0' evaluating the expression:
`<Blue F 525/50-A>`>0
I am able to gate out the outlier events (below zero)
tData <- Subset(tData,ef)
And here is the new emd
distance, which seems to be more reasonable.
Yes it does look more resonable. I am surprised that outliers make such a big difference, because the density is so low. I just did a quick test comparing emd and ks.test, and the latter seems more robust and it's faster. Can you try it (without) filtering outliers to see what it gives.
On Fri, Jun 28, 2013 at 9:56 AM, Mike Jiang notifications@github.comwrote:
By using flowCore::expressionFilter
expression1 <- paste0("
",params,"
>0")ef <- char2ExpressionFilter(expression1)> efexpression filter '<Blue F 525/50-A>
>0' evaluating the expression:<Blue F 525/50-A>
>0I am able to gate out the outlier events
tData <- Subset(tData,ef)
And here is the new emd distance, which seems more reasonable. [image: rplot001]https://f.cloud.github.com/assets/1385649/723264/ac65f00c-e013-11e2-9883-0cc2e00a4411.png
— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/flowIncubator/issues/10#issuecomment-20200514 .
EM
with outlier filter
> system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples , method = "em")))
user system elapsed
30.002 0.000 29.466
and ks.test
without outlier filter
is indeed faster and more robust
> system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples, method = "ks.test"))
user system elapsed
4.096 0.000 4.313
Great.
On Fri, Jun 28, 2013 at 10:44 AM, Mike Jiang notifications@github.comwrote:
EM with outlier filter
system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples , method = "em"))) user system elapsed 30.002 0.000 29.466
and ks.test without outlier filter is indeed faster and more robust
system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples, method = "ks.test")) user system elapsed 4.096 0.000 4.313
[image: rplot001]https://f.cloud.github.com/assets/1385649/723492/8458c868-e019-11e2-9739-732f6923196a.png
— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/flowIncubator/issues/10#issuecomment-20203117 .
Nice!
On 06/28/2013 01:44 PM, Mike Jiang wrote:
|EM| with |outlier filter|
system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples , method = "em"))) user system elapsed 30.002 0.000 29.466
and |ks.test| without |outlier filter| is indeed faster and more robust
system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples, method = "ks.test")) user system elapsed 4.096 0.000 4.313
rplot001 https://f.cloud.github.com/assets/1385649/723492/8458c868-e019-11e2-9739-732f6923196a.png
— Reply to this email directly or view it on GitHub https://github.com/RGLab/flowIncubator/issues/10#issuecomment-20203117.
@mikejiang This is nice.
While I think the K-S approach deserves some credit here, it will give some peculiar results in other cases. Consider for example the following contrived example.
In this example, we are trying to determine if x1
is nearer to x2
or x3
using the K-S criterion. Notice that there is no overlap in the figure of the empirical CDFs, but in the density plot, there is some overlap. Because the K-S statistic simply finds the largest difference in empirical CDFs, samples x2
and x3
are equidistant from x1
because there is no overlap of the samples. Even if there were, it would be relatively straightforward to contrive an example where we would choose x3
as often as x2
.
With this in mind, I am in favor of computing some divergence (e.g., Kullback-Leibler) between the estimated densities. I will concoct something soon.
In this case, 1 is the largest distance possible, and it makes sense given that there is no overlap between any of the distributions. So it's doing the right thing, that is all three distributions are as far as they can be.
This being said, I am in favor of doing more comparison. John, thanks for looking at it.
On Fri, Jun 28, 2013 at 3:35 PM, John Ramey notifications@github.comwrote:
@mikejiang https://github.com/mikejiang This is nice.
While I think the K-S approach deserves some credit here, it will give some peculiar results in other cases. Consider for example the following contrived example https://gist.github.com/ramey/5888640.
In this example, we are trying to determine if x1 is nearer to x2 or x3using the K-S criterion. Notice that there is no overlap in the figure of the empirical CDFs, but in the density plot, there is some overlap. Because the K-S statistic simply finds the largest difference in empirical CDFs, samples x2 and x3 are equidistant from x1 because there is no overlap of the samples. Even if there were, it would be relatively straightforward to contrive an example where we would choose x3 as often as x2.
With this in mind, I am in favor of computing some divergence (e.g., Kullback-Leibler) between the estimated densities. I will concoct something soon.
— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/flowIncubator/issues/10#issuecomment-20218235 .
Add an options passed
to specify the samples passes QA and can be served as references.
By default, all samples other than failed are used. But sometime it is helpful to narrow it down to a few of really good samples.
@gfinak , 2d
gate imputing is added through emd2d
as you suggested, it is very slow though ,which is one of the reason @raphg asked me to switch to ks.test
for 1d
gate. (See the previous discussion of this thread).
I've added parallel through mc.cores
and hopefully it can finish the job in the reasonable time for you.
Here is the example for matching one bad sample against 8
good samples
> system.time(res <- nearestSamples(gs, node = "CD4", failed = "1349_3_Tcell_A06.fcs", gridsize = c(70, 70), mc.cores = 8))
Finding reference sample for: 1349_3_Tcell_A06.fcs
user system elapsed
1123.817 1.582 160.287
We can fiddle with gridsize
for bkde2D
to find the optimal default settings (trade off between speed and accuracy)
> pData(gs)[, 3:4]
Sample Replicate
12828_1_Tcell_A01.fcs 12828 1
12828_2_Tcell_A02.fcs 12828 2
12828_3_Tcell_A03.fcs 12828 3
1349_1_Tcell_A04.fcs 1349 1
1349_2_Tcell_A05.fcs 1349 2
1349_3_Tcell_A06.fcs 1349 3
1369_1_Tcell_A07.fcs 1369 1
1369_2_Tcell_A08.fcs 1369 2
1369_3_Tcell_A09.fcs 1369 3
earth mover
on 2d
data is pretty expensive, which costs more than 2min
with parallel computing in 6
cores.
> system.time(res <- nearestSamples(gs, node = node
+ , failed = failed
+ , passed = passed
+ , gridsize = c(50, 50)
+ , method = "em"
+ , mc.cores = 6
+ )
+ )
user system elapsed
657.522 10.993 156.177
> res
12828_1_Tcell_A01.fcs 1349_1_Tcell_A04.fcs 1369_1_Tcell_A07.fcs
"12828_2_Tcell_A02.fcs" "1349_3_Tcell_A06.fcs" "1369_2_Tcell_A08.fcs"
As @raphg suggested, we could use ks.test
on each dimension and sum the test statistics, here is the result, which finished the job within seconds and matches the correct replicates!
> system.time(res <- nearestSamples(gs, node = node
+ , failed = failed
+ , passed = passed
+ , method = "ks.test"
+ , mc.cores = 6
+ )
+ )
user system elapsed
2.391 1.529 1.179
> res
12828_1_Tcell_A01.fcs 1349_1_Tcell_A04.fcs 1369_1_Tcell_A07.fcs
"12828_3_Tcell_A03.fcs" "1349_2_Tcell_A05.fcs" "1369_2_Tcell_A08.fcs"
a new API is going to take a list of failed samples as input and return a list of reference samples
Once the reference samples are selected, it should be fairly straightforward to do the the gate imputation with the existing APIs (
getGate
andsetGate
)