CellProfiler / CellProfiler-Analyst

Open-source software for exploring and analyzing large, high-dimensional image-derived data.
http://cellprofileranalyst.org

Exclude training set from test set [critical] #192

Closed: holgerhennig closed this issue 8 years ago

holgerhennig commented 8 years ago

Navigate to the Classifier window, create a training set, and hit the "Score All" and "Score Image" buttons. Currently, the training set is included in the test set.

Recommended fix: a) Exclude the training set from the test set. Example: an image has 225 objects, of which 100 were used for training; "Score All" should then show only 125 objects in total for that image.

b) Mark objects in the training set in, say, grey in "Score Image" (and add "training set" to the legend so that the marks can be shown/hidden in the image).

Further info: By default, the training set should be excluded from the test set. The user shouldn't even have the option to score objects used for training; otherwise the true positive rate becomes artificially high, which is an artefact.

Example db and properties file (images are imaging flow cytometry montages, typically 225 cells per image, arranged in a grid): https://www.dropbox.com/s/miz9jya3ywz300m/CellCycleJurkat_org.db?dl=0 https://www.dropbox.com/s/n4gwl53r6yhbvbc/CellCycleJurkat_MyExpt.properties?dl=0
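
For illustration, here's a minimal sketch of what fix (a) could look like, assuming objects are identified by (ImageNumber, ObjectNumber) pairs; the function and argument names are hypothetical and not CPA's actual API:

```python
# Hypothetical sketch (not CPA's actual API): drop training-set objects
# before scoring, so "Score All" only reports held-out objects.

def score_excluding_training(all_objects, training_keys, classify):
    """all_objects: iterable of (image_number, object_number, features).
    training_keys: set of (image_number, object_number) used for training.
    classify: callable mapping features to a predicted class label."""
    scores = {}
    for image_number, object_number, features in all_objects:
        if (image_number, object_number) in training_keys:
            continue  # skip objects the classifier was trained on
        scores[(image_number, object_number)] = classify(features)
    return scores
```

With 225 objects in an image and 100 of them in training_keys, this would report scores for the remaining 125, matching the example above.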

jhung0 commented 8 years ago

As of 256e6e4 there should be an option to show training and test objects of each class. Try it out.

holgerhennig commented 8 years ago

b) "score image": issue fixed a) "score all" issue not fixed. The "score all" still includes the training set. This can be very misleading and can lead to very high true positive rates which are just artefacts. Example: Say, we have 68 metaphase cells and 60 cells in the training set. With score all, say we score 66 metaphase cells corrently (this is the case in our data). Almost perfect score (66/68), but we were essentially scoring on the training set! In this example, "score all" should not be used for statistics. Pls exclude training set from "score all"

braymp commented 8 years ago

I suggest instead adding this as an option in the "Score All" dialog and adding it to the documentation. I can imagine a naive biologist expecting to see these entries and wondering where they are.

AnneCarpenter commented 8 years ago

Ah, I see what's going on here. I agree with Mark but for different reasons - most commonly users DO want to score all (even cells in the training set), because their goal is to get the final answer out of the images regardless of whether the computer scored a cell or whether they scored a cell manually. Excluding them makes no sense in such a case.

This is very different from excluding training images from the calculation of classification accuracy (here, excluding them is of course important).

I don't understand the features/functions we are talking about well enough to provide detailed guidance, but I hope this clarifies to some degree? I guess "Score All" is what produces the table of final results; does it make sense to include a column indicating whether an object was in the test set (0 if no, 1 if yes) so the user can decide what to do with it?

(BTW, regarding Holger's original comment (b): forgive my lack of knowledge of how it currently works, but if each class is marked in a particular color, I think it makes the most sense for training/test to be coded in a different way (rather than as a different color, gray). For example, you could use a different marker shape (training = squares, test = dots) or a different line style if we are talking about boxes around each object (training = dotted lines, test = solid lines). That way, if the user cares to examine the classes without caring whether scoring was manual or automated, they can focus on the color, and if they care about assessing accuracy they can focus on the marker shape or line style.)
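
A minimal matplotlib sketch of that idea (colour encodes class, marker shape encodes training vs. test); the coordinates, class names, and colours are invented for illustration:

```python
import matplotlib.pyplot as plt

# Invented example objects: (x, y, predicted class, in training set?)
objects = [
    (10, 12, "metaphase", True),
    (40, 55, "metaphase", False),
    (70, 20, "anaphase", False),
    (25, 80, "anaphase", True),
]

class_colors = {"metaphase": "tab:red", "anaphase": "tab:blue"}

for x, y, label, in_training in objects:
    plt.scatter(
        x, y,
        color=class_colors[label],
        marker="s" if in_training else "o",  # squares = training, dots = test
        label=f"{label} ({'training' if in_training else 'test'})",
    )

plt.legend()
plt.show()
```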

jhung0 commented 8 years ago

Yes, I'm confused about why "Score All" should exclude the training set. We could possibly add a column indicating whether an object is part of the training set or not.

I could look into different training/testing representations...not sure how straightforward it'd be.

AnneCarpenter commented 8 years ago

P.S. To edit my response: I did mean a column indicating whether an object is part of the training set or not, as you say (my post said "whether an object was in the test set (0 if no, 1 if yes)" when I meant "whether an object was in the training set (0 if no, 1 if yes)").

I think adding a column with that info is the simplest solution that satisfies Holger's concerns, but he should confirm it really does solve his problem before we do anything.
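
For what it's worth, a sketch of the simplest version of that column, assuming the per-object results are available as a pandas DataFrame keyed by ImageNumber/ObjectNumber; the data and the column name InTrainingSet are hypothetical:

```python
import pandas as pd

# Hypothetical per-object scoring results.
results = pd.DataFrame({
    "ImageNumber":    [105, 105, 105, 106],
    "ObjectNumber":   [1, 2, 3, 1],
    "PredictedClass": ["metaphase", "metaphase", "anaphase", "metaphase"],
})

# (ImageNumber, ObjectNumber) keys of the objects used to train the classifier.
training_keys = {(105, 1), (106, 1)}

# 1 if the object was in the training set, 0 otherwise.
results["InTrainingSet"] = [
    int(key in training_keys)
    for key in zip(results["ImageNumber"], results["ObjectNumber"])
]
```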

holgerhennig commented 8 years ago

Mark's suggestion sounds good to me: adding "score excluding training set" as an option in the "Score All" dialog and documenting it. That way the user also knows what CPA is doing (i.e., whether or not the "Score All" statistics include predictions on the training set).

Here's an example. Currently the "Score All" table looks like the screenshot below (it's a per-image table, not per-object). Consider image 105 (ground truth for metaphase): there were 60 cells from image 105 in the training set, so the total cell count excluding the training set would be 8, and the sum of the scores of all classes for image 105 would be 8.

[screenshot: score_all (https://cloud.githubusercontent.com/assets/5438649/16414724/a709e18c-3d39-11e6-8322-01103362688a.png)]

jhung0 commented 8 years ago

What about just adding columns for the total training cell count and the training [class] cell count?
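
A rough sketch of how those columns could be computed, again with pandas and invented data; TrainingTotal and the Training_[class] column names are hypothetical:

```python
import pandas as pd

# Hypothetical training set: one row per training object.
training = pd.DataFrame({
    "ImageNumber": [105] * 60 + [106] * 5,
    "Class":       ["metaphase"] * 60 + ["anaphase"] * 5,
})

# Total training objects per image.
training_total = training.groupby("ImageNumber").size().rename("TrainingTotal")

# Training objects per image and class, one column per class.
training_per_class = (
    training.groupby(["ImageNumber", "Class"]).size()
            .unstack(fill_value=0)
            .add_prefix("Training_")
)

# These could then be joined onto the per-image "Score All" table on ImageNumber:
# score_table = score_table.join(training_total).join(training_per_class)
print(training_total)
print(training_per_class)
```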

AnneCarpenter commented 8 years ago

Ah, it's a per-image table; I get it now. Holger, this is not intended to be used for assessment, just for scoring all objects in all images. I honestly don't think we should add this feature you're requesting at all. I would discourage adding a button or other selection, which would complicate the decision-making and the interface. It's quite uncommon for the table to be used for assessing accuracy; it's normally used for producing scores for ALL cells because you want to know the answer. I think the use case you describe is just too rare to spend time on or complicate the code for.

AnneCarpenter commented 8 years ago

(and therefore I support Jane's suggestion, if Holger thinks it is still useful)

holgerhennig commented 8 years ago

Adding columns for the total training cell count and the training [class] cell count would be great!

jhung0 commented 8 years ago

Should be implemented in f13fb61.

holgerhennig commented 8 years ago

Test result: the implementation works great, thanks! However, usability could be improved; I'll file corresponding issues.