CellProfiler / CellProfiler-Analyst

Open-source software for exploring and analyzing large, high-dimensional image-derived data.
http://cellprofileranalyst.org
Other
146 stars 73 forks source link

CPA does not remove rows (cells) with missing values in the database [critical] #190

Closed holgerhennig closed 8 years ago

holgerhennig commented 8 years ago

There is a switch in the properties file called “check_tables = yes” which is supposed to do this but it didn’t work. This is critical, otherwise CPA doesn’t work properly.

Recommended fix: Remove all rows (cells) with NULLs in the database. In addition, remove rows with area<1. Example db and properties file: https://www.dropbox.com/s/miz9jya3ywz300m/CellCycleJurkat_org.db?dl=0 https://www.dropbox.com/s/n4gwl53r6yhbvbc/CellCycleJurkat_MyExpt.properties?dl=0

Further info: Whenever a cell hasn’t been segmented correctly (or a feature has no value), the database contains NULLs or zeros for that cell. The NULLs and cells with zero area need to be removed before the machine learning.

jhung0 commented 8 years ago

3bdf26b is supposed to remove rows with missing/null/none data if you have check_tables=yes in the properties file

jhung0 commented 8 years ago

Is that enough? Is it necessary to check the area value?

holgerhennig commented 8 years ago

yes, pls check cell area value - cells with area=0 are artefacts. In our imaging flow cytometry data sets this can occur very frequently (because of the way we construct the image montages), so these rows with area=0 need to be removed for CPA to work properly

braymp commented 8 years ago

Some caution is needed here. For area, a 0-value is not good, but for other measurements, a 0 may be perfectly appropriate. Such measurements should not be removed as a one-size-fits-all solution.

holgerhennig commented 8 years ago

Thanks, Mark for clarifying, I agree- only delete rows with *AreaShape_Area=0, a different feature being zero can be perfectly fine

jhung0 commented 8 years ago

4a30d91 deletes rows with AreaShape_Area = 0

holgerhennig commented 8 years ago

@jhung0 @daviddao Just tested issue 190 with today's nightly build. The issue still remains, unfortunately. To reproduce, take the database and properties file in issue #190 (rename database to CellCycleJurkat.db) and fetch a random cell from image 1.

holgerhennig commented 8 years ago

here are the images (that belong to properties and db file): https://www.dropbox.com/s/quo38z7kabf18yo/Step2_output_tiled_tifs.zip?dl=0

jhung0 commented 8 years ago

Sorry, what does fetching a random cell from image 1 have to do with this? There's another issue with fetching a random cell from an image.

jhung0 commented 8 years ago

So don't choose image, choose new filter and choose the image number that way

holgerhennig commented 8 years ago

@jhung0 @daviddao @AnneCarpenter

Hi Jane- Yes, I did create a filter, chose image #1 and then picked a random cell from that image.

The point is that rows with NULLs are not deleted from the database (in CPAs nightly build), so this issue hasn't been fixed! This issue is critical for using CPA when you have many rows with NULLs in the database (like we do in our Methods paper due July 1..). Sorry for the rush and thanks for looking into this.

Here's a step-by-step description on how to reproduce the bug.

  1. download CPAs nightly build from CP website (I'm using Mac)
  2. start CPA using the database, properties file and images linked in this issue (rename database to CellCycleJurkat.db, you may have to adjust the paths) and go to classifier window in CPA
  3. create new filter with ImageNumber=1, name the filter "ana" (or any other name)
  4. fetch 1 random cell from image ana
  5. clicking on the cell icon results in error: can't find coordinates for object 181 in image 1 (screenshot attached)
  6. look into per object table in database (CellCycleJurkat.db). I'm using sqlitebrowser but any sqlite db viewer will do. Object 181 in image 1 in database has NULLs. Screenshot attached. Thus, CPA didn't remove objects (cells) with NULLs.

My guess is that this bug will also show up with other data sets containing NULLs. I'd highly recommend to test it as well with your own databases at hand (that contain NULLs or where the feature *AreaShape_Area=0). Thanks, best Holger

error_msg database_sqlitebrowser
jhung0 commented 8 years ago

So the way I implemented it, you wouldn't be able to see the removed rows. The rows would just be removed automatically when you score. Thus the per object table would not be altered. On Jun 26, 2016 21:26, "Holger Hennig" notifications@github.com wrote:

@jhung0 https://github.com/jhung0 @daviddao https://github.com/daviddao @AnneCarpenter https://github.com/AnneCarpenter

Hi Jane- Yes, I did create a filter, chose image #1 https://github.com/CellProfiler/CellProfiler-Analyst/issues/1 and then picked a random cell from that image.

The point is that rows with NULLs are not deleted from the database (in CPAs nightly build), so this issue hasn't been fixed! This issue is critical for using CPA when you have many rows with NULLs in the database (like we do in our Methods paper due July 1..). Sorry for the rush and thanks for looking into this.

Here's a step-by-step description on how to reproduce the bug.

  1. download CPAs nightly build from CP website (I'm using Mac)
  2. start CPA using the database, properties file and images linked in this issue (rename database to CellCycleJurkat.db, you may have to adjust the paths) and go to classifier window in CPA
  3. create new filter with ImageNumber=1, name the filter "ana" (or any other name)
  4. fetch 1 random cell from image ana
  5. clicking on the cell icon results in error: can't find coordinates for object 181 in image 1 (screenshot attached)
  6. look into per object table in database (CellCycleJurkat.db). I'm using sqlitebrowser but any sqlite db viewer will do. Object 181 in image 1 in database has NULLs. Screenshot attached. Thus, CPA didn't remove objects (cells) with NULLs.

My guess is that this bug will also show up with other data sets containing NULLs. I'd highly recommend to test it as well with your own databases at hand (that contain NULLs or where the feature *AreaShape_Area=0). Thanks, best Holger

[image: error_msg] https://cloud.githubusercontent.com/assets/5438649/16362393/31ae00c6-3bae-11e6-9af7-dcbdd4951b4e.png [image: database_sqlitebrowser] https://cloud.githubusercontent.com/assets/5438649/16362412/8170ffaa-3bae-11e6-8d52-294e3de3ad77.png

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CellProfiler/CellProfiler-Analyst/issues/190#issuecomment-228601331, or mute the thread https://github.com/notifications/unsubscribe/AJJbgnWKRjTS3Yve85B-W-KuAFNS36zZks5qPn4bgaJpZM4I1mqp .

holgerhennig commented 8 years ago

Click anywhere on the image in the CPA screenshot and the same error as in the screenshot (with always the same image number) occurs, no matter where you click, even if you click directly on a cell. So you can't construct a training set for image 1. All anaphase cells in our data set are in image 1, so without it we can't construct the training set for anaphase cells. Same issue for telophase and metaphase.

I'd recommend to first clean the data set and then construct training and test set. Then you don't need to navigate around NULLs when constructing the training set.

holgerhennig commented 8 years ago

Current status: Issue 190 is not fixed, we still rely on a sql hack to remove objects with NULLs (or with *AreaShape_Area=0) in the db before starting CPA. Otherwise the training set cannot be constructed in our data set. I haven't found a workaround from within CPA. This issue needs to be fixed for a user-friendly Imaging Flow Cytometry workflow to be functional

AnneCarpenter commented 8 years ago

I think I see what Holger means - let me see if I follow: you're asking that objects meeting your criteria are actually deleted completely from the per-object database? (or that a new database is created with those removed, right?)

It sounds like Jane has built a workaround for one particular function - it ignores objects meeting the criteria when scoring or something. But that sounds like it doesn't fix Holger's problem, and I can imagine it also poses problems for other data visualization tools that go looking for nonsensical data that the user has requested not to see.

Did I capture that correctly, @holgerhennig ? And if so, is it clear enough to you now, @jhung0 ?

holgerhennig commented 8 years ago

yes, that's correct. One could (i) create a new database with the rows with NULLs deleted or (ii) create an index list of objects that are not NULL (or area=0) and work with the index list from then on. I'd suggest the latter but Jane knows better what approach is best in CPA. Either way only objects that have no NULLs are pulled up (for training, scoring, visualizing,...)

AnneCarpenter commented 8 years ago

I see - I think (ii) is problematic because then every function in CPA has to know about the index list and deal with it appropriately which is way more investment than it's worth. Jane, do you see any problems with going with (i)?

jhung0 commented 8 years ago

Yes my latest commit is supposed to do (i) (creating a new table in the database).

On Mon, Jun 27, 2016 at 10:31 PM, Anne Carpenter notifications@github.com wrote:

I see - I think (ii) is problematic because then every function in CPA has to know about the index list and deal with it appropriately which is way more investment than it's worth. Jane, do you see any problems with going with (i)?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CellProfiler/CellProfiler-Analyst/issues/190#issuecomment-228762861, or mute the thread https://github.com/notifications/unsubscribe/AJJbgp4fBVj3TH7dW9WWE2ChEqVNxpC4ks5qP97OgaJpZM4I1mqp .

holgerhennig commented 8 years ago

test result: implementation works like a charm, thanks!

holgerhennig commented 8 years ago

issue can be closed from my side