DigitalSlideArchive / HistomicsTK

A Python toolkit for pathology image analysis algorithms.
https://digitalslidearchive.github.io/HistomicsTK/
Apache License 2.0

Initial feature set - implementation #66

Closed by cooperlab 8 years ago

cdeepakroy commented 8 years ago

@cooperlab can we assign this to @slee172 while you work on nuclei segmentation?

cdeepakroy commented 8 years ago

Seems like we have to port FeatureExtraction.m in the ActiveLearning repo to python. Scikit-image has support for regionprops, Haralick features based on the gray-level cooccurence matrix (GLCM), and a few other features. We can use the pandas.DataFrame to store the features.

cdeepakroy commented 8 years ago

It would be nice to implement the ExtractFeatures function in such a way that one can tell it which features to compute rather than computing all of them. Then, in the front-end website we can have a way to select features/feature-groups via check boxes.

cooperlab commented 8 years ago

Great idea - assign away.

@slee172 capture the easy features first. Let's just dump all feature definitions into a single function for now. Long term we need to group these so that users can select subsets for computation.

cooperlab commented 8 years ago

@cdeepakroy definitely. We need to find the right granularity so that we don't overwhelm the users. Shape is an obvious sub-category. Beyond that things get harder to categorize nicely.

cdeepakroy commented 8 years ago

We can use the same strategy as graycoprops in scikit-image to allow the user of the feature-extraction function to explicitly specify the features/feature-groups they want to compute.
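A rough sketch of that graycoprops-style interface (the function and dict names here are our own, not an actual HistomicsTK API): the caller names the features to compute, and nothing else gets calculated.

```python
import numpy as np
import pandas as pd
from skimage.measure import regionprops

# Hypothetical registry mapping feature names to per-region computations.
FEATURE_FUNCS = {
    'area': lambda rp: rp.area,
    'eccentricity': lambda rp: rp.eccentricity,
    'solidity': lambda rp: rp.solidity,
    'mean_intensity': lambda rp: rp.mean_intensity,
}

def extract_features(im_label, im_intensity, features=('area', 'eccentricity')):
    """Compute only the requested features, graycoprops-style."""
    unknown = set(features) - set(FEATURE_FUNCS)
    if unknown:
        raise ValueError('unknown features: %s' % sorted(unknown))
    rows = [{name: FEATURE_FUNCS[name](rp) for name in features}
            for rp in regionprops(im_label, intensity_image=im_intensity)]
    return pd.DataFrame(rows, columns=list(features))

# toy usage: a single 3x3 square object
im_label = np.zeros((10, 10), dtype=int)
im_label[2:5, 2:5] = 1
df = extract_features(im_label, im_label.astype(float), features=('area',))
```

Feature groups would then just be named lists of keys into the same registry.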

cdeepakroy commented 8 years ago

For the GUI, below is what ilastik seems to be doing:

(screenshot: ilastik's feature selection dialog)

cooperlab commented 8 years ago

I looked at this - I think we can go with similar groupings for 'shape', 'intensity', 'edge', 'texture', and a category for 'special' to accommodate things that don't map nicely to the other four groups.

It may be better to allow users to define parameter ranges in textboxes, or as a comma-separated list of discrete values, etc.
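Parsing such a textbox entry is straightforward; here is a minimal sketch, assuming two hypothetical input formats (a comma-separated list and a colon-separated start:stop:step range):

```python
def parse_values(text):
    """Parse a textbox entry like '1, 2, 5' or '0.5:2.0:0.5' (start:stop:step)
    into a list of floats. Both formats are assumptions about the GUI."""
    text = text.strip()
    if ':' in text:
        start, stop, step = (float(t) for t in text.split(':'))
        values = []
        v = start
        while v <= stop + 1e-9:  # tolerance for float accumulation
            values.append(round(v, 10))
            v += step
        return values
    return [float(t) for t in text.split(',') if t.strip()]
```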

slee172 commented 8 years ago

@cdeepakroy As you said, scikit-image supports regionprops: https://github.com/scikit-image/scikit-image/blob/master/skimage/measure/_regionprops.py#L348

I just tested it as below:

```python
from skimage import data
from skimage.feature import canny
from skimage.measure import label, regionprops

# read sample image
image = data.coins()

# detect edges with a Canny filter
edges = canny(image, sigma=3, low_threshold=10, high_threshold=80)

# label connected components of the edge map
label_image = label(edges)

# extract per-region feature information
props = regionprops(label_image)
print(props[0].centroid)
```

@cooperlab I think the next step is to read tile images from an svs file and extract features from them. What do you think about using ReinhardSample.py to get the tile images?

As for pandas, it would be a good way of dealing with those features. Since pandas provides HDFStore and read_hdf, we could handle the features once we store them in HDF5. But for now, how about handling the features directly, without HDF5? I think some issues should be solved before using HDF5.

cooperlab commented 8 years ago

@slee172 - for now just assume that someone is providing a label image. You are working backwards from the feature extraction, I am working forwards from the segmentation, and we will meet in the middle.

Please test performance for reading/writing large HDF5s. I have heard complaints that this is slow in Python. Each slide will contain up to ~1M objects with 40-200+ features each. Data should be organized in the HDF5 so that we can access the features from individual tiles without reading the entire file into memory.
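One way to get that tile-level access with pandas is HDFStore's 'table' format (a sketch, assuming PyTables is installed; the key and column names are illustrative). Marking tile_id as a data column makes it queryable on disk, so a reader can pull one tile's rows without scanning the whole file into memory.

```python
import os
import tempfile
import numpy as np
import pandas as pd

# write per-tile feature batches into one queryable HDF5 table
path = os.path.join(tempfile.mkdtemp(), 'features.h5')
rng = np.random.default_rng(0)
with pd.HDFStore(path, mode='w') as store:
    for tile_id in range(4):
        df = pd.DataFrame(rng.normal(size=(1000, 5)),
                          columns=['f%d' % i for i in range(5)])
        df['tile_id'] = tile_id
        store.append('features', df, format='table',
                     data_columns=['tile_id'], index=False)

# read back only tile 2 -- an on-disk query, not a full scan into memory
tile2 = pd.read_hdf(path, 'features', where='tile_id == 2')
```

Benchmarking this against ~1M rows would answer the speed concern directly.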

cdeepakroy commented 8 years ago

@slee172 As Lee says, for now work on implementing the feature-extraction function that takes as input a label map, an intensity image, and any other parameters; computes the features; and produces as output a pandas DataFrame (you can include centroid/location in it - we can selectively exclude those columns while doing classification).

Regarding the performance concerns, we will have to benchmark different options and pick one. Maybe call the function you are about to implement to compute features per tile, take the pandas DataFrame it returns, add columns indicating the tile id/info, and post it to a table in the database.

@zachmullen and @brianhelba do you guys have any insights on how best to store this data into girder, so we can query them quickly without having to load the whole dataframe/database-table (of nuclei features) into memory?

cooperlab commented 8 years ago

@cdeepakroy I would be skeptical about stashing feature data in a database but am open to hearing about this. Will probably be very slow and results may be discarded frequently. Will be easy to test though with some random matrices.

cooperlab commented 8 years ago

@cdeepakroy Also keep in mind that in the future we may be doing regional analysis including classification. Here features are defined over a dense grid on the image instead of sparsely over objects in the slide. Not trying to get ahead of ourselves, but I don't see those types of features being very compatible with a DB since they are essentially images themselves.

zachmullen commented 8 years ago

@cdeepakroy I share @cooperlab's concern, depending on the size of the dataframe; MongoDB (sensibly) limits documents to <=16MB in size, so anything that doesn't fit in that should be stored as a file.
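A quick back-of-envelope check using the numbers quoted earlier in the thread (~1M objects, 40-200 features each, assuming float64 storage) shows why the 16 MB cap rules out stuffing the whole matrix into one document:

```python
# size of a per-slide feature matrix vs MongoDB's 16 MB document cap
objects = 1_000_000
sizes_mb = {n: objects * n * 8 / 2**20 for n in (40, 200)}  # 8 bytes/float64
# even the low end (~305 MB for 40 features) is ~19x over the 16 MB limit,
# so the feature matrix belongs in a file, with only metadata in the DB
```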

The real question is what information we want to index on. Whatever that ends up being, that's what we should put into the database so we can filter by it.

cooperlab commented 8 years ago

@zachmullen We clearly want to index on location. We may want to index on the values of the features themselves though too in some cases. @dgutman Any thoughts on this?

zachmullen commented 8 years ago

Location should be quite straightforward, and hopefully other features will be as well. Dense/raster data should be stored as files, and the documents in the database should just contain file IDs.
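That split might look something like the following per-object index document (a sketch only; the field names are our guesses, not an actual girder schema). The dense feature data lives in an HDF5 file, and the document carries only the file reference plus the fields we want to filter on, such as location.

```python
# hypothetical index document; all field names are illustrative assumptions
doc = {
    'fileId': 'abc123',              # girder file holding the HDF5
    'tile': {'x': 4096, 'y': 2048},  # tile origin, for spatial queries
    'centroid': [4210.5, 2133.0],    # object location within the slide
    'area': 512.0,                   # scalar feature promoted for indexing
}
```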

cooperlab commented 8 years ago

OK. I thought of HDF5 originally as kind of a compromise - it's a file but you can organize it for fast spatial access. That wouldn't support fast feature "indexing" though.

zachmullen commented 8 years ago

HDF5 is a good candidate as the data format to store the file, and we could write special plugin code to get data out of it if we need to expose spatial indexing, and then we can use the database to add any external indexes that we needed. Of course, at that point we'd want to make sure that the file data was immutable, lest we run into data synchronization issues.

zachmullen commented 8 years ago

(netCDF has Python bindings, so using it within a girder plugin would be straightforward)

cooperlab commented 8 years ago

It will be immutable. Most datasets will fall into two categories - those that we host for an analysis or external resource (permanent) or those that are generated for development/tuning of algorithms (temporary). Neither would be modified, but insertion/generation needs to be fast for the second case.

slee172 commented 8 years ago

@cdeepakroy A function for basic feature extraction is implemented. The function reads a label image, a grayscale image (as an example), and some parameters, and returns a pandas DataFrame.

cooperlab commented 8 years ago

@slee172 I did some work on the boundary cleanup and other misc. things we will need to finish the nuclear segmentation pipeline. My goal is to do a PR by Weds that should capture the segmentation and utility functions.

slee172 commented 8 years ago

@cooperlab sounds good. Before then, I will add other features to the pandas DataFrame.

slee172 commented 8 years ago

@cooperlab Generating the group of FSD features from labels requires some helper functions such as "GetBounds", "PixIndex", "InterpolateArcLength", and so on. I currently have those functions in a single python file. Do you have any thoughts on how to upload those functions here? One straightforward idea is to push them separately as different files.

cooperlab commented 8 years ago

Only add them as individual files if they will be utilized by other HistomicsTK functions (doubtful). Otherwise add them as helper functions inside of another file.
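That layout might look like the sketch below: a private arc-length resampling helper (standing in for the "InterpolateArcLength" mentioned above; the implementation is our own rough version, not the ActiveLearning code) kept inside the module that exposes the public FSD function.

```python
import numpy as np

def _interpolate_arc_length(x, y, n=128):
    """Resample a closed boundary to n points equally spaced in arc length.
    (A stand-in for the InterpolateArcLength helper; kept private.)"""
    x = np.append(x, x[0])  # close the boundary so arc length wraps
    y = np.append(y, y[0])
    d = np.hypot(np.diff(x), np.diff(y))
    s = np.concatenate([[0.0], np.cumsum(d)])
    t = np.linspace(0, s[-1], n, endpoint=False)
    return np.interp(t, s, x), np.interp(t, s, y)

def fourier_shape_descriptors(x, y, n=128):
    """Rough FSD sketch: FFT of the boundary as complex samples, with
    magnitudes normalized for translation and scale invariance."""
    xi, yi = _interpolate_arc_length(x, y, n)
    coeffs = np.fft.fft(xi + 1j * yi)
    mags = np.abs(coeffs[1:])           # drop DC term -> translation invariant
    return mags / (mags.max() + 1e-12)  # normalize -> scale invariant

# toy usage: descriptors of a unit circle (energy concentrates in coeff 1)
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
fsd = fourier_shape_descriptors(np.cos(theta), np.sin(theta))
```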


slee172 commented 8 years ago

Adding them as helper functions could be better.

cooperlab commented 8 years ago

I think we have a mature initial feature set now and can close this issue.