caseyfitz opened this issue 6 years ago (status: Open)
@caseyfitz did you carefully read the code of `trained_model.calculate_volume`? This code treats centroids as connected components.
Ah, thanks @vessemer! I thought the functionality was clear to me, but I must have been confused by the fact that `labels = [mask[centroid['x'], centroid['y'], centroid['z']] for centroid in centroids]` was returning `[1 1 1 1 1 1]` for the six centroids I was passing it (for LIDC-0003). I didn't realize that `scipy.ndimage.label` has a default `structure` parameter representing squared connectivity, which should be sufficient for this stage of the project. (See the sketch after this comment for how that default behaves.)

The problem, then, seems to be that the image has only one connected component, yes? If so, then item 2 in the issue statement should be good to go for now (in which case I'll edit the issue), and the immediate problems are just those under item 1.

Make sense?
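A minimal sketch of that default behavior (a toy 2-D example for illustration, not project code; with the default `structure`, diagonal neighbors are not connected):

```python
import numpy as np
from scipy import ndimage

# Toy binary mask: two blobs that touch only diagonally.
mask = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 1, 1],
                 [0, 0, 0, 1]])

# Default structure: squared connectivity of one (no diagonals),
# so the two blobs come out as separate components.
labeled, num = ndimage.label(mask)
print(num)  # 2

# With full connectivity (diagonals included), they merge into one.
labeled_full, num_full = ndimage.label(mask, structure=np.ones((3, 3)))
print(num_full)  # 1
```

Indexing the labeled array at each centroid's coordinates, as in the list comprehension quoted above, then returns that centroid's component label.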
Yes, sure. I'll add some comments to `trained_model.calculate_volume` with my next commit, since it's a bit obscure :)
@caseyfitz Are you planning to merge the changes you made in your branch into master at some point? And by the way: nice notebook! :)
Original issue description (caseyfitz):

After exploring the segmentation code under `prediction/src/algorithms/segment/`, we have identified a few outstanding issues related to the segmentation functionality and volume calculations. These issues are all interrelated, but we've tried to divide them into two general categories (whose code paths start in `segment/trained_model.py`):

**1. Model architecture / complexity (`trained_model.predict`)**

- The `.npy` mask saved to `segment_path` should not have 1024 slices. Most slices after 200 are uniform, for example in `LIDC-IDRI-0003` with value `0.45197698` and an overall range of around `-0.35` to `0.8` (a quick check for this is sketched below).
- `simple_3d_model.py` and `unet_3d_model.py` each use the same `best_model_Simple3DModel` and make identical predictions. However, the full unet can only process some of the full-size test images without throwing a `MemoryError`.
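A quick way to verify the uniform-slice claim above (a sketch only; it assumes the saved mask loads as a 3-D array ordered `(slice, height, width)` and that `segment_path` points at the `.npy` file mentioned above):

```python
import numpy as np

mask = np.load(segment_path)  # e.g. the mask saved for LIDC-IDRI-0003
print(mask.shape, mask.min(), mask.max())

# Per-slice value range; a range of zero means the slice is uniform.
slice_range = mask.max(axis=(1, 2)) - mask.min(axis=(1, 2))
uniform = np.where(slice_range == 0)[0]
print(len(uniform), "uniform slices, starting at index",
      uniform[0] if len(uniform) else None)
```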
**2. Nodule volume calculation (`trained_model.calculate_volume`)**

- `numpy.bincount`, which calculates nodule volumes by summing non-zero values in the binary mask saved as `lung-mask.npy`, does not use centroid information and merely sums non-zero values in the scan, yielding a (poor) total volume rather than the distinct volumes of the individual centroids in `centroids`. One negative impact of this is that for `n` centroids, the predicted volume is just this total volume, repeated `n` times.
- Alternative approaches explored (`scipy.spatial.ConvexHull`, `skimage.morphology.convex_hull_image`) are either too memory intensive or only work with 2-D arrays. Plus, it's not clear that a standard convex hull approach would be best anyway, since we aren't interested in the entire lungs but in subsets of them (perhaps something like `skimage.morphology.convex_hull_object`, but this only works on 2-D arrays).
- Ideally, a new version of `trained_model.calculate_volume` would take a list of centroids as input and calculate, e.g., 3-D connected components given those centroids (a possible shape for this is sketched after this list).
- Due to the current state of `Simple3DModel`, masking of nodules does not perform well, and it's possible that there is essentially one large connected component spanning ~200 slices.
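One possible shape for that centroid-aware version (a sketch only, not project code; it assumes a 3-D binary mask and centroids given as dicts with integer voxel indices `'x'`, `'y'`, `'z'`, as in the snippet quoted in the comments above):

```python
import numpy as np
from scipy import ndimage

def volumes_by_centroid(mask, centroids):
    """Sketch: per-centroid nodule volumes via 3-D connected components."""
    # Label 3-D connected components (default: squared connectivity).
    labeled, num_components = ndimage.label(mask)

    # Voxel count per component label; index 0 is background.
    counts = np.bincount(labeled.ravel())

    volumes = []
    for c in centroids:
        label = labeled[c['x'], c['y'], c['z']]
        # A centroid that falls on background gets volume 0.
        volumes.append(int(counts[label]) if label != 0 else 0)
    return volumes
```

This keeps the cheap `numpy.bincount` counting, but counts per component and reads off one volume per centroid instead of repeating a single total.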
The approach to exploring these issues has been to use an interactive Jupyter notebook, rooted in the `prediction` directory of the application. From there, one can use `from src.algorithms.segment.trained_model import predict` to start playing with the outputs directly and testing changes on the fly. (Pro tip: use the magics `%load_ext autoreload` and `%autoreload 2` to autoreload the functions with your changes every time you call them; see the sketch below.)

And as always, please update the documentation with any new changes for easy points! (The segment predict docs are pretty weak right now.)
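A minimal notebook session along those lines might look like this (the magics are standard IPython; the `predict` call itself is omitted since its exact signature isn't shown here):

```python
# Run from a Jupyter notebook started in the `prediction` directory.
%load_ext autoreload
%autoreload 2  # re-import modules on every call, picking up local edits

from src.algorithms.segment.trained_model import predict

# predict(...) can now be re-run after each code change
# without restarting the kernel.
```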