drivendataorg / concept-to-clinic

ALCF Concept to Clinic Challenge
https://concepttoclinic.drivendata.org/
MIT License

Feature: Implement classification algorithm #2

Closed reubano closed 7 years ago

reubano commented 7 years ago

Overview

Currently, there is just a placeholder for the algorithm that classifies nodules in scans. Nodules are areas of interest that might be cancerous. We need to adapt the Data Science Bowl algorithms to predict P(cancer) for a given set of nodule centroids.

Expected Behavior

Given a model trained to perform this task, a DICOM image, and the nodule centroids, return the P(cancer) for each nodule.

Design doc reference: Jobs to be done > Annotate > Prediction service
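A minimal sketch of what this expected behavior could look like (the function name, argument shapes, and the `p_concerning` key are illustrative assumptions, not the project's actual API):

```python
# Hypothetical interface sketch for the classification step.
from typing import List, Dict

def predict(dicom_path: str, centroids: List[Dict]) -> List[Dict]:
    """Return P(cancer) for each nodule centroid in a scan.

    dicom_path: path to a directory of DICOM slices for one scan.
    centroids: list of dicts like {'x': ..., 'y': ..., 'z': ...} in voxel coords.

    Returns a list parallel to `centroids`, e.g. [{'p_concerning': 0.87}, ...].
    """
    # Placeholder: a real implementation would load the scan, crop a patch
    # around each centroid, and run the trained classifier on each patch.
    return [{'p_concerning': 0.5} for _ in centroids]
```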

Technical details

Out of scope

This feature is a first-pass at getting a model that completes the task with the defined input and output. We are not yet judging the model based on its accuracy or computational performance.

Acceptance criteria

NOTE: All PRs must follow the standard PR checklist.

reiinakano commented 7 years ago

Not sure if this is a dumb question, but given that there will probably be multiple models in the end, how do you intend to separate them? Would it make sense to further divide prediction/classify into a subfolder for each model? e.g. prediction/classify/vgg would contain prediction/classify/vgg/src/, prediction/classify/vgg/trained_model/predict/, and prediction/classify/vgg/assets/

Question applicable to #1 and #3

isms commented 7 years ago

@reiinakano Not a dumb question at all, and in fact having a "model zoo" was considered. It's not out of the question as the project moves forward, but until we have something to build on, the current goal is just to have a best-available frozen model (plus its preprocessing and training pipeline) for each task.

reiinakano commented 7 years ago

Thanks for the answer. My concern is that you're considering a lot of different algorithms from the get-go (as seen in #18 - #28). What happens when multiple people decide to work on different algorithms? Personally, I don't think it would be a terrible idea to separate them into subpackages from the start to facilitate parallel development.

As long as they have the same interface, switching to the best-available model is a matter of changing the subpackage name in calls. Also, if the best model moving forward is an ensemble of the algorithms (as is virtually always the case), they're all already nicely separated for easy integration into an ensemble.
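As an illustration of this point (the package layout and function names here are hypothetical), a shared `predict` interface would make model selection a string change, and an ensemble a simple average:

```python
# Sketch: each model lives in its own subpackage exposing the same interface,
# e.g. prediction.classify.vgg.predict(dicom_path, centroids).
import importlib

def load_classifier(model_name):
    """Import prediction.classify.<model_name> and return its predict function."""
    module = importlib.import_module('prediction.classify.{}'.format(model_name))
    return module.predict

def ensemble_predict(model_names, dicom_path, centroids):
    """Average P(cancer) across several models sharing the same interface."""
    all_preds = [load_classifier(name)(dicom_path, centroids)
                 for name in model_names]
    # Average the per-centroid probabilities across models.
    return [sum(p['p_concerning'] for p in preds) / len(preds)
            for preds in zip(*all_preds)]
```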

isms commented 7 years ago

Agreed that could be a useful refactoring once the pain point exists. If you're interested in working on this bit of architecture I'd encourage you to keep alert for the right time (could be soon) and propose the change as a refactoring; at the moment it's premature.

vessemer commented 7 years ago

As was already mentioned, different architectures require different preprocessing methods. Would it be a good decision for the feature's input to also include information about the preprocessing function? It might be better to open a new issue dedicated to the preprocessing step; that way, a model could be fed along with a dictionary of parameters used to create a preprocessing instance. I haven't opened the issue myself because I'm not confident about what the expected behaviour should look like. Nevertheless, I'll be glad to implement such features. One more question: which backend will be included in the requirements?

isms commented 7 years ago

@vessemer Thanks for the thinking around this. I have a couple questions just to make sure I understand!

different architectures require different preprocessing methods. Would it be a good decision for the feature's input to also include information about the preprocessing function?

You mean should we be passing more information from the interface part back to the prediction part? Or something else?

One more question I have: which backend will be included in requirements?

Sorry, not sure I understand the question?

And it seems like you probably already grabbed this, but as a general observation, our thought is that we'd like to first see one solid implementation even if it will need to be refactored to generalize later on. We're afraid of being "architecture astronauts" and making things too complicated before everyone has a common point of reference.

vessemer commented 7 years ago

@isms thanks for the quick response. I mean that prior to any model evaluation some preprocessing should take place over the raw DICOM data. I agree with your concern regarding cumbersome architecture at the starting point.

Given a model trained to perform this task

Right now it is not clear to me what is meant by the 'model': whether it is a model object from some ML backend such as TensorFlow or Theano, or some higher-level entity with its own preprocessing method.

reubano commented 7 years ago

I mean that prior to any model evaluation some preprocessing should take place over the raw DICOM data.

Hi @vessemer, I see you've already added some preprocessing logic in your PR, but #58 may give you a bit more context. Essentially, each competition implementation has its own set of preprocessing steps. During the integration of any individual implementation, its associated preprocessing (and other) steps should be decoupled so that they work together as modular pieces of a functional whole. As @isms said, in the future, we may go down the path of mixing and matching the best pieces of various implementations and/or allow for an ensemble of implementations to work together.
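One way to picture such decoupling (the step names and Hounsfield windowing values below are illustrative assumptions, not the project's code): express each implementation's preprocessing as small steps sharing one signature, then compose them per implementation:

```python
# Sketch of modular preprocessing steps that can be mixed and matched.
import numpy as np

def clip_hu(volume, lo=-1000, hi=400):
    """Clip Hounsfield units to a lung/soft-tissue window."""
    return np.clip(volume, lo, hi)

def zero_mean(volume):
    """Normalize to zero mean, unit variance."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)

def compose(*steps):
    """Chain preprocessing steps left to right into one pipeline."""
    def pipeline(volume):
        for step in steps:
            volume = step(volume)
        return volume
    return pipeline

# e.g. one implementation's pipeline, another could reuse only clip_hu:
preprocess = compose(clip_hu, zero_mean)
```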

Right now it is not clear to me what is meant by the 'model': whether it is a model object from some ML backend such as TensorFlow or Theano, or some higher-level entity with its own preprocessing method.

You're on the right track with your initial thought. Serialized models will live in the assets folder of their respective step (classify, identify, and segment). In this case it will be algorithms/classify/assets/. And of course, the use of TensorFlow/Theano/etc. will be up to individual implementation.
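For illustration (the filename and the Keras loader in the comment are assumptions; the actual serialization format depends on the backend an implementation chooses), resolving a serialized model inside the assets folder might look like:

```python
# Sketch: locate a frozen model inside the classify step's assets folder.
import os

ASSETS_DIR = os.path.join('algorithms', 'classify', 'assets')

def model_path(filename):
    """Resolve a serialized model file inside the classify assets folder."""
    return os.path.join(ASSETS_DIR, filename)

# e.g., a Keras/TensorFlow implementation might then do:
#   from keras.models import load_model
#   model = load_model(model_path('classify_model.h5'))
```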

vessemer commented 7 years ago

Hi @reubano, as it was mentioned in the technical details from the issue #2:

This feature should be implemented in the prediction/classify/trained_model/predict method. Code to train the model should live in the prediction/classify/src/ folder.
A fully serialized version of the model that can be loaded should live in the prediction/classify/assets/ folder using git-lfs.

According to these paths, classify lives directly under prediction/ rather than under prediction/src/. Is there a typo in the issue, or is it now outdated?

reubano commented 7 years ago

@vessemer thanks for pointing that out! The file structure has changed since this issue was first created. The new base path is prediction/src/algorithms/classify/.

vessemer commented 7 years ago

Essentially, each competition implementation has its own set of preprocessing steps. During the integration of any individual implementation, its associated preprocessing (and other) steps should be decoupled so that they work together as modular pieces of a functional whole. As @isms said, in the future, we may go down the path of mixing and matching the best pieces of various implementations and/or allow for an ensemble of implementations to work together.

@reubano, I've noticed that most algorithms aimed at classifying nodule candidates share similar preprocessing pipelines. In general, these pipelines can be separated into two stages: global (lung-level) and local (nodule-level) operations. It's important to decouple these stages; for example, it is neither necessary nor consistent to apply zero-mean normalization at the local stage. Moreover, at the local stage it is common to apply augmentation on the fly (even at prediction time). I've therefore introduced a preprocessing class which operates at the global level, with the option to apply it while reading the medical files. As for the data generators at the local level, it's also quite common to apply simple spatial augmentation and then resize the patch to the desired shape. These transformations may vary considerably from model to model, so I think it would be better to wrap each model in a class with standardized functionality, so that we can build them once with a desired setup, serialize them, and use them whenever we want.

lamby commented 7 years ago

Merged in #76