drivendataorg / concept-to-clinic

ALCF Concept to Clinic Challenge
https://concepttoclinic.drivendata.org/
MIT License

Feature: Implement identification algorithm #1

Closed: pjbull closed this issue 7 years ago

pjbull commented 7 years ago

Overview

We need to adapt the Data Science Bowl algorithms to produce possible centroid locations for nodules within an image rather than just P(cancer) for the whole image.

Expected Behavior

Currently, there is just a placeholder in the algorithm that identifies nodules in scans. Nodules are areas of interest that might be cancerous (or might not be; the goal here is simply to flag potentially concerning areas). This must actually yield centroid locations of potential nodules (X voxels from left, Y voxels from top, Z slice number).

First we need to train a model to perform this task. Then, we need to serialize the model so that it can be loaded from disk and used to make predictions. This trained model should be added to the prediction/src/algorithms/identify/assets/ folder using git-lfs. Finally, we need to write the code in the predict method that will load the model from assets, take in a DICOM image, and yield nodule locations in the specified format.
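For illustration, a minimal sketch of what that predict method could look like is below. The asset filename identify_model.pth, the run_detector helper, and the choice of SimpleITK for reading the DICOM series are assumptions for this sketch, not the final interface:

```python
import os

import SimpleITK as sitk  # assumed DICOM reader; the project may settle on a different one
import torch

ASSETS_DIR = os.path.join(os.path.dirname(__file__), 'assets')


def _load_scan(dicom_path):
    """Read a DICOM series from a directory into a (z, y, x) voxel array."""
    reader = sitk.ImageSeriesReader()
    reader.SetFileNames(reader.GetGDCMSeriesFileNames(dicom_path))
    return sitk.GetArrayFromImage(reader.Execute())


def predict(dicom_path):
    """Return candidate nodule centroids for the scan at `dicom_path`."""
    # 'identify_model.pth' is a hypothetical asset name; the real serialized
    # model would live in assets/ and be tracked with git-lfs.
    model = torch.load(os.path.join(ASSETS_DIR, 'identify_model.pth'),
                       map_location='cpu')

    voxels = _load_scan(dicom_path)

    # run_detector is a hypothetical stand-in for the model-specific inference
    # and post-processing that reduces raw detections to one voxel coordinate
    # per candidate nodule.
    detections = run_detector(model, voxels)

    # Emit the format described above: X voxels from left, Y voxels from top,
    # Z slice number.
    return [{'x': int(x), 'y': int(y), 'z': int(z)} for x, y, z in detections]
```

Keeping the model-specific inference behind a single helper should make it easier to swap in whichever Data Science Bowl algorithm ends up being adapted.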

Design doc reference: Jobs to be done > Detect and select > Prediction service

Technical details

Out of scope

This feature is a first-pass at getting a model that completes the task with the defined input and output. We are not yet judging the model based on its accuracy or computational performance.

Acceptance criteria

NOTE: All PRs must follow the standard PR checklist.

QuantumDamage commented 7 years ago

Hi @pjbull, is there any data set with already marked nodules which can be used for training? As far as I checked, the Data Science Bowl only provides per-scan cancer labels, not actual nodule annotations.

pjbull commented 7 years ago

@QuantumDamage Yes! Short answer is that LIDC-IDRI (and thus LUNA) has labeled nodules. We're going to add a documentation page for datasets so that we can keep everyone on the same page. Thanks for raising it!

QuantumDamage commented 7 years ago

Cool, LUNA even has nice ~7 GB data packages available via torrents. I just started downloading them to see what is actually inside.

RMolero commented 7 years ago

Hi guys, I really want to help but I'm not sure how. Any tips or topics where you need help?

reubano commented 7 years ago

Hi @RMolero. You may want to take a look at the issues we've labeled minor. Those should be more beginner-friendly :).

QuantumDamage commented 7 years ago

@RMolero You can also use the forum for general questions, so they will not get lost in issue-specific discussions.

For example, in this issue we are looking for a model that can be trained to point to the centroids of nodules. So in addition to a model that classifies CT images, we will also have a model that points to interesting places on the images for anyone to examine personally.

What do you think about it?

ask7 commented 7 years ago

@pjbull Are there any resources for model training, or should we simply use our local machines and provide the trained model file?

reubano commented 7 years ago

Hi @ask7, in general you can view the repos of the individual algorithms (#18 - #28) for instructions on training. There have also been a few PRs (#82, #99, #108) that document the algorithms as well. Once you have found an algorithm you would like to train, you can use an AWS P2 instance to access a GPU. I used this fast.ai lesson to get started initially. Hope that helps!

dchansen commented 7 years ago

I have adapted the grt123 code to do this and implemented it in my concept-to-clinic repo. The results are not perfect, but certainly much better than nothing. I am a little unclear what is required beyond providing the basic algorithm. Should I verify it on the LUNA dataset? I would also be unable to provide training details, as I am simply using the pretrained model.

Finally, pytorch and tensorflow do not play nice together, so I am currently having to disable the Keras imports in the classification algorithm.

reubano commented 7 years ago

@dchansen

I have adapted the grt123 code to do this and implemented it in my concept-to-clinic repo. The results are not perfect, but certainly much better than nothing.

That's great! At this stage we're not looking at performance. Just for a working implementation.

I am a little unclear what is required beyond providing the basic algorithm. Should I verify it on the LUNA dataset? I would also be unable to provide training details, as I am simply using the pretrained model.

There are essentially 3 requirements:

  1. the model / algorithm
  2. training details
  3. proof it works

(1) seems covered. For (2), you would need to provide the steps that should be followed to recreate your pretrained model; this may come from the grt123 readme or our own documentation. (3) is usually taken care of via new tests added to the repo. In this case, however, that isn't practical, since I don't think anyone wants a test suite that takes days to run :). Perhaps screenshots of your terminal output?
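(If a heavyweight end-to-end check did end up in the repo at some point, one option is to gate it behind a marker that only runs on demand, roughly like the sketch below. The marker name, import path, and sample data path are made up for illustration.)

```python
import pytest


@pytest.mark.slow  # run explicitly with: pytest -m slow
def test_identify_returns_centroids_on_luna_sample():
    # Hypothetical import and data paths, purely illustrative.
    from src.algorithms.identify.prediction import predict

    centroids = predict('/data/luna/sample_series')
    assert all({'x', 'y', 'z'} <= set(c) for c in centroids)
```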

Also, as noted in this question and my subsequent answer, the goals of the original challenge and this one are slightly different. Anything you can provide to show us that the original pretrained model is still capable of performing under these different working conditions would be very useful.

Finally, pytorch and tensorflow do not play nice together, so I am currently having to disable the Keras imports in the classification algorithm.

That's interesting! Thanks for pointing that out! Please be sure to include details such as this with (2) above. Also please let me know if you have any other questions or concerns!
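For what it's worth, one common way to soften that kind of framework conflict (not necessarily what was done here) is to defer the Keras import into the function that actually needs it, so that importing the identification code never loads TensorFlow at all. A rough sketch, with a made-up asset path:

```python
def classify(dicom_path, centroids):
    """Sketch only; the real signature lives in the classification algorithm."""
    # Imported lazily so that code which only needs the PyTorch-based
    # identification model never triggers a TensorFlow/Keras import.
    from keras.models import load_model

    model = load_model('assets/classify_model.h5')  # hypothetical asset path
    ...  # run the classification model on patches around the given centroids
```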

Serhiy-Shekhovtsov commented 7 years ago

Hi @dchansen, I am working on a related issue: adapting the grt123 model (#4). You said you have already done that; where can I check your code? Also, if the adaptation is done, I guess we can close #4.

dchansen commented 7 years ago

My code is available at https://github.com/dchansen/concept-to-clinic. Note that it covers only the first part (identification), not the second. I will try to iron out the last bugs and have a pull request ready tomorrow.