drivendataorg / concept-to-clinic

ALCF Concept to Clinic Challenge
https://concepttoclinic.drivendata.org/
MIT License

Create a template for algorithm documentation #58

Closed tdraebing closed 7 years ago

tdraebing commented 7 years ago

Expected Behavior

Future documentation of detection algorithms (Issues #18, #19, #20, #21, #22, #23, #24, #25, #26, #27 and #28) should have a consistent structure and make it easy to compare the different algorithms.

Current Behavior

The issues mentioned above ask for documentation of algorithms from the Data Science Bowl. If several people address them, the documentation of each algorithm is likely to end up inconsistent and messy, making it unnecessarily hard to read and compare algorithms. The advantage over just using the original documentation would then be minimal.

Possible Solution

Create a template file that specifies the sections to include and the content each should contain, in as much detail as possible. This issue thread is also meant as a place to discuss which information about each algorithm should be included in the documentation.

Possible Implementation

Add a template_algorithms.md file containing the template to the docs/template-folder.

tjvananne commented 7 years ago

TL;DR: competition algorithms predict if patient has cancer, not the probability that each nodule is cancerous, right?

It looks like most of these go through the same z-slice normalization, lung segmentation, and nodule detection/isolation steps. But then it's the final feature generation steps that might need to be refactored quite a bit to tackle the new problem statement of predicting the probability of each nodule being cancerous. Is that a fair statement?

A feature such as total_number_of_nodules (I made that up as an example) aggregated at the patient level might still be important at the nodule level, but it does seem to imply that the goal was slightly different. Interested to hear anyone else's thoughts on this.

If this is the consensus, then I think part of the documentation should include difficulty/feasibility of refactoring features and final ensemble models to predict at the nodule level, not patient level.

reubano commented 7 years ago

Tagging this as official

reubano commented 7 years ago

> TL;DR: competition algorithms predict if patient has cancer, not the probability that each nodule is cancerous, right?

Yes

> ... Is that a fair statement?

Mostly. I've noticed a few differences in some of the pre-processing. A good first step would be to just enumerate the individual steps and then (roughly) categorize the different ways of doing each step.
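As a rough illustration of that enumerate-then-categorize step, the sketch below groups algorithms by pipeline step and by the approach each one takes. The step names come from the earlier comment in this thread; the function name, the algorithm names, and the approach labels are hypothetical placeholders, not findings about any actual Data Science Bowl entry.

```python
from collections import defaultdict


def categorize_steps(algorithms):
    """Map each pipeline step to the approaches seen and the algorithms using them.

    `algorithms` is a dict: algorithm name -> {step name: approach label}.
    """
    table = defaultdict(lambda: defaultdict(list))
    for algorithm, steps in algorithms.items():
        for step, approach in steps.items():
            table[step][approach].append(algorithm)
    # Convert nested defaultdicts to plain dicts for easier printing/comparison.
    return {step: dict(approaches) for step, approaches in table.items()}


# Hypothetical placeholder data, for illustration only.
example = {
    "algorithm_a": {"normalization": "approach_x", "lung_segmentation": "approach_y"},
    "algorithm_b": {"normalization": "approach_x", "lung_segmentation": "approach_z"},
}
comparison = categorize_steps(example)
```

A table like `comparison` could then feed directly into a shared section of the documentation template, so that differences in pre-processing are visible at a glance.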

> ... the goal was slightly different.

Correct. Since the initial competition ended, our end user research led us to develop a system that operates on a more granular level (nodule). As you mentioned, the aggregate stats (total number of cancerous nodules, etc.) are still useful and should be fairly easy to calculate from the individual data.
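To make the aggregation concrete, here is a minimal sketch of rolling per-nodule predictions up to patient-level statistics. It assumes predictions arrive as `(patient_id, probability)` pairs and that a nodule counts as likely cancerous above a 0.5 threshold; the function name, the input format, and the threshold are all assumptions for illustration, not part of the project's actual code.

```python
from collections import defaultdict


def aggregate_by_patient(nodule_predictions, threshold=0.5):
    """Summarize per-nodule cancer probabilities for each patient.

    `nodule_predictions` is an iterable of (patient_id, probability) pairs.
    `threshold` is an assumed cutoff for calling a nodule likely cancerous.
    """
    grouped = defaultdict(list)
    for patient_id, probability in nodule_predictions:
        grouped[patient_id].append(probability)

    summary = {}
    for patient_id, probabilities in grouped.items():
        summary[patient_id] = {
            "total_number_of_nodules": len(probabilities),
            "num_likely_cancerous": sum(p >= threshold for p in probabilities),
            "max_probability": max(probabilities),
        }
    return summary


# Hypothetical predictions, for illustration only.
predictions = [("patient_1", 0.9), ("patient_1", 0.2), ("patient_2", 0.1)]
stats = aggregate_by_patient(predictions)
```

Going in this direction (nodule to patient) is a simple group-by, whereas recovering nodule-level predictions from a patient-level model generally requires refactoring the model itself.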

tdraebing commented 7 years ago

I created a pull request with a first draft of the template. If you have ideas on how to improve it, please feel free to comment on it or to add a commit directly.