[PROJECT] Disease prediction using MIPHA

Problem Statement

The purpose of this research project is to predict diseases using several sources of data, such as laboratory tests and ECGs. The model should be flexible and modular, so that its parts can be reused even when not all data sources are available.

The experiments conducted to reach this goal will rely on the MIPHA framework. Its main features are as follows:

[A] Flexible framework allowing for the study of any disease
[B] Ability to include data from various sources
[C] Modular architecture designed for reusability

The framework allows for easy implementation and integration. By offering a shared structure and conventions, it reduces the amount of code that needs to be rewritten from scratch. It opens up a transdisciplinary avenue of research for collaborative, open-source predictive medicine (similar to what already exists for image prediction).
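To make the idea of reusable parts more concrete, here is a minimal sketch of a modular pipeline with one feature extractor per data source and a shared aggregator. Every name in it (`ModularPipeline`, `mean_and_last`, `concatenate`, `placeholder_size`) is hypothetical and chosen for illustration only. It is not the MIPHA API, and the zero-filling of missing sources is just one possible assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical names throughout: this is NOT the MIPHA API, only an
# illustration of "one extractor per source, then aggregate".
Features = List[float]
Extractor = Callable[[Sequence[float]], Features]


def mean_and_last(values: Sequence[float]) -> Features:
    """Toy feature extractor: summarises a series by its mean and last value."""
    return [sum(values) / len(values), values[-1]]


def concatenate(parts: List[Features]) -> Features:
    """Toy aggregator: concatenates the per-source feature vectors."""
    return [x for part in parts for x in part]


@dataclass
class ModularPipeline:
    extractors: Dict[str, Extractor]  # one extractor per data source
    aggregator: Callable[[List[Features]], Features]
    placeholder_size: int = 2  # number of zeros used for a missing source (assumption)

    def featurize(self, sources: Dict[str, Optional[Sequence[float]]]) -> Features:
        parts = []
        for name, extractor in self.extractors.items():
            data = sources.get(name)
            if data:
                parts.append(extractor(data))
            else:
                # Missing source: fall back to zeros so the output size stays stable.
                parts.append([0.0] * self.placeholder_size)
        return self.aggregator(parts)


pipeline = ModularPipeline(
    extractors={"labs": mean_and_last, "ecg": mean_and_last},
    aggregator=concatenate,
)

full = pipeline.featurize({"labs": [1.0, 2.0, 3.0], "ecg": [0.5, 1.5]})
labs_only = pipeline.featurize({"labs": [1.0, 2.0, 3.0]})
print(full)
print(labs_only)
```

Because each extractor is independent, swapping or retraining the one for a single source would leave the others untouched, which is the property the modularity questions below aim to evaluate.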

In the future, the framework's modular approach could even allow for improved explainability.

Desired Outcome

Empirically demonstrate the value of the MIPHA framework for disease prediction. We want to answer the following questions:

[A] Regarding disease prediction
- [ ] #10
- [ ] #11
- [ ] #29
[B] Regarding the use of several sources for disease prediction
- [ ] #12
- [ ] #13
- [ ] #14
[C] Regarding the modularity of MIPHA
- [ ] #15
- [ ] #17
- [ ] #19

Current State

Currently, our models are able to predict stage 4/5 chronic kidney disease up to a year in advance, using a year of biological history. We have identified the following possible improvements:

Generalize the model to other diseases
Facilitate the retraining of the model for other use cases (e.g. when the available data is different)
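As an illustration of the "one year of history, up to one year ahead" setup described above, the sketch below builds a single training example from dated laboratory results. The 365-day window lengths, the `build_example` helper and the `(date, value)` layout are assumptions made for this example, not the project's actual preprocessing.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

# Assumptions for illustration: a 365-day history window and a 365-day
# prediction horizon, both measured from a hypothetical prediction date.
HISTORY = timedelta(days=365)
HORIZON = timedelta(days=365)

LabResult = Tuple[date, float]  # (sampling date, value)


def build_example(
    labs: List[LabResult],
    prediction_date: date,
    onset_date: Optional[date],
) -> Tuple[List[float], int]:
    """Return (history values, label) for one patient at one prediction date."""
    start = prediction_date - HISTORY
    # Keep only results observed during the history window, in chronological order.
    history = [v for d, v in sorted(labs) if start <= d < prediction_date]
    # Positive label if onset falls within the horizon after the prediction date.
    label = int(
        onset_date is not None
        and prediction_date < onset_date <= prediction_date + HORIZON
    )
    return history, label


labs = [
    (date(2020, 1, 10), 55.0),
    (date(2020, 9, 1), 48.0),
    (date(2021, 3, 1), 40.0),  # after the prediction date, so excluded
]
history, label = build_example(labs, date(2021, 1, 1), date(2021, 6, 15))
print(history, label)
```

Keeping the window lengths as named constants makes it easier to rerun the same experiment with a different history length or horizon when generalizing to other diseases.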

Diseases studied

Chronic Kidney Disease (going from stage 2/3 to stage 4/5, or going to stage 4/5 for patients with type II diabetes)
Onset of type II diabetes (usually for people over 40 years old)
Myocardial infarction

Success Criteria

[ ] The model is able to predict diseases with satisfactory performance (this needs to be defined more precisely, but targeting over 85% recall with over 75% precision is a decent rule of thumb)
[ ] Retraining the modular model is faster than retraining a full machine learning model, with minimal loss of performance
[ ] The addition of data sources increases performance. In the context of this issue, the three data sources we will consider are: demographics (age/sex), laboratory results, and ECGs. If ECGs do not provide good results, we can consider using certain diagnoses instead.

Impact

The model should allow for faster iterations in research, and increase the overall performance of disease prediction models.

The model's high transferability and flexibility would also pave the way for open-source, collaborative machine learning models in healthcare.

Metrics

Accuracy
Precision
Recall
F1-Score
Matthews correlation coefficient
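The metrics listed above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch using only the standard library. The function name `binary_metrics` and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not a requirement.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute the listed metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Matthews correlation coefficient: uses all four cells of the matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Made-up counts for illustration only (not project results).
scores = binary_metrics(tp=90, fp=20, fn=10, tn=880)

# Rule of thumb from the success criteria: recall over 85%, precision over 75%.
meets_rule_of_thumb = scores["recall"] > 0.85 and scores["precision"] > 0.75
print(scores, meets_rule_of_thumb)
```

In practice, an established library implementation would likely be preferable. The sketch is only meant to make the definitions explicit.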

Research should also be conducted on measuring the transferability of models.

Constraints

[None]

Solution Architecture

Solution Architecture Description

Data Requirements

Datasets

MIMIC-IV
1
2

Understanding and Exploration

[links to explore issues]
[links to docs/notebooks/knowledge base that inform your solution based on the work done in the project]

Approaches and Experiments

[links to experiment issues]

Future improvements

Implement feature extractors relying on machine learning
Look into adaptive average pooling for the aggregator
Improve the ML model (RNN, Transformers, etc.)
Implement feedback loops into the system
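For context on the adaptive average pooling item above, the sketch below shows the general idea in one dimension: a sequence of any length is reduced to a fixed number of averaged bins, which would let an aggregator accept inputs of varying length. The function name and the floor/ceil binning rule are assumptions made for illustration. This is not an existing MIPHA component.

```python
from typing import List, Sequence


def adaptive_avg_pool_1d(values: Sequence[float], output_size: int) -> List[float]:
    """Average `values` into exactly `output_size` bins, whatever the input length."""
    n = len(values)
    if n == 0 or output_size <= 0:
        raise ValueError("values must be non-empty and output_size positive")
    pooled = []
    for i in range(output_size):
        # Bin i spans [floor(i * n / m), ceil((i + 1) * n / m)), with m = output_size.
        start = (i * n) // output_size
        end = ((i + 1) * n + output_size - 1) // output_size
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


even = adaptive_avg_pool_1d([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)
uneven = adaptive_avg_pool_1d([1.0, 2.0, 3.0, 4.0, 5.0], 2)
print(even)
print(uneven)
```

When the input length is not a multiple of the output size, neighbouring bins overlap slightly, so every bin stays non-empty even for inputs shorter than the requested output.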

SnowHawkeye / disease-prediction