
= MedCLIP
:toc:
:toc-placement!:
:imagesdir: imagedir/

ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]

MedCLIP is a deep-learning network for medical image captioning based on the https://github.com/openai/CLIP[OpenAI CLIP] architecture.

Demos:

* Captioning an image
* Image search is also possible

toc::[]

== Usage

Run `main.ipynb` on a Colab instance. Weights for the model are provided, so you don't need to train again.

== Introduction

CLIP is a beautiful hashing process.

Through encodings and transformations, CLIP learns relationships between natural language and images. The trained model can either caption an image by picking from a set of known captions, or retrieve an image matching a given caption. With appropriate encoders, the CLIP model can be specialised for domain-specific applications. Our hope with MedCLIP is to help radiologists formulate diagnoses.

== Explanation

CLIP works by encoding an image and its related caption into tensors. The model then optimises the last layer of the (transfer-learnable) encoders so that the image and text encodings become as similar as possible (step 1: contrastive pre-training).

image::img_README_4.png[loading=lazy]
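
Below is a minimal sketch of this idea, assuming pre-extracted backbone features (2048-d from a ResNet50 image encoder, 768-d from a ClinicalBERT text encoder) projected into a 256-dimensional shared space. The dimensions and head design are illustrative, not necessarily the notebook's exact configuration.

[source,python]
----
# Illustrative sketch: trainable projection heads over frozen backbone features.
# Dimensions (2048, 768, 256) are assumptions, not the project's exact settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps backbone features into the shared image-text embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        # L2-normalise so that dot products are cosine similarities.
        return F.normalize(self.proj(x), dim=-1)

image_head = ProjectionHead(in_dim=2048)   # e.g. ResNet50 pooled features
text_head = ProjectionHead(in_dim=768)     # e.g. ClinicalBERT [CLS] features

# Toy batch of four pre-extracted image/caption feature pairs.
image_features = torch.randn(4, 2048)
text_features = torch.randn(4, 768)

image_emb = image_head(image_features)     # (4, 256)
text_emb = text_head(text_features)        # (4, 256)

# Pairwise cosine similarities between every image and every caption.
similarity = image_emb @ text_emb.T        # (4, 4)
----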

Once the model is trained, we can query it with new information (step 2: zero-shot inference):

. Take an input.
. Encode it with the custom-trained encoders.
. Find a match (image or text) from the known data set, as sketched below:
.. Go through each entry of the data set.
.. Check its similarity with the current input.
.. Output the pairs with the highest similarity.
. [Optionally] Measure the similarity between the real caption and the guessed one.
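
A hedged sketch of that lookup, assuming the encoders have already produced L2-normalised embeddings; `top_matches` and the embedding size are illustrative, not the repository's API.

[source,python]
----
# Zero-shot lookup: compare a query embedding against every known entry
# and return the closest matches. Names and sizes here are assumptions.
import torch
import torch.nn.functional as F

def top_matches(query_emb: torch.Tensor,
                candidate_embs: torch.Tensor,
                k: int = 3) -> list:
    """Return indices of the k candidates most similar to the query."""
    # Dot product equals cosine similarity for L2-normalised embeddings.
    sims = candidate_embs @ query_emb          # (num_candidates,)
    return torch.topk(sims, k=k).indices.tolist()

# Example: caption an image against a bank of known caption embeddings.
caption_embs = F.normalize(torch.randn(100, 256), dim=-1)  # (N, d) known captions
image_emb = F.normalize(torch.randn(256), dim=-1)          # (d,) query image

print("Most similar captions:", top_matches(image_emb, caption_embs, k=3))
----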

=== Loss

A perfect relationship between encoded images and captions means their encoded representations are identical. This similarity can be measured by taking the softmax of the dot products between the encoded inputs; a perfect encoding yields the identity matrix.

image::img_README_5.png[loading=lazy]
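
A sketch of the corresponding loss, assuming L2-normalised embeddings and an illustrative temperature value: the softmax over the similarity matrix is pushed towards the identity with a symmetric cross-entropy.

[source,python]
----
# Contrastive loss sketch: image i should pair with caption i (the diagonal
# of the similarity matrix). The temperature value is an assumption.
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    # (batch, batch) matrix of scaled cosine similarities.
    logits = (image_emb @ text_emb.T) / temperature
    # The ideal pairing is the identity: image i goes with caption i.
    targets = torch.arange(logits.size(0))
    # Symmetric cross-entropy over rows (image->text) and columns (text->image).
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.T, targets)
    return (loss_i + loss_t) / 2
----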

== Tools Used

The model was trained on a curated https://medpix.nlm.nih.gov[MedPix] dataset that focuses on Magnetic Resonance, Computed Tomography and X-Ray scans. https://github.com/EmilyAlsentzer/clinicalBERT[ClinicalBERT] was used to encode the text and https://keras.io/api/applications/resnet/[ResNet50] to encode the images.
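
For illustration, here is one way to extract features with those two backbones, using the public Hugging Face checkpoint for ClinicalBERT and the Keras ResNet50 application. The pooling choices ([CLS] token, global average pooling) and the example caption are assumptions; the notebook may wire the backbones differently.

[source,python]
----
# Feature-extraction sketch with the two backbones named above.
# Pooling choices and the sample caption are illustrative assumptions.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# --- Text branch: ClinicalBERT ---
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
bert = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

caption = "Axial T2-weighted MRI showing a hyperintense lesion."  # placeholder text
tokens = tokenizer(caption, return_tensors="pt", truncation=True)
with torch.no_grad():
    text_features = bert(**tokens).last_hidden_state[:, 0]        # (1, 768) [CLS] vector

# --- Image branch: ResNet50 ---
resnet = ResNet50(weights="imagenet", include_top=False, pooling="avg")
image = np.random.rand(1, 224, 224, 3) * 255                      # placeholder scan
image_features = resnet.predict(preprocess_input(image))          # (1, 2048)

print(text_features.shape, image_features.shape)
----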

Similarity between captions was measured using ROUGE, BLEU, METEOR and CIDEr.
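
For reference, a sketch of scoring a predicted caption against its reference with two of these metrics, using the `rouge-score` and NLTK packages; the example captions are made up, and METEOR and CIDEr (usually computed with the coco-caption tools) are omitted.

[source,python]
----
# Caption-similarity sketch with ROUGE-L and BLEU. The captions are
# made-up examples, not data from the MedPix set.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "axial ct shows a hypodense lesion in the right kidney"
prediction = "ct shows a hypodense lesion in the kidney"

# ROUGE-L F-measure between reference and prediction.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

# BLEU with smoothing, since captions are short.
bleu = sentence_bleu([reference.split()], prediction.split(),
                     smoothing_function=SmoothingFunction().method1)

print(f"ROUGE-L: {rouge_l:.3f}  BLEU: {bleu:.3f}")
----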

== Future Work

Some relevant datasets:

* IU Chest X-Ray
* ChestX-ray14
* PEIR Gross
* BCIDR
* CheXpert
* MIMIC-CXR
* PadChest
* ImageCLEF Caption

== Members and Acknowledgements

== Achieved X by doing Y as measured by Z

Implemented a medical image captioning deep-learning model using the CLIP architecture, ResNet50 and ClinicalBERT. We obtained a 61% ROUGE similarity score with our implementation on the MedPix dataset.
