batmanlab / Mammo-CLIP

Official Pytorch implementation of MICCAI 2024 paper (early accept, top 11%) Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography
https://shantanu-ai.github.io/projects/MICCAI-2024-Mammo-CLIP/
Creative Commons Attribution 4.0 International

Pre-training with RSNA text reports #6

Closed devamsheth21 closed 4 months ago

devamsheth21 commented 4 months ago

Hi, I wanted to pre-train the same model with the RSNA dataset. However, since RSNA doesn't have text reports, can we generate templated text reports from the RSNA dataset attributes using the preprocessing you used for the VinDr dataset? If so, what modifications would you recommend for the RSNA csv file?

Thank you

shantanu-ai commented 4 months ago

Hi @devamsheth21 , Thanks for using our repo. You can generate templated texts from any arbitrary image+label dataset. So yes, you can use the RSNA dataset, with the following caveats:

  1. RSNA has only cancer/no-cancer labels, unlike VinDr, which has 8 finding labels (e.g., mass, calcification, etc.). Also, our in-house dataset from UPMC contained screening mammograms + text, and the radiology texts for screening mammograms do not mention cancer/no cancer (a suspicious case is reported as BI-RADS 0, indicating follow-up is required). For this reason, we did not include RSNA data during the pretraining phase in the paper.
  2. Our dataset class and the csv file are tailored toward VinDr for now. VinDr has 4 images per patient - LCC, RCC, LMLO, and RMLO - so we preprocess and generate the templates for this kind of structure only. I think RSNA may have some patients with more than 5 images, so please select only those patients that have the 4 images LCC, RCC, LMLO, and RMLO, and then use our code directly to generate the reports.
  3. The templates provided in the code are for finding labels related to cancer, e.g., mass, calcification, architectural distortion, etc. We did not provide any template for cancer itself because of point 1. You can either use something like "the patient has cancer" or "the patient has carcinoma", or consult a radiologist to come up with dedicated sentences (a rough sketch of points 2-3 follows this list).
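
A minimal sketch of what that filtering and cancer template could look like, assuming the public RSNA csv columns `patient_id`, `laterality`, `view`, and `cancer`; this is an illustration, not the repo's actual preprocessing code:

```python
import pandas as pd

# Assumed RSNA column names (patient_id, laterality, view, cancer);
# adjust them to match your local csv.
df = pd.read_csv("rsna_train.csv")
df["view_name"] = df["laterality"] + df["view"]  # e.g. "L" + "CC" -> "LCC"

REQUIRED_VIEWS = {"LCC", "RCC", "LMLO", "RMLO"}

def has_exactly_four_views(group: pd.DataFrame) -> bool:
    # Keep only patients whose images are exactly the four standard views (point 2).
    return len(group) == 4 and set(group["view_name"]) == REQUIRED_VIEWS

four_view_df = df.groupby("patient_id").filter(has_exactly_four_views)

def cancer_sentence(group: pd.DataFrame) -> str:
    # Hypothetical cancer template (point 3); replace with radiologist-approved sentences.
    return "the patient has cancer" if group["cancer"].max() == 1 else "no evidence of cancer"

reports = four_view_df.groupby("patient_id").apply(cancer_sentence)
```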

With these points in mind, here are the resources to follow to generate reports from labels and include them in the pretraining:

  1. Go to the Image-label dataset section in our readme.
  2. Check out this file to generate the dedicated csv file from the image+label dataset you need to pretrain Mammo-CLIP.
  3. The input to step 2 is this file, which is from the official VinDr dataset. This is a sample csv file produced as the output of step 2.
  4. You can check the logic in the dataset class to see how we generate the text from labels.
  5. Finally, if you want to generate templated text for RSNA, check out this file for the VinDr dataset for reference (a toy illustration of the label-to-text idea follows this list).
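
To illustrate the idea behind steps 4-5, here is a toy, hypothetical label-to-text function; the finding names and sentences below are made up for illustration, and the real templates live in the repo's VinDr preprocessing and dataset code:

```python
# Toy mapping from finding labels to template sentences (illustrative only).
FINDING_TEMPLATES = {
    "Mass": "there is a mass",
    "Suspicious Calcification": "suspicious calcification is seen",
    "Architectural Distortion": "architectural distortion is present",
}

def labels_to_report(labels: dict) -> str:
    """labels maps a finding name to 0/1 for one patient/breast."""
    sentences = [text for name, text in FINDING_TEMPLATES.items() if labels.get(name, 0) == 1]
    if not sentences:
        sentences = ["no significant finding"]
    return ". ".join(sentences) + "."

print(labels_to_report({"Mass": 1, "Suspicious Calcification": 0}))
# -> "there is a mass."
```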

Lmk if you have any further queries.

devamsheth21 commented 4 months ago

Hi @shantanu-ai ,

I have a question regarding the classification task and Table 1 from the paper:

[Screenshot of Table 1 from the paper]

In this table, in the last row, Mammo-CLIP is pre-trained on the UPMC and VinDr datasets, and both finetuning and linear probing are done on the VinDr dataset. I read the dataset section, and VinDr has two splits: train and test. So which split is being used for pretraining? Is the same split being used for finetuning and linear probing? And which test split are the AUC and accuracy results from?

Also, please clarify whether I understood your method correctly: for linear probing and finetuning, you add a linear layer on top of the pre-trained vision encoder and train this architecture with a cross-entropy loss and the labels. Then you evaluate on a different split of the data or a held-out test set, right?

shantanu-ai commented 4 months ago

Hi @devamsheth21 , for pretraining using UPMC+VinDr, we use the training set of the official VinDr split. We also split off 10% of the training set for validation. The original test set was completely held out during pretraining.
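
As a rough illustration of that split (the column names `split` and `patient_id` are assumptions about the VinDr csv, not the repo's exact schema or seed):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

vindr = pd.read_csv("vindr.csv")
train_df = vindr[vindr["split"] == "training"]
test_df = vindr[vindr["split"] == "test"]  # official test set, held out everywhere

# Carve out 10% of the training set for validation, grouped by patient so all
# images of one patient land in the same fold.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=0)
train_idx, val_idx = next(splitter.split(train_df, groups=train_df["patient_id"]))
pretrain_train, pretrain_val = train_df.iloc[train_idx], train_df.iloc[val_idx]
```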

For the downstream tasks (both linear probing and finetuning), we use the same training set of VinDr that was used in pre-training. The numbers here are based on the official test set of VinDr.

For linear probing and finetuning, we attach a linear layer on top of the vision encoder. For linear probing, the backbone vision encoder is kept fixed; for finetuning, the vision encoder is also updated. Yes, we use a cross-entropy loss for training. All evaluation is done on the held-out test set.
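
A minimal sketch of the difference between the two settings, where `vision_encoder`, `feature_dim`, and the output shape of the encoder are placeholders for the Mammo-CLIP image backbone rather than the repo's actual training code:

```python
import torch.nn as nn

class LinearHeadModel(nn.Module):
    def __init__(self, vision_encoder: nn.Module, feature_dim: int,
                 num_classes: int, freeze_backbone: bool):
        super().__init__()
        self.encoder = vision_encoder
        self.head = nn.Linear(feature_dim, num_classes)
        if freeze_backbone:  # linear probing: only the head is trained
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, x):
        feats = self.encoder(x)  # assumed to return (B, feature_dim) image features
        return self.head(feats)

criterion = nn.CrossEntropyLoss()  # trained against the downstream labels
# Linear probing: LinearHeadModel(encoder, feature_dim, num_classes, freeze_backbone=True)
# Finetuning:     LinearHeadModel(encoder, feature_dim, num_classes, freeze_backbone=False)
```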