chaoyi-wu / RadFM

The official code for "Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data".

Could this great model be utilized as a pre-trained feature extractor for radiology images without the need for accompanying language inputs? #19

Closed · HardworkingLittlequ closed this issue 6 months ago

HardworkingLittlequ commented 7 months ago

Dear Chaoyi,

I trust this message finds you well. I recently had the opportunity to delve into your remarkable work, and I must express my admiration for the innovative approach and substantial contributions outlined in your article.

In particular, the integration of language and vision in your proposed large model captured my attention. The versatility showcased in handling combinations of visual images and language questions is indeed impressive. However, my inquiry pertains to the potential applicability of your model as a standalone feature extractor for radiology images.

Given the success of models like CLIP in serving as effective image encoders, I am curious to know if your model, too, could be employed in a similar capacity. Can it be utilized as a pre-trained feature extractor for radiology images without the need for accompanying language inputs? I am interested in understanding the extent to which your model's capabilities extend to image processing tasks in the domain of radiology.

Thank you for your time and consideration. I look forward to gaining insights into this aspect of your work and exploring potential applications in the realm of medical imaging.

chaoyi-wu commented 6 months ago

Certainly. You can check the fine-tuning results in our paper; the diagnosis tasks are all initialized from the image part only.

To do this, you can simply delete the language part and the perceiver (or keep the perceiver if you prefer; I delete it) in the model Python files, and then load the checkpoint with strict=False.
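A minimal sketch of that loading step, assuming an illustrative import path and checkpoint filename (the class name `VisionEncoder`, the module path, and the input shape below are assumptions, not the exact RadFM API):

```python
import torch

# Assumed import path -- point this at the vision-encoder class actually
# defined in the RadFM model files (the 3D ViT used for image embedding).
from Model.RadFM.vit_3d import ViT as VisionEncoder

# Construct the encoder with the same configuration used for pre-training.
vision_encoder = VisionEncoder()

# Load the released checkpoint; strict=False skips keys belonging to the
# removed language/perceiver modules instead of raising an error.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
missing, unexpected = vision_encoder.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, ignored (language-side) keys: {len(unexpected)}")

# Note: checkpoint keys may carry a module prefix (e.g. "...vision_encoder.");
# if nothing matches, strip that prefix from the state_dict keys first.

# Feature extraction on a dummy radiology volume
# (assumed layout: batch x channels x depth x height x width).
with torch.no_grad():
    volume = torch.randn(1, 1, 64, 256, 256)
    features = vision_encoder(volume)
```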

Generally, the pre-trained ViT provides better initial weights than training from scratch, but we also notice that the whole model is still better suited to generation tasks that use both the language and visual parts, i.e., it achieves more gain on VQA-like tasks than on vision-only tasks such as diagnosis.

More recently, we also released a new vision-only model at https://huggingface.co/QiaoyuZheng/RP3D-DiagModel and https://github.com/qiaoyu-zheng/RP3D-Diag. That model is pre-trained with classification labels, and I think it is more suitable for classical vision-only diagnosis tasks, whereas the visual embedding in RadFM is trained mainly to map images into the language space.

HardworkingLittlequ commented 6 months ago

I appreciate your reply; it really helps me a lot!