DeepMicroscopy / QuiltCleaner

Automatic cleaning of the QUILT-1M pathology dataset
MIT License

Filtering Images #1

Open Awj2021 opened 2 months ago

Awj2021 commented 2 months ago

Hi, this is amazing work on dealing with the bad images in such a large-scale dataset! I have some questions.

  1. Besides the generative tasks, did you also try to re-train the CLIP model using the filtered dataset?
  2. After filtering, how many images are used to train your model?
  3. Apart from comparing metrics like FID for generative tasks, do you use other metrics to evaluate your method's performance?

I would really appreciate it if you could give some tips on my questions.

Best.

maubreville commented 2 months ago

Hey there,

thanks a lot for this feedback. Regarding your questions:

  1. This is on my list of things to try soon, though I haven't done it yet. The GPUs are currently a bit busy, and so am I, but it's definitely on my list.
  2. After filtering with the CONCH scores, I have around 100k images; after filtering only on the impurities, I have around 173k images. I believe I used the median CONCH score of the whole dataset, which would explain why one is not 50% of the other (see the sketch after this list).
  3. Mostly visual evaluation for now, besides FID. For a CLIP-like model, I think this would also come down to evaluating on downstream tasks, such as zero-shot classification. Do you have any good ideas here? :-)
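
To make point 2 concrete, here is a minimal sketch of what median-threshold filtering on the CONCH scores could look like. The CSV layout and the column names `image_path` and `conch_score` are my own assumptions for illustration, not the actual QuiltCleaner format:

```python
# Minimal sketch of median-based filtering on CONCH image-text scores.
# The file name and columns "image_path" / "conch_score" are hypothetical,
# not taken from the QuiltCleaner repo.
import pandas as pd

df = pd.read_csv("quilt_conch_scores.csv")

# Keep only pairs whose score exceeds the dataset-wide median.
threshold = df["conch_score"].median()
filtered = df[df["conch_score"] > threshold]

print(f"Kept {len(filtered)} of {len(df)} pairs (threshold={threshold:.3f})")
filtered.to_csv("quilt_filtered.csv", index=False)
```

Because the median is computed over the whole dataset but applied on top of the impurity filtering, the remaining count is not exactly half of the impurity-filtered set, which matches the 100k vs. 173k numbers above.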

Awj2021 commented 2 months ago

Thank you for your kind reply. Besides filtering out the low-quality images, could we also use other methods, e.g., detection, to crop away portraits and other irrelevant content? Beyond the images, some of the texts are also incomprehensible or irrelevant.

Since the result after filtering is around 100k images, only about 15% of the original dataset size (768K), it may be better to keep the histopathology part of each image, e.g., via detection & cropping.

maubreville commented 2 months ago

Yes, I absolutely think you are right. It makes a lot of sense to use segmentation + post-processing or detection methods to crop the actual pathology image parts, on top of our initial filtering approach.
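
As a rough sketch of such a crop step, one could even start with a simple saturation-based tissue mask instead of a learned detector; the heuristic, thresholds, and names below are illustrative assumptions, not part of QuiltCleaner:

```python
# Rough sketch: crop the histopathology region via a saturation-based
# tissue mask instead of a learned detector. Thresholds are illustrative
# assumptions, not part of QuiltCleaner.
import numpy as np
from PIL import Image

def crop_tissue(path, sat_thresh=25, pad=10):
    img = Image.open(path).convert("RGB")
    hsv = np.asarray(img.convert("HSV"))
    # Stained tissue is typically more saturated than white background or UI.
    mask = hsv[..., 1] > sat_thresh
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return img  # no tissue found; keep the original frame
    box = (max(xs.min() - pad, 0), max(ys.min() - pad, 0),
           min(xs.max() + pad, img.width), min(ys.max() + pad, img.height))
    return img.crop(box)

crop_tissue("frame.jpg").save("frame_cropped.jpg")
```

A learned detection or segmentation model would of course be more robust against screenshots with colorful UI elements; the saturation heuristic is only meant to illustrate the crop step.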

I think the question of whether text and image are aligned should be reflected, at least in principle, by filtering on the CONCH scores. A suitable threshold still needs to be defined, of course. The quality of the text might interfere with this score; text filtering could be done beforehand, given that a suitable classifier is available.

BTW: Where does your estimate of 768k images come from? To the best of my understanding, the QUILT-1M dataset consists of 1M descriptions linked to some 650K images.

Best regards,

Marc

Awj2021 commented 2 months ago

Thank you for pointing out my mistake. Sorry, I was confused about it: the 768K is the number of image-text pairs, as the official website shows.

maubreville commented 2 months ago

No worries. I was also confused that the "1M" dataset only contains around 650K images. ;-) Cheers.