Public dataset preprocessing and public data selection strategy for pretraining

fazlicodes commented 1 year ago

Hi, no code is provided for processing the public datasets, and it is unclear which data were discarded from each public dataset.

Lee-Gihun commented 1 year ago

We described how we processed the public datasets in our paper (page 9, Public Data Usage).

For clarification, let me elaborate on some details:

OmniPose: this dataset is organized into subdirectories named bacteria, worms, ... . We selected only the bacteria and worms subdirectories.
CellPose: We excluded the obviously non-microscopy images (such as strawberries, jellyfish, stones, ...) from this dataset.
- (However, we checked that including those non-microscopy images is not harmful to the performance.)
All images are converted to gray scale.
- We converted to grayscale because the Cellpose dataset stores images as (R, G, Null), where every value in the third channel of the array is 0. We simply average the two color channels as (R+G)/2, which changes the shape from (H, W, 3) to (H, W): grayscale_array = (original_array[:, :, 0] + original_array[:, :, 1]) / 2
LiveCell, DSBowl2018: We just changed the file extensions to prevent unexpected errors.
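
For readers who want to reproduce the grayscale step, here is a minimal sketch of the (R+G)/2 conversion described above. It is not code from the MEDIAR repository: the function name rg_to_grayscale and the synthetic example array are hypothetical, and the cast to float32 before adding is an assumption made here so that 8-bit inputs do not wrap around during the addition.

```python
import numpy as np


def rg_to_grayscale(original_array: np.ndarray) -> np.ndarray:
    """Average the R and G channels of an (H, W, 3) array into an (H, W) array.

    Hypothetical helper, not part of the MEDIAR code base.
    """
    if original_array.ndim != 3 or original_array.shape[2] < 2:
        raise ValueError("expected an array of shape (H, W, 3)")
    # Assumption: cast to float first. Adding two uint8 channels directly
    # would wrap around for sums above 255.
    red = original_array[:, :, 0].astype(np.float32)
    green = original_array[:, :, 1].astype(np.float32)
    return (red + green) / 2


# Synthetic (R, G, Null) image: the third channel is all zeros.
example = np.zeros((2, 2, 3), dtype=np.uint8)
example[:, :, 0] = 200
example[:, :, 1] = 100

gray = rg_to_grayscale(example)
print(gray.shape)  # (2, 2)
print(gray[0, 0])  # 150.0
```

If the source images are already floating point, the cast is harmless; if they are uint8, skipping it would give 22.0 instead of 150.0 for the example pixel above.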

We did not use specific processing code to process public data.

fazlicodes commented 1 year ago

Noted, thank you for your quick response!

fazlicodes commented 1 year ago

@Lee-Gihun How did you locate and exclude the non-microscopy images in the Cellpose dataset?

Lee-Gihun commented 1 year ago

We manually removed a few obviously non-microscopy images from the set (about 10 to 20 images).

To the best of my understanding, the original Cellpose paper treats the cell segmentation problem as finding the unit entities in an image, although the authors did not mention such details in the paper.

At first, I thought these images might hurt performance, so I removed them, but there was no noticeable difference. This may be because the Cellpose images are only a small portion of our entire pretraining set, and the testing modalities in the challenge datasets do not contain such non-cell entities.

Lee-Gihun / MEDIAR