BMEII-AI / RadImageNet

RadImageNet: pre-trained convolutional neural networks trained solely on medical images, intended as the basis for transfer learning in medical imaging applications.
MIT License

Duplicates in RadImageNet dataset #17

Open StefanDenn3r opened 8 months ago

StefanDenn3r commented 8 months ago

Hi, first of all, thanks for publishing the RadImageNet dataset! While working with it, I found quite a few duplicate files when comparing the MD5 hashes of the images. They fall into four cases:

  1. Different pathology (i.e. different folder). This would then essentially be a multi-label setting, e.g.
    • CT/lung/interstitial_lung_disease/lung009382.png and CT/lung/Nodule/lung009382.png (Note: same filename)
    • MR/af/Plantar_plate_tear/foot040499.png and MR/af/plantar_fascia_pathology/ankle027288.png (Note: different filename)
  2. Same pathology, e.g. MR/af/hematoma/foot079779.png and MR/af/hematoma/ankle053088.png
  3. Neighboring samples, e.g. US/gb/usn309850.png and US/gb/usn309851.png
  4. Others, e.g. US/ovary/usn326815.png and US/kidney/usn348701.png

So far, I haven't checked whether duplicates occur across the dataset split you used, but since you write in your paper that you split patient-wise, this shouldn't be the case.

However, the following questions arise:

  1. As I understand the paper, the dataset is intended to be single-label, not multi-label, so I am confused to find samples like those in case 1. Can the dataset be treated as multi-label, i.e. is each of the 165 pathologies labeled in every image where it is present?

  2. For cases 2-4, the duplicates only create an imbalance and provide no additional information. Are you planning to remove them?

In total this results in:

    Number of duplicate groups: 62751
    Total duplicate files: 126074

I attached a duplicates.json with all the duplicates found. It is a dictionary where each key is an MD5 hash and each value is the list of image paths with that hash.
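A dictionary of this shape can be sorted mechanically into rough categories by comparing the folders of each group's members. The following is only a minimal sketch: the function names `classify_group` and `summarize` and the placeholder keys `hash_a` to `hash_c` are made up for illustration, and it assumes the first path component is the modality and the parent folder is the label. It cannot tell case 2 from case 3, since both are same-folder duplicates.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, List


def classify_group(paths: List[str]) -> str:
    """Label one duplicate group by how its members' folders relate."""
    folders = {PurePosixPath(p).parent for p in paths}
    modalities = {PurePosixPath(p).parts[0] for p in paths}
    if len(modalities) > 1:
        return "cross_modality"  # e.g. a CT and a US file sharing one hash
    if len(folders) > 1:
        return "cross_folder"    # same modality, different folder (cases 1 and 4)
    return "same_folder"         # identical folder (cases 2 and 3)


def summarize(duplicates: Dict[str, List[str]]) -> Counter:
    """Count duplicate groups per category."""
    return Counter(classify_group(paths) for paths in duplicates.values())


# Toy input built from example paths quoted in this issue; the keys are
# placeholders, not real MD5 digests.
example = {
    "hash_a": [
        "CT/lung/interstitial_lung_disease/lung009382.png",
        "CT/lung/Nodule/lung009382.png",
    ],
    "hash_b": ["MR/af/hematoma/foot079779.png", "MR/af/hematoma/ankle053088.png"],
    "hash_c": ["US/ovary/usn326815.png", "US/kidney/usn348701.png"],
}
print(summarize(example))
```

On the real file, `summarize(json.load(open("duplicates.json")))` would give a per-category count of groups, which might help decide how many groups are label conflicts rather than plain repeats.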

For reproducibility, here is the script I wrote to detect the duplicates:

import hashlib
from pathlib import Path
from typing import Dict, List, Tuple
from tqdm import tqdm
import json
import argparse

def process_image_md5(image_path: Path) -> Tuple[Path, str]:
    """
    Generate the MD5 hash of the given image file.

    Parameters:
    image_path (Path): The path to the image file.

    Returns:
    tuple: A tuple containing the image path and its MD5 hex digest.
    """
    hash_md5 = hashlib.md5()
    # Read in fixed-size chunks so large files are not loaded into memory at once.
    with open(image_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return image_path, hash_md5.hexdigest()

def find_duplicates(root_directory: Path) -> Dict[str, List[str]]:
    """
    Groups all PNG images under the given directory by MD5 hash.

    Parameters:
    root_directory (Path): The root directory to search for images.

    Returns:
    dict: A dictionary where each key is an MD5 hash and its value is a list of image paths
          (relative to root_directory) with that hash. Hashes with only one path are included;
          the caller filters for groups with more than one path.
    """
    image_paths: List[Path] = list(root_directory.rglob("*.png"))
    results: List[Tuple[Path, str]] = [
        process_image_md5(image_path) for image_path in tqdm(image_paths)
    ]

    hash_paths_dict: Dict[str, List[str]] = {}
    for image_path, digest in results:
        # Store paths relative to the dataset root so the output is portable.
        relative_path: str = str(image_path.relative_to(root_directory))
        hash_paths_dict.setdefault(digest, []).append(relative_path)

    return hash_paths_dict

def save_duplicates_to_json(duplicates: Dict[str, List], filename: Path):
    """
    Saves the duplicates dictionary to a JSON file.

    Parameters:
    duplicates (dict): The duplicates dictionary.
    filename (Path): The path to the JSON file where the results will be saved.
    """
    with open(filename, "w") as file:
        json.dump(duplicates, file, indent=4)

def main():
    """
    Main function to handle command line arguments and invoke duplicate finding and saving.
    """
    parser = argparse.ArgumentParser(
        description="Find and save duplicates in a dataset."
    )
    parser.add_argument(
        "root_directory", type=Path, help="Root directory of the images"
    )
    parser.add_argument(
        "json_filename", type=Path, help="Filename to save the duplicates JSON"
    )
    args = parser.parse_args()

    print(
        f"Searching for duplicates in {args.root_directory} and writing to {args.json_filename}"
    )

    hash_paths_dict: Dict[str, List[str]] = find_duplicates(args.root_directory)

    # Keep only hashes shared by more than one file.
    duplicates: Dict[str, List[str]] = {
        digest: paths for digest, paths in hash_paths_dict.items() if len(paths) > 1
    }

    save_duplicates_to_json(duplicates, args.json_filename)

    # Number of duplicate groups
    num_duplicate_groups: int = len(duplicates)

    # Number of duplicate files
    num_duplicate_files: int = sum(len(paths) for paths in duplicates.values())

    print(f"Duplicates saved to {args.json_filename}")
    print(f"Number of duplicate groups: {num_duplicate_groups}")
    print(f"Total duplicate files: {num_duplicate_files}")

if __name__ == "__main__":
    main()
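
If the same-folder duplicates were to be removed, one possible approach is sketched below. This is an assumption about how cleanup could work, not something the maintainers have proposed: the helper name `plan_removal` is hypothetical, it keeps the lexicographically first path of each same-folder group, and it only sets aside groups whose members sit in different folders, since choosing the correct label there requires a decision by the dataset owners.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple


def plan_removal(
    duplicates: Dict[str, List[str]],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split duplicate groups into files to drop and groups needing review."""
    to_drop: List[str] = []
    needs_review: Dict[str, List[str]] = {}
    for digest, paths in duplicates.items():
        folders = {PurePosixPath(p).parent for p in paths}
        if len(folders) == 1:
            # Same folder, hence same label: keep the first path, drop the rest.
            _keep, *rest = sorted(paths)
            to_drop.extend(rest)
        else:
            # Conflicting folders: do not guess which label is correct.
            needs_review[digest] = paths
    return to_drop, needs_review
```

Nothing is deleted here; the function only returns a list of candidate paths, so the result can be inspected before any file is touched.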

StefanDenn3r commented 8 months ago

Furthermore, quite a few samples are simply empty images. For example, all of the following files share a single hash:

"ee4d421f59bd462d212ce24753493da4": [
        "US/thyroid/usn418235.png",
        "CT/abd/normal/abd-normal039513.png",
        "CT/abd/normal/abd-normal063889.png",
        "CT/abd/normal/abd-normal053270.png",
        "CT/abd/normal/abd-normal002528.png",
        "CT/abd/normal/abd-normal046244.png",
        "CT/abd/normal/abd-normal016491.png",
        "CT/abd/normal/abd-normal056338.png",
        "CT/abd/normal/abd-normal016490.png",
        "CT/abd/normal/abd-normal068850.png",
        "CT/abd/normal/abd-normal047066.png",
        "CT/abd/normal/abd-normal049461.png",
        "CT/abd/normal/abd-normal039070.png",
        "CT/abd/normal/abd-normal018254.png",
        "CT/abd/normal/abd-normal017359.png",
        "CT/abd/normal/abd-normal063883.png",
        "CT/abd/normal/abd-normal017361.png",
        "CT/abd/normal/abd-normal016112.png",
        "CT/abd/normal/abd-normal021904.png",
        "CT/abd/normal/abd-normal051763.png",
        "CT/abd/normal/abd-normal014918.png",
        "CT/abd/normal/abd-normal035602.png",
        "CT/abd/normal/abd-normal034951.png",
        "CT/abd/normal/abd-normal016113.png",
        "CT/abd/normal/abd-normal068847.png",
        "CT/abd/normal/abd-normal069026.png",
        "CT/abd/normal/abd-normal020266.png",
        "CT/abd/normal/abd-normal058409.png",
        "CT/abd/normal/abd-normal016492.png",
        "CT/abd/normal/abd-normal014914.png",
        "CT/abd/normal/abd-normal016117.png",
        "CT/abd/normal/abd-normal068845.png",
        "CT/abd/normal/abd-normal024872.png",
        "CT/abd/normal/abd-normal063887.png",
        "CT/abd/normal/abd-normal006929.png",
        "CT/abd/normal/abd-normal038923.png",
        "CT/abd/normal/abd-normal016489.png",
        "CT/abd/normal/abd-normal014920.png",
        "CT/abd/normal/abd-normal068849.png",
        "CT/abd/normal/abd-normal002527.png",
        "CT/abd/normal/abd-normal023009.png",
        "CT/abd/normal/abd-normal028309.png",
        "CT/abd/normal/abd-normal068848.png",
        "CT/abd/normal/abd-normal012402.png",
        "CT/abd/normal/abd-normal002529.png",
        "CT/abd/normal/abd-normal027692.png",
        "CT/abd/normal/abd-normal014917.png",
        "CT/abd/normal/abd-normal038921.png",
        "CT/abd/normal/abd-normal063886.png",
        "CT/abd/normal/abd-normal017360.png",
        "CT/abd/normal/abd-normal014922.png",
        "CT/abd/normal/abd-normal016493.png",
        "CT/abd/normal/abd-normal063884.png",
        "CT/abd/normal/abd-normal048245.png",
        "CT/abd/normal/abd-normal026267.png",
        "CT/abd/normal/abd-normal050842.png",
        "CT/abd/normal/abd-normal068196.png",
        "CT/abd/normal/abd-normal059727.png",
        "CT/abd/normal/abd-normal004505.png",
        "CT/abd/normal/abd-normal039068.png",
        "CT/abd/normal/abd-normal025695.png",
        "CT/abd/normal/abd-normal043546.png",
        "CT/abd/normal/abd-normal066519.png",
        "CT/abd/normal/abd-normal038082.png",
        "CT/abd/normal/abd-normal045106.png",
        "CT/abd/normal/abd-normal030263.png",
        "CT/abd/normal/abd-normal066054.png",
        "CT/abd/normal/abd-normal002526.png",
        "CT/abd/normal/abd-normal054393.png",
        "CT/abd/normal/abd-normal034294.png",
        "CT/abd/normal/abd-normal039072.png",
        "CT/abd/normal/abd-normal002524.png",
        "CT/abd/normal/abd-normal019419.png",
        "CT/abd/normal/abd-normal007039.png",
        "CT/abd/normal/abd-normal020776.png",
        "CT/abd/normal/abd-normal018253.png",
        "CT/abd/normal/abd-normal014921.png",
        "CT/abd/normal/abd-normal011770.png",
        "CT/abd/normal/abd-normal050088.png",
        "CT/abd/normal/abd-normal039073.png",
        "CT/abd/normal/abd-normal058178.png",
        "CT/abd/normal/abd-normal056852.png",
        "CT/abd/normal/abd-normal063888.png",
        "CT/abd/normal/abd-normal048689.png",
        "CT/abd/normal/abd-normal017358.png",
        "CT/abd/normal/abd-normal016115.png",
        "CT/abd/normal/abd-normal002651.png",
        "CT/abd/normal/abd-normal014915.png",
        "CT/abd/normal/abd-normal040704.png",
        "CT/abd/normal/abd-normal058408.png",
        "CT/abd/normal/abd-normal002530.png",
        "CT/abd/normal/abd-normal038924.png",
        "CT/abd/normal/abd-normal038770.png",
        "CT/abd/normal/abd-normal063005.png",
        "CT/abd/normal/abd-normal012403.png",
        "CT/abd/normal/abd-normal005817.png",
        "CT/abd/normal/abd-normal038922.png",
        "CT/abd/normal/abd-normal039071.png",
        "CT/abd/normal/abd-normal059091.png",
        "CT/abd/normal/abd-normal048244.png",
        "CT/abd/normal/abd-normal016116.png",
        "CT/abd/normal/abd-normal069025.png",
        "CT/abd/normal/abd-normal039069.png",
        "CT/abd/normal/abd-normal025111.png",
        "CT/abd/normal/abd-normal060343.png",
        "CT/abd/normal/abd-normal021311.png",
        "CT/abd/normal/abd-normal018255.png",
        "CT/abd/normal/abd-normal067181.png",
        "CT/abd/normal/abd-normal014919.png",
        "CT/abd/normal/abd-normal064379.png",
        "CT/abd/normal/abd-normal011865.png",
        "CT/abd/normal/abd-normal048784.png",
        "CT/abd/normal/abd-normal029406.png",
        "CT/abd/normal/abd-normal063885.png",
        "CT/abd/normal/abd-normal016114.png",
        "CT/abd/normal/abd-normal062317.png",
        "CT/abd/normal/abd-normal068846.png",
        "CT/abd/normal/abd-normal054395.png",
        "CT/abd/normal/abd-normal054303.png",
        "CT/abd/normal/abd-normal002400.png",
        "CT/abd/normal/abd-normal044746.png",
        "CT/abd/normal/abd-normal068195.png",
        "CT/abd/normal/abd-normal066049.png",
        "CT/abd/normal/abd-normal033146.png",
        "CT/abd/normal/abd-normal068705.png",
        "CT/abd/normal/abd-normal014916.png",
        "CT/abd/normal/abd-normal057421.png",
        "CT/abd/normal/abd-normal051215.png",
        "CT/abd/normal/abd-normal065509.png",
        "CT/abd/normal/abd-normal009734.png",
        "CT/abd/normal/abd-normal032744.png",
        "CT/abd/normal/abd-normal022428.png",
        "CT/abd/normal/abd-normal028845.png",
        "CT/abd/normal/abd-normal028748.png",
        "CT/abd/normal/abd-normal063890.png",
        "MR/mriabd/normal/mri-abd-normal047494.png",
        "MR/mriabd/normal/mri-abd-normal047532.png",
        "MR/mriabd/normal/mri-abd-normal047530.png",
        "MR/mriabd/normal/mri-abd-normal047531.png",
        "MR/mriabd/normal/mri-abd-normal047529.png",
        "MR/mriabd/normal/mri-abd-normal047492.png",
        "MR/mriabd/normal/mri-abd-normal047493.png",
        "MR/mriabd/normal/mri-abd-normal047491.png",
        "MR/af/chondral_abnormality/ankle026697.png"
    ],
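
Groups like this one can be surfaced from the duplicates dictionary by ranking groups by size and listing which top-level folders they span. The sketch below rests on the assumption that an unusually large group crossing modalities is a likely blank image; that is only a heuristic, and the files would still need to be opened to confirm it. The name `largest_groups` is hypothetical.

```python
from typing import Dict, List, Tuple


def largest_groups(
    duplicates: Dict[str, List[str]], top_n: int = 5
) -> List[Tuple[str, int, List[str]]]:
    """Return (hash, group size, sorted top-level folders) for the largest groups."""
    ranked = sorted(duplicates.items(), key=lambda item: len(item[1]), reverse=True)
    return [
        (digest, len(paths), sorted({p.split("/")[0] for p in paths}))
        for digest, paths in ranked[:top_n]
    ]
```

Applied to the group above, the third element of the tuple would list CT, MR and US, which is what makes it stand out from ordinary same-folder duplicates.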