Swin UNETR Pretraining: HNSCC Data Extraction

coxjoseph commented 1 year ago

When trying to pretrain the Swin Transformer model found in research-contributions/SwinUNETR/Pretrain/, I became aware of a discrepancy between the HNSCC json and the TCIA Colonography json.

The two JSON files downloaded from the links in the README (dataset_HNSCC_0.json and dataset_TCIAcolon_v2_0.json), while named correctly, both reference images in the images directory. At first I assumed this just meant I had to rename the files from one dataset or renumber them according to some ordering. On further inspection, however, the two files reference 602 of the same image paths in the same directory. Reading through the code, these images do not seem to be handled any differently, which leads me to believe that either one of the JSON files is linked incorrectly or the code is loading the same images several times while treating them as coming from different datasets. If the JSON files are correct, could you please advise on how to rename or reorder the image files to pretrain the model correctly?

Here's a short Python script to confirm that the two files do reference the same images (place both JSON files in a subdirectory jsons relative to your working directory):

import json

def get_image_paths(json_file: dict) -> set:
    """Collect every image path listed under 'training' and 'validation'."""
    training_images = json_file['training']  # List of dicts with only one key
    training_paths = [training_image['image'] for training_image in training_images]

    validation_images = json_file['validation']
    validation_paths = [validation_image['image'] for validation_image in validation_images]

    # A set removes duplicates so the two files can be intersected directly
    return set(training_paths).union(validation_paths)

if __name__ == '__main__':
    with open('./jsons/dataset_HNSCC_0.json', 'r') as hnscc, \
         open('./jsons/dataset_TCIAcolon_v2_0.json', 'r') as colon:
        hnscc_json = json.load(hnscc)
        colon_json = json.load(colon)

    hnscc_paths = get_image_paths(hnscc_json)
    colon_paths = get_image_paths(colon_json)

    paths_in_common = hnscc_paths.intersection(colon_paths)

    print(f'Found {len(paths_in_common)} paths in common.')

> Found 602 paths in common.

coxjoseph commented 1 year ago

I see now that each dataset is placed in its own directory (dataset/datset1, dataset/datset2, dataset/datset3, dataset/datset4, and dataset/datset8); perhaps the README could be a bit clearer on that. But this leads to a slightly different issue: the HNSCC data contains only 609 subjects, yet the JSON file in question references images beyond img_1000.nii.gz. How are the images extracted and processed from the HNSCC dataset?
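
For anyone repeating the check, the overlap only means something once each path is qualified by its dataset directory. A minimal sketch, assuming the same JSON layout as the script above; the directory names and the sample entries are hypothetical stand-ins, not taken from the repository, so the real JSON-to-directory pairing has to be read from the pretraining code:

```python
import posixpath

def qualified_paths(datalist: dict, data_dir: str) -> set:
    """Join every training/validation image path onto its dataset directory."""
    entries = datalist.get('training', []) + datalist.get('validation', [])
    return {posixpath.join(data_dir, entry['image']) for entry in entries}

# Hypothetical stand-ins for two loaded JSON files that reuse the same relative names.
hnscc_json = {
    'training': [{'image': 'images/img_1.nii.gz'}, {'image': 'images/img_2.nii.gz'}],
    'validation': [{'image': 'images/img_3.nii.gz'}],
}
colon_json = {
    'training': [{'image': 'images/img_1.nii.gz'}],
    'validation': [{'image': 'images/img_2.nii.gz'}],
}

# Comparing the relative paths alone reports a (misleading) overlap.
raw_overlap = qualified_paths(hnscc_json, '') & qualified_paths(colon_json, '')

# Assumed directory names, for illustration only.
real_overlap = (qualified_paths(hnscc_json, 'dataset/hnscc')
                & qualified_paths(colon_json, 'dataset/colon'))

print(f'Relative paths in common: {len(raw_overlap)}')
print(f'Directory-qualified paths in common: {len(real_overlap)}')
```

With per-dataset directories, identical relative names such as images/img_1.nii.gz no longer collide, which is why the 602 shared paths do not by themselves indicate duplicated data.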

tangy5 commented 1 year ago

Hi @coxjoseph, thanks so much for the question. The raw HNSCC dataset should contain more than 609 volumes; there are ~1300 CT volumes. I guess there might be an inconsistency when converting DICOM images to NIfTI format.

How about this: we have a copy that has already been converted to NIfTI, QA'd, and had outliers removed. You can download the HNSCC dataset from this link: https://drive.google.com/file/d/1KU5cq6O1ToN0D7_0YkkV6gZoSSPChSjO/view?usp=share_link

Thanks.

JakobDexl commented 1 year ago

Thanks for your great contribution. Hi @tangy5, could you please provide more information on the TCIAcolon dataset as well, for example the mapping.json? I'm also having trouble finding the correct correspondence.

GLARKI commented 2 months ago

I have a similar difficulty regarding reproducibility, and unfortunately, the link from tangy5 no longer works. Would someone be able to help me?

I've downloaded the HNSCC dataset. Looking at the JSON file, I assume the image IDs correspond to the folder IDs, e.g., 'HNSCC-01-0001'. However, the IDs in the JSON file go up to 1100+, while the IDs in my downloaded cases only go up to 630.

Is this because the database was updated at the request of the PI? See this quote on the database website: "Version 4: Updated 2024/05/15 Replaced Head-Neck-CT-Atlas clinical data file per PI request. The old version is no longer available."

Or did I miss something? Maybe the count still goes up when a person has multiple CTs?

Thank you very much in advance!
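
One way to test the multiple-CTs idea is to count series rather than patients in the download and compare that total with the highest image index in the JSON. A rough sketch, assuming a patient/study/series/*.dcm folder layout and img_N.nii.gz file names; the folder names, the .dcm extension, and the sample tree built below are illustrative assumptions, and the helper names are made up for this example:

```python
import re
import tempfile
from pathlib import Path

def max_image_index(image_names):
    """Largest N among names ending in img_N.nii.gz, or None if none match."""
    indices = []
    for name in image_names:
        match = re.search(r'img_(\d+)\.nii\.gz$', name)
        if match:
            indices.append(int(match.group(1)))
    return max(indices, default=None)

def series_per_patient(root):
    """Map each patient folder to the number of folders directly holding .dcm files."""
    counts = {}
    for patient in sorted(Path(root).iterdir()):
        if patient.is_dir():
            counts[patient.name] = len({f.parent for f in patient.rglob('*.dcm')})
    return counts

# Build a tiny hypothetical download: one patient with two series, one with a single series.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ('HNSCC-01-0001/study_a/series_1/1.dcm',
                'HNSCC-01-0001/study_b/series_1/1.dcm',
                'HNSCC-01-0002/study_a/series_1/1.dcm'):
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    counts = series_per_patient(tmp)

total_series = sum(counts.values())
print(f'{len(counts)} patients, {total_series} series')
print(max_image_index(['images/img_7.nii.gz', 'images/img_1042.nii.gz']))
```

If the series total on the real download is close to the highest index in the JSON, the numbering most likely counts volumes rather than patients; if it is not, the difference points back to the dataset version or the conversion step.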

Project-MONAI / research-contributions

Swin UNETR Pretraining: HNSCC Data Extraction #189