Project-MONAI / research-contributions

Implementations of recent research prototypes/demonstrations using MONAI.
https://monai.io/
Apache License 2.0

About Pretraining Data Formats #59

Open WYC-321 opened 2 years ago

WYC-321 commented 2 years ago

I downloaded the dataset for pre-training from TCIA, but the downloaded data is in .dcm format, which is inconsistent with the .nii.gz format used in the json file. I wonder whether some format conversion was performed?

ahatamiz commented 2 years ago

Hi @WYC-321

We did convert the DICOM files to NIfTI. In addition, we filtered out some outlier cases according to the information provided in the meta info. Please see the json files containing the exact train/val splits here.

Thanks
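For readers trying to use these splits: the json files appear to follow the Medical Segmentation Decathlon datalist layout, so they can be read with the standard library alone. A minimal sketch (the `"training"`/`"validation"` keys and the `"image"` field are assumptions based on that convention; verify against the actual files in the repo):

```python
import json

def load_datalist(path, split="training"):
    """Read image paths from a Decathlon-style datalist JSON.

    Assumption: the split files use "training"/"validation" keys and an
    "image" field per entry, as in the Decathlon datalist convention;
    check the actual json files shipped with this repo to confirm.
    """
    with open(path) as f:
        datalist = json.load(f)
    return [entry["image"] for entry in datalist.get(split, [])]
```

For example, `load_datalist("dataset_TCIAcolon_v2_0.json")` would then return entries such as `img_19.nii.gz`.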

WYC-321 commented 2 years ago

Hi @ahatamiz: Thank you for your answer. After looking at the dataset, I have some more detailed questions:

(1) Are the DICOM files simply converted to NIfTI without any additional processing?

I noticed that the naming convention in the json file differs from that of the database. For example, in the dataset_TCIAcolon_v2_0.json file the images are named like img_19.nii.gz, but in the TCIA CT Colonography Trial database the directory paths look like CT COLONOGRAPHY\1.3.6.1.4.1.9328.50.4.0019\01-01-2000-1-CT ABD WCONT RECONSTRUCTION-18588. I'm guessing that the 0019 in 1.3.6.1.4.1.9328.50.4.0019 refers to img_19, but there are six subfolders under this directory:

- 1.000000-NA-18589 (1 DICOM file)
- 3.000000-NA-18592 (482 DICOM files)
- 5.000000-NA-19075 (1 DICOM file)
- 7.000000-NA-19078 (438 DICOM files)
- 9.000000-NA-19517 (1 DICOM file)
- 11.000000-NA-19520 (444 DICOM files)

So even with the json file, I still don't know which subfolder(s) img_19.nii.gz refers to (all the data across the six subfolders, or the data in just one of them?). Similar situations exist for the other datasets. My remaining questions are:

(2) How can I link the files in the original database with the files described in the json?
(3) Some subfolders contain multiple DICOM slices; are they simply concatenated in order and converted to one NIfTI file?
(4) Given the complexity of these details, would it be possible to release a script that converts the raw data into the data described in the json files?

Finally, thanks again for your excellent work and contributions to open source code.

Best wishes !
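While waiting for an official pre-processing script, one plausible heuristic (my assumption, not the authors' confirmed pipeline) is that the single-file series in these studies are scout/localizer images, so the volumetric scan is the sub-series with the most slices. The selected series could then be converted with a standard tool such as dcm2niix or SimpleITK. A minimal sketch of the selection step:

```python
from pathlib import Path

def pick_primary_series(study_dir):
    """Return the sub-series directory containing the most DICOM slices.

    Heuristic sketch (an assumption, not the authors' pipeline): series
    with a single .dcm file are usually scout/localizer images, so the
    largest series is the most plausible volumetric reconstruction.
    """
    series_dirs = [d for d in Path(study_dir).iterdir() if d.is_dir()]
    return max(series_dirs,
               key=lambda d: len(list(d.glob("*.dcm"))),
               default=None)
```

On the example above this would select 3.000000-NA-18592 (482 slices), but note it cannot disambiguate between the three large series (482/438/444 slices), which is exactly why a released script would help.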

ahatamiz commented 2 years ago

Hi @WYC-321,

I believe the best way to address your questions is to release the pre-processing pipeline. I have raised this with our team members, and the pre-processing code should be released very soon.

CC: @wyli

Best

WYC-321 commented 2 years ago

Thanks a lot to your team.

Jamshidhsp commented 2 years ago

@WYC-321 I have the same issue with the code. Did you manage to work it out?

JiaxinZhuang commented 1 year ago

I also downloaded the datasets and tried to follow the splits in the JSON files. However, for HNSCC as well as TCIAcolon, it is hard to convert the downloaded data into the required NIfTI files, because I can't find the corresponding mapping.

JakobDexl commented 1 year ago

@JiaxinZhuang @WYC-321 did you manage to figure it out? I'm also struggling with the naming relationship for the datasets (HNSCC and COLON).