TesterTi / LIDCToolbox

LIDC Matlab Toolbox
GNU General Public License v2.0

output file issue #10

Closed monjoybme closed 5 years ago

monjoybme commented 5 years ago

Hi, I'm generating the LIDC labeled dataset using your toolbox. When I compare my results with your sample output for "LIDC-IDRI-0002", "LIDC-IDRI-0004", and "LIDC-IDRI-0005", I see that your output has only two files, i.e., GT_id1.tif and GT_id2.tif. In my case there are four: GT_id1.tif, GT_id2.tif, GT_id3.tif, and GT_id4.tif. Most of them are blank. Even the files in the image and mask folders are fully black images. Moreover, the filenames in the "slice_correspondences.txt" file are different. Questions:

  1. Could you tell me how I can be sure that this code is generating the actual ground truth?
  2. "Each of these folders contains the files GT_id1.tif ... GT_isY.tif where Y is the number of readers found for that particular scan (each file is a binary image where ones denote the markers GT)." Does "readers" refer to the experts/radiologists? If so, which results should we consider for training and testing?
TesterTi commented 5 years ago

Hi Monjoy,

RE the blank images: Are you sure that they are blank or is it just that the image viewer thinks they are blank (because they are binary, for example)? Did you read them into Matlab and check that the matrices contain non-zero values? Some ground truths can be blank, i.e. if 4 annotators have annotated a scan but in one slice only one annotator marks something, the toolbox will output 3 blank annotations and 1 with the reader's annotation (i.e. 3 annotators didn't find anything to annotate). The images and masks shouldn't be blank though.
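A minimal sketch of that check, written in Python rather than Matlab. It assumes the TIFF has already been loaded as rows of integer pixel values; the loading step is left out because it depends on your image library. The function names are hypothetical and not part of the toolbox.

```python
def mask_summary(mask):
    """Return (nonzero_count, max_value) for a 2-D mask given as rows of ints."""
    values = [v for row in mask for v in row]
    nonzero = sum(1 for v in values if v != 0)
    return nonzero, max(values, default=0)


def stretch_for_display(mask):
    """Map every non-zero pixel to 255 so an 8-bit viewer shows it as white."""
    return [[255 if v else 0 for v in row] for row in mask]


# Hypothetical 3x4 mask whose two marked pixels are stored as 1, not 255.
example = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

print(mask_summary(example))  # (2, 1)
```

A maximum value of 1 with a non-zero count above 0 would suggest that the file is a valid binary mask that only looks black in a viewer expecting the 0 to 255 range.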

For LIDC-IDRI-0004 and LIDC-IDRI-0005 I also have 4 reader outputs (see the sample output folders). For LIDC-IDRI-0002 I have two outputs because the first two readers only annotate <3mm nodules, which are ignored by the toolbox. Therefore you can assume that there are an additional two annotations but they are all blank (for the purposes of >3mm nodules). I am surprised that you have 4 GT files for LIDC-IDRI-0002 but if you do, those for two readers should all be blank.

RE your other questions:

  1. You would have to look through the xml file and compare to the image.

  2. Yes, the readers are the experts/radiologists, so Y is the number of radiologists that annotated that slice of the scan. Train/test splits are highly dependent upon what you want to achieve. Amongst other options, you can consider each annotation as a separate sample or combine them to form one annotation. The latter has some implications that you should be aware of, though. I did some research on these, which you can find in my article: https://ieeexplore.ieee.org/document/7437472/. The PDF is available on my website.
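The "combine them" option can be sketched as a per-pixel vote across the reader masks. This is only an illustration in Python, assuming each GT file has been loaded as rows of 0/1 integers. The function `combine_masks` and the `min_votes` parameter are hypothetical and not part of the toolbox, and which threshold is appropriate remains a research decision.

```python
def combine_masks(masks, min_votes=1):
    """Return a mask that is 1 where at least `min_votes` readers marked the pixel."""
    if not masks:
        raise ValueError("need at least one mask")
    rows, cols = len(masks[0]), len(masks[0][0])
    for m in masks:
        if len(m) != rows or any(len(row) != cols for row in m):
            raise ValueError("all masks must have the same shape")
    return [
        [1 if sum(1 for m in masks if m[r][c]) >= min_votes else 0
         for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical slice with four readers, two of whom marked nothing.
blank = [[0, 0, 0], [0, 0, 0]]
reader3 = [[0, 1, 1], [0, 1, 0]]
reader4 = [[0, 1, 0], [0, 1, 1]]
readers = [blank, blank, reader3, reader4]

union = combine_masks(readers, min_votes=1)      # any reader marked the pixel
agreement = combine_masks(readers, min_votes=2)  # at least two readers agree
unanimous = combine_masks(readers, min_votes=4)  # all four readers agree
```

With two blank readers, the unanimous mask is empty, which is one example of how the choice of combination rule changes the resulting ground truth.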

monjoybme commented 5 years ago

LIDC-IDRI-0002_annotation.zip Please find attached the output generated by running your code, and compare its "slice_correspondences.txt" file with yours. I noticed that the DICOM filenames are different and that the number of GT_idY.tif files is also different. My points of confusion are as follows. Consider the folder "grts" and its subfolder "slice1". It contains four files, of which GT_id3.tif and GT_id4.tif have some labels. As you already mentioned, these are the annotations of four different experts. [1] How can we decide which .tif image to use? [2] The annotated regions are only nodule regions. Am I right? Is there any chance that they include regions other than nodules?

TesterTi commented 5 years ago

I see that all of the GT_id1.tif and GT_id2.tif files in your output are empty. As mentioned before, this is because these two readers only annotated < 3mm nodules. I think that a previous version of the toolbox just ignored such readers, but I later changed it to output blank images, as these readers essentially said that nothing > 3mm exists. The sample output must have been calculated before I made this change.

The DICOM filenames in your output have been incremented by 1, e.g. 000125.dcm in your output is 000124.dcm in my output; the rest of the details are the same. Again, this must be something I have changed since first calculating the sample output. I'll look into that when I get the chance and recalculate the sample output to reflect these changes and avoid confusion in the future.
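Until the sample output is recalculated, the two sets of filenames can be compared by undoing the offset. The sketch below is in Python and assumes the filenames are zero-padded integers with a `.dcm` extension, as in the example above. It does not parse "slice_correspondences.txt", since the layout of that file is not shown here, and both function names are hypothetical.

```python
import os


def shift_dicom_name(name, offset):
    """Shift the numeric part of a zero-padded DICOM filename by `offset`."""
    stem, ext = os.path.splitext(name)
    number = int(stem) + offset
    if number < 0:
        raise ValueError("shifted slice number would be negative")
    # Keep the original zero-padding width, e.g. 6 digits for 000125.
    return f"{number:0{len(stem)}d}{ext}"


def names_match(new_name, sample_name, offset=1):
    """True if `new_name` equals `sample_name` once the offset is removed."""
    return shift_dicom_name(new_name, -offset) == sample_name


print(shift_dicom_name("000125.dcm", -1))  # 000124.dcm
```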

Regarding question [1], I'm not sure what you mean by deciding which TIFF images you should consider. That is really up to your research goals and the problem you want to solve.

Yes, the annotated regions are only nodules. A description of the annotations of the LIDC dataset can be found here: https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI. I am not involved in creating this dataset, and my toolbox is only intended for extracting nodules >= 3mm, but it could easily be extended to extract the non-nodule marks if that is what is needed.

monjoybme commented 5 years ago

Thanks a lot