MIC-DKFZ / LIDC-IDRI-processing

Scripts for the preprocessing of LIDC-IDRI data
MIT License

DICOM Metadata retained #3

Closed by ivanwilliammd 5 years ago

ivanwilliammd commented 5 years ago

Hello Sir @MiGoetz, thank you for your marvelous work. Does this code support converting DICOM files to NRRD when there is no XML? I have many scattered DICOM files (in folders) without DICOM metadata files, but they are arranged automatically when I open them in "MicroDicom" or "Aliza". The original annotation given by the radiologist team in an Excel file contains the x, y, and z coordinates and the type of nodule (solid, subsolid, ground-glass).

Thanks in advance, Sir

ivanwilliammd commented 5 years ago

By the way, after trying to convert the first 10 CT scans of the LIDC dataset on Windows 10, I got a characteristics.csv output that is full of header rows. Maybe this is because there are many DICOM files I didn't download, so the script appends the header multiple times instead of data:

Patient_ID;Session_ID;Radiologist;Nodule_Str;subtlety;internalStructure;calcification;sphericity;margin;lobulation;spiculation;texture;malignancy

And then I got this error log:

Unspecific error in file : D:\LIDC-IDRI\XML\tcia-lidc-xml\185\069.xml
Unspecific error in file : D:\LIDC-IDRI\XML\tcia-lidc-xml\185\071.xml
Unspecific error in file : D:\LIDC-IDRI\XML\tcia-lidc-xml\185\072.xml
Failed due to wrong segmentations: 0002a_2_00000030
Failed due to wrong segmentations: 0007a_1_00000126

Here, 0002a_2_00000030 and 0007a_1_00000126 are still written to characteristics.csv, but the NRRD files aren't created. Do you have a trick for this problem? Thanks in advance, Sir

MiGoetz commented 5 years ago

> Hello Sir @MiGoetz, thank you for your marvelous work

Thank you. Good to hear that somebody is using it.

> Does this code support converting DICOM files to NRRD when there is no XML?

Yes, it is possible to use this code to load DICOM files and save them as NRRD. However, in order to do this you need to rewrite some parts of the program.
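As a rough illustration of the DICOM-to-NRRD step on its own, here is a minimal sketch. It does not use this repository's code; it assumes the SimpleITK Python package is installed, and the function names `convert_dicom_folder` and `nrrd_path_for`, as well as naming the output files by series UID, are made up for this example.

```python
from pathlib import Path


def nrrd_path_for(series_uid: str, out_dir: str) -> Path:
    """Build the output file name for one series (naming scheme is an assumption)."""
    return Path(out_dir) / f"{series_uid}.nrrd"


def convert_dicom_folder(dicom_dir: str, out_dir: str) -> list:
    """Write one NRRD file per DICOM series found directly in dicom_dir."""
    # SimpleITK is imported here so the helper above works without it installed.
    import SimpleITK as sitk

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    # GDCM groups the files in the folder by the series they belong to.
    for series_uid in sitk.ImageSeriesReader.GetGDCMSeriesIDs(dicom_dir):
        file_names = sitk.ImageSeriesReader.GetGDCMSeriesFileNames(dicom_dir, series_uid)
        reader = sitk.ImageSeriesReader()
        reader.SetFileNames(file_names)
        image = reader.Execute()
        target = nrrd_path_for(series_uid, out_dir)
        # The output format is chosen from the ".nrrd" file extension.
        sitk.WriteImage(image, str(target))
        written.append(target)
    return written
```

Calling `convert_dicom_folder("path/to/dicoms", "path/to/output")` would write one NRRD file per series; both paths are placeholders. For files spread over several folders, the function could be called once per folder.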

> I have many scattered DICOM files (in folders) without DICOM metadata files, but they are arranged automatically when I open them in "MicroDicom" or "Aliza"

That should be no problem. If the files can be read by common programs, they are most likely valid DICOM files. Those files contain some meta-information (for example, which files belong together), so it should be possible to read them.

> The original annotation given by the radiologist team in an Excel file contains the x, y, and z coordinates and the type of nodule (solid, subsolid, ground-glass).

That's a bit tricky. If I understand correctly, the annotation is a single point within the lesion you are looking at? In that case, what do you want as output? Currently, the script expects a contour. If you have a contour, it would again be possible to adapt the script so that it takes a different input, but that would require some programming on your part. (I think it should be straightforward once you see how I did it.)

MiGoetz commented 5 years ago

> By the way, after trying to convert the first 10 CT scans of the LIDC dataset on Windows 10, I got a characteristics.csv output that is full of header rows. Maybe this is because there are many DICOM files I didn't download, so the script appends the header multiple times instead of data.

Could you open a second issue and give me some more details on this behaviour? I hope I can look into this issue next week.

> And then I got this error log:
>
> Unspecific error in file : D:\LIDC-IDRI\XML\tcia-lidc-xml\185\069.xml
> Unspecific error in file : D:\LIDC-IDRI\XML\tcia-lidc-xml\185\071.xml
> Unspecific error in file : D:\LIDC-IDRI\XML\tcia-lidc-xml\185\072.xml
> Failed due to wrong segmentations: 0002a_2_00000030
> Failed due to wrong segmentations: 0007a_1_00000126
>
> Here, 0002a_2_00000030 and 0007a_1_00000126 are still written to characteristics.csv, but the NRRD files aren't created. Do you have a trick for this problem?

The errors you got are most likely due to errors in the XML files. I cannot fix them, as they are part of the LIDC dataset; however, this shouldn't affect the functionality of the script. In particular, "Failed due to wrong segmentations" indicates that the segmentations given in the XML are most likely corrupt, which is why no segmentation is written.
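To get an overview of which inputs failed, the log can be grouped by message type. The following is a small sketch, assuming the log has one message per line and uses exactly the two prefixes shown in this thread; the function name `summarise_error_log` is hypothetical.

```python
from collections import defaultdict

# The two message prefixes are copied from the error log quoted in this thread.
XML_ERROR = "Unspecific error in file : "
SEGMENTATION_ERROR = "Failed due to wrong segmentations: "


def summarise_error_log(lines):
    """Group log lines into failed XML files and nodules with bad segmentations."""
    summary = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line.startswith(XML_ERROR):
            summary["xml"].append(line[len(XML_ERROR):])
        elif line.startswith(SEGMENTATION_ERROR):
            summary["segmentation"].append(line[len(SEGMENTATION_ERROR):])
        # Any other line (including blank lines) is ignored.
    return dict(summary)
```

Passing an open log file, for example `summarise_error_log(open("path/to/error_log.txt"))` with a placeholder path, would return a dictionary with up to two lists, one per message type.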

The script writes the characteristics for ALL lesions defined in the XML file into the CSV file, even if the corresponding lesion has an invalid annotation or if the lesion is too small and was therefore not contoured by the annotator. So it is perfectly fine to have more entries in the file than actual segmentations.
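One way to see which CSV entries have no segmentation is to compare the `Nodule_Str` column with the nodules that actually produced output. This sketch assumes the semicolon-separated header shown earlier in this thread; `rows_without_segmentation` and the `segmented_ids` set (which you would have to build from your own output folder) are hypothetical names.

```python
import csv


def rows_without_segmentation(csv_path, segmented_ids):
    """Return the Nodule_Str values in the CSV that are not in segmented_ids."""
    missing = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter=";"):
            nodule = row["Nodule_Str"]
            # Skip header rows that were appended again in the middle of the file.
            if nodule == "Nodule_Str":
                continue
            if nodule not in segmented_ids:
                missing.append(nodule)
    return missing
```

Entries returned by this function are not necessarily errors: as described above, they can simply be lesions without a usable contour.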

Do you get any output image files?

ivanwilliammd commented 5 years ago

Thank you, Sir Michael, for your answer. My private dataset actually contains only one characteristic, the texture (solid, subsolid, ground-glass).

> The errors you got are most likely due to errors in the XML files. I cannot fix them, as they are part of the LIDC dataset; however, this shouldn't affect the functionality of the script. In particular, "Failed due to wrong segmentations" indicates that the segmentations given in the XML are most likely corrupt, which is why no segmentation is written.
>
> The script writes the characteristics for ALL lesions defined in the XML file into the CSV file, even if the corresponding lesion has an invalid annotation or if the lesion is too small and was therefore not contoured by the annotator. So it is perfectly fine to have more entries in the file than actual segmentations.
>
> Do you get any output image files?

Regarding the error, I checked the output NIfTI images: for some of them both nrrd and nii.gz files were generated, but for others no nii.gz file was generated. Is that OK?

ivanwilliammd commented 5 years ago

By the way, this is another thread I posted, about how much time the LIDC-IDRI conversion needs. In my case it took around 3x24 hours. Is that normal, Sir? Can the preprocessing use CUDA cores instead of the CPU?

Link to thread --> https://github.com/MIC-DKFZ/LIDC-IDRI-processing/issues/4

ivanwilliammd commented 5 years ago

> By the way, after trying to convert the first 10 CT scans of the LIDC dataset on Windows 10, I got a characteristics.csv output that is full of header rows. Maybe this is because there are many DICOM files I didn't download, so the script appends the header multiple times instead of data.
>
> Could you open a second issue and give me some more details on this behaviour? I hope I can look into this issue next week.

Regarding the incompatible header names and the repeated header rows when the output is processed by the medicaldetection toolkit code, I solved it by writing the header once, before writing the data, like this:

import glob
import os

# Create the output folder first, then write the header exactly once.
os.makedirs(os.path.dirname(path_to_characteristics), exist_ok=True)
with open(path_to_characteristics, "a") as file:
    file.write(";".join(["PatientID", "SessionID", "Radiologist", "NoduleID", "Subtlety", "InternalStructure", "Calcification", "Sphericity", "Margin", "Lobulation", "Spiculation", "Texture", "Malignancy"]) + "\n")

nodule_id = 0
for xml_file in glob.glob(os.path.join(path_to_xmls, "*", "*.xml")):
    # No header is written inside the loop, so it cannot repeat per XML file.
    # parse_xml_file is assumed to append its own rows to the CSV file.
    print(xml_file)
    try:
        parse_xml_file(xml_file)
    except Exception:
        write_error("Unspecific error in file : " + xml_file)

MiGoetz commented 5 years ago

That's a good solution, I will transfer it to the repository, thank you.
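For a characteristics.csv that was already written with repeated header rows, a cleanup along these lines might help. It is only a sketch: it assumes the first line of the file is the header and that every later line identical to it is an unwanted repeat; the function names are invented for this example.

```python
from pathlib import Path


def drop_repeated_headers(lines):
    """Keep the first line as the header and drop later identical copies."""
    if not lines:
        return []
    header = lines[0]
    return [header] + [line for line in lines[1:] if line != header]


def clean_characteristics(csv_path):
    """Rewrite the CSV file in place without the repeated header rows."""
    path = Path(csv_path)
    cleaned = drop_repeated_headers(path.read_text().splitlines())
    path.write_text("".join(line + "\n" for line in cleaned))
```

Since `clean_characteristics` overwrites the file, it would be sensible to try it on a copy first.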

ivanwilliammd commented 5 years ago

> That's a good solution, I will transfer it to the repository, thank you.

You're welcome, Sir Michael