Complexity and inefficiency of vtkDICOMDirectory::SortFiles()

dgobbi commented 5 years ago

The vtkDICOMDirectory class is responsible for sorting DICOM files according to Patient, Study, Series and Instance. In addition to sorting, it also checks for duplicate InstanceUIDs.

The sorting and checking are done inefficiently: the former is O(n^2) and the latter is O(n). The reason for the inefficiency is that the series and instances are sorted by SeriesNumber and InstanceNumber, but they are identified by SeriesInstanceUID and SOPInstanceUID. Since the series are not sorted by SeriesInstanceUID, a linear search is required in order to find out which series a new file belongs to. Likewise, a linear search is required to check for duplicate SOPInstanceUIDs.

My reason for sorting by "Number" rather than "UID" is simple: the user wants to see things sorted by "Number". Ordering by UID is meaningless to the user.

If I sorted by UID first, and then re-sorted by Number prior to presenting the results to the user, then I would be doing two efficient sorts instead of one inefficient sort. The downside is that I could not present incremental results to the user.

The solution would be to add a parallel set of arrays sorted by UID. So I would have the files sorted by InstanceNumber, and have another array that contained references to items in the first array that was sorted by SOPInstanceUID. Ditto for SeriesInstanceUID. Sorting could then be O(n log n) and checking would be O(log n).

dgobbi commented 5 years ago

Furthermore, the files will often already be in the correct order, since each series is usually kept in its own directory, and the InstanceNumber is usually part of the filename. So I could get the efficiency to be close to O(1) by using these two assumptions:

1) the current file most likely belongs to the same series as the last file.

2) the current file most likely follows the last file in InstanceNumber ordering.

Edit: Speed improvement very minor, is not worth the extra checks that are needed.

dgobbi commented 5 years ago

This comment has to do with semantics rather than efficiency. Grouping of files is strictly hierarchical: Patient, Study, Series, and Instance. This means that if one were to take a file from one study and change the PatientID while keeping the StudyInstanceUID, then vtkDICOMDirectory would be forced to place that file in a different Study than other files with the same StudyInstanceUID.

I wonder if this behavior is, in fact, correct. Should the file stay with its study? Should I group by StudyInstanceUID first, and then by the PatientID that is prevalent in that study?

This situation will only occur if either the PatientID or the StudyInstanceUID is incorrect, but reasonable behavior for incorrect data is a good feature for any software to have.

Edit: the old behavior was kept.

dgobbi / vtk-dicom

Complexity and inefficiency of vtkDICOMDirectory::SortFiles() #175