aws-samples / aws-healthimaging-samples

Sample projects on working with AWS HealthImaging, an AWS service that allows you to store, analyze, and share medical images in the cloud at petabyte scale.
https://aws.amazon.com/healthimaging
MIT No Attribution

Duplicate image sets !!! #55

Closed anandhur-karnavar closed 10 months ago

anandhur-karnavar commented 11 months ago

Hi,

I am new to the HealthImaging service and am trying it out for a personal project. I was importing some sample DICOM files from my S3 bucket into HealthImaging. The import job, which I started from the console, completed successfully. The issue is that when I run another import job over the same files in S3, it creates the same image set again as a duplicate.

I wondered why the service imports the same image set that I had just imported. The duplicate even has exactly the same "version". I was expecting some duplicate prevention (de-dup) in the service. As far as I understand from the official docs (please correct me if I am wrong), an import job only lets you specify the source S3 bucket and prefix, not individual DICOM files. So if I want to import a new file from the same S3 prefix, I have to either remove the old files from that prefix first, or process all of them together (old + new) and end up with duplicate image sets in my data store.

Please help me figure out whether this is a feature or an issue. Thank you.

HealthImaging Service Issue
awsjpleger commented 10 months ago

Hi anandhur-karnavar, what you observed is the normal behavior of the HealthImaging service. HealthImaging groups the DICOM instances of the same DICOM series into one ImageSet, but the grouping only happens within the scope of a single import job. Importing the same DICOM data a second time duplicates the data and creates another ImageSet for each re-imported DICOM series. We recommend that you create a dedicated S3 location for preparing import jobs, copy the data to import into it, and then clear that S3 location after the import. As for the duplicated data, you can manage it via the modification API calls. For instance, you could decide to delete these ImageSets.
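The staging workflow described above could look roughly like the following. This is a minimal sketch, not an official sample. It assumes boto3-style clients are passed in (`s3` for S3 and `ahi` for HealthImaging), and the bucket, prefixes, role ARN, and helper names are all hypothetical. Objects are staged by base file name, so two source keys with the same file name would overwrite each other.

```python
import time
import uuid


def stage_objects(s3, bucket, source_keys, staging_prefix):
    """Copy only the new objects under a dedicated staging prefix."""
    staged = []
    for key in source_keys:
        # Assumption: base file names are unique within one import batch.
        dest = staging_prefix.rstrip("/") + "/" + key.split("/")[-1]
        s3.copy_object(Bucket=bucket, Key=dest,
                       CopySource={"Bucket": bucket, "Key": key})
        staged.append(dest)
    return staged


def start_import(ahi, datastore_id, role_arn, bucket, staging_prefix, output_prefix):
    """Start one import job that reads only from the staging prefix."""
    resp = ahi.start_dicom_import_job(
        datastoreId=datastore_id,
        dataAccessRoleArn=role_arn,
        clientToken=str(uuid.uuid4()),
        inputS3Uri=f"s3://{bucket}/{staging_prefix.rstrip('/')}/",
        outputS3Uri=f"s3://{bucket}/{output_prefix.rstrip('/')}/",
    )
    return resp["jobId"]


def wait_for_import(ahi, datastore_id, job_id, poll_seconds=30):
    """Poll the job until it reaches a terminal status and return it."""
    while True:
        job = ahi.get_dicom_import_job(datastoreId=datastore_id, jobId=job_id)
        status = job["jobProperties"]["jobStatus"]
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)


def clear_staging(s3, bucket, staged_keys):
    """Remove the staged copies so the next job does not re-import them."""
    # delete_objects accepts at most 1000 keys per request.
    for i in range(0, len(staged_keys), 1000):
        chunk = staged_keys[i:i + 1000]
        s3.delete_objects(Bucket=bucket,
                          Delete={"Objects": [{"Key": k} for k in chunk]})
```

With real clients created via `boto3.client("s3")` and `boto3.client("medical-imaging")`, the order would be `stage_objects`, `start_import`, `wait_for_import`, and then `clear_staging` only once the job has finished, so that the staging prefix is empty before the next batch.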

I hope it helps.
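For cleaning up duplicates that already exist, one possible approach is sketched below. It assumes a boto3-style HealthImaging client `ahi` and that each search result's `DICOMTags` carries `DICOMStudyInstanceUID` and `DICOMSeriesInstanceUID`. The helper names are hypothetical. It treats every image set after the oldest one for a given series as a duplicate, which only holds when the same files were imported more than once, so please review the list before deleting anything.

```python
from collections import defaultdict


def list_image_sets(ahi, datastore_id, study_uid):
    """Return all image set summaries for one study, following nextToken."""
    summaries, token = [], None
    while True:
        kwargs = {
            "datastoreId": datastore_id,
            "searchCriteria": {"filters": [{
                "operator": "EQUAL",
                "values": [{"DICOMStudyInstanceUID": study_uid}],
            }]},
        }
        if token:
            kwargs["nextToken"] = token
        page = ahi.search_image_sets(**kwargs)
        summaries.extend(page.get("imageSetsMetadataSummaries", []))
        token = page.get("nextToken")
        if not token:
            return summaries


def find_duplicates(summaries):
    """Group by (study, series) and return the IDs of all but the oldest set."""
    groups = defaultdict(list)
    for summary in summaries:
        tags = summary.get("DICOMTags", {})
        key = (tags.get("DICOMStudyInstanceUID"),
               tags.get("DICOMSeriesInstanceUID"))
        groups[key].append(summary)
    duplicates = []
    for sets in groups.values():
        sets.sort(key=lambda s: s["createdAt"])
        duplicates.extend(s["imageSetId"] for s in sets[1:])
    return duplicates


def delete_image_sets(ahi, datastore_id, image_set_ids, dry_run=True):
    """Delete the given image sets; with dry_run=True only print them."""
    for image_set_id in image_set_ids:
        if dry_run:
            print(f"would delete {image_set_id}")
        else:
            ahi.delete_image_set(datastoreId=datastore_id,
                                 imageSetId=image_set_id)
```

The dry-run default is deliberate: deleting an image set is not something to do on a guess, so the list can be checked first and the call repeated with `dry_run=False`.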