ImagingDataCommons / dicomweb-client

Python client for DICOMweb RESTful services
https://dicomweb-client.readthedocs.io
MIT License
110 stars 38 forks source link

DICOMfileClient for working with locally stored DICOM Part10 files #56

Closed hackermd closed 2 years ago

hackermd commented 3 years ago

This PR adds the DICOMfileClient, a class that facilitates serverless query, retrieval, and storage of DICOM Data Sets stored locally in DICOM Part10 files. It exposes the same API as the DICOMwebClient.

from dicomweb_client.file import DICOMfileClient

client = DICOMfileClient(base_dir='/path/to/directory')

client.search_for_studies()

client.search_for_series(
    search_filters={'Modality': 'SM', 'PatientName': 'Foo'},
    fuzzymatching=True
)

client.retrieve_instance_frames(..., frame_numbers=[1, 2, 5, 8])

The DICOMfileClient recursively searches for DICOM files in base_dir, reads (a subset) of the metadata of the data sets stored in the files, and indexes the metadata in a SQLite database to enable efficient queries. In addition, it reads individual frames of multi-frame images efficiently by using a Basic Offset Table (BOT) and by only selectively loading requested frames into memory (critical for large SM images!). Last but not least, it can be pickeled and used with Python multiprocessing (for example with a PyTorch DataLoader).

Considerations

The DICOMfileClient depends on the NumPy and Pillow libraries. Therefore, this PR would introduce a dependency on C Python. I don't think this is a major issue and most users will likely have NumPy and Pillow installed anyways.

If this should turn out to be an issue, we could either

hackermd commented 3 years ago

@pieper @dclunie @fedorov this is the client I have been telling you about.

ntenenz commented 3 years ago

Thoughts on introducing either a shared abstract base class or protocol (via typing-extensions) to allow for better type hinting? Given they share an API, users may wish to inject an instance of DICOMwebClient or DICOMfileClient into a callable/class without worrying about which class is being used. An ABC/protocol would potentially save lots of instances of Union[DICOMwebClient, DICOMfileClient].

pieper commented 3 years ago

+1 for an abstract base class.

Could also make sense to use this to provide a python implementation of precomputed dicomweb results in files/buckets as discussed with @chafey.

hackermd commented 3 years ago

Thoughts on introducing either a shared abstract base class or protocol (via typing-extensions) to allow for better type hinting? Given they share an API, users may wish to inject an instance of DICOMwebClient or DICOMfileClient into a callable/class without worrying about which class is being used.

Yes, that's exactly the motivation for the file client.

Given your preference of using a shared base class or protocol, I assume you favor adding the file client to this package.

I see pros and cons for subtyping and structural subtyping (PEP 544). However, I recently got bitten a few times by inheritance in Python and would thus prefer typing.Protocol.

An ABC/protocol would potentially save lots of instances of Union[DICOMwebClient, DICOMfileClient].

What should the protocol be called? What about DICOMwebProtocol?

hackermd commented 3 years ago

Could also make sense to use this to provide a python implementation of precomputed dicomweb results in files/buckets as discussed with @chafey.

Absolutely! I have experimented with this already and it should be straight forward to implement a DICOMs3Client that would be pointed to a bucket.

The only part that's a bit tricky from a performance perspective is reading and indexing DICOM metadata. A DICOMs3Client would need to first download the objects from the bucket to get the values of metadata elements (StudyInstanceUID, SeriesInstanceUID, etc.) and to build the Basic Offset Table if it is not included in the Pixel Data element.

hackermd commented 3 years ago

@cpbridge @agharwal do you have any thoughts?

ntenenz commented 3 years ago

Thoughts on introducing either a shared abstract base class or protocol (via typing-extensions) to allow for better type hinting? Given they share an API, users may wish to inject an instance of DICOMwebClient or DICOMfileClient into a callable/class without worrying about which class is being used.

Yes, that's exactly the motivation for the file client.

Given your preference of using a shared base class or protocol, I assume you favor adding the file client to this package.

I am, though I do see the merit in making numpy/pillow opt-in, in particular pillow.

I see pros and cons for subtyping and structural subtyping (PEP 544). However, I recently got bitten a few times by inheritance in Python and would thus prefer typing.Protocol.

No preference on ABC vs Protocol.

An ABC/protocol would potentially save lots of instances of Union[DICOMwebClient, DICOMfileClient].

What should the protocol be called? What about DICOMwebProtocol?

DICOMStore or some variant?

ntenenz commented 3 years ago

Could also make sense to use this to provide a python implementation of precomputed dicomweb results in files/buckets as discussed with @chafey.

Absolutely! I have experimented with this already and it should be straight forward to implement a DICOMs3Client that would be pointed to a bucket.

To avoid creating a significant number of implementations, in particular because there may be interest from Azure and GCP, it may be easier to have a single version of this client and inject a "flavor" argument upon creation.

A logical next step, assuming the part 10 implementation is merged, may also be to consider creatong a hybrid interface which accepts two or more clients. Use cases could include query federation and caching of fetched studies.

pieper commented 3 years ago

The only part that's a bit tricky from a performance perspective is reading and indexing DICOM metadata

Can all this be pre-calculated? If the metadata is big perhaps the client can use range requests to implement overlapped requests for improved performance.

ntenenz commented 3 years ago

As an aside @hackermd, it may also be useful to start thinking about module/package organization if the number of implementations will continue to grow. I don't think it's vital now per se, but it may make things easier/cleaner in the future.

hackermd commented 3 years ago

To avoid creating a significant number of implementations, in particular because there may be interest from Azure and GCP, it may be easier to have a single version of this client and inject a "flavor" argument upon creation.

I have considered and explored that initially, but found it messy because the web and file services/protocols are quite different (HTTP versus POSIX) and require different configuration (authentication, authorization, etc.).

A logical next step, assuming the part 10 implementation is merged, may also be to consider creatong a hybrid interface which accepts two or more clients. Use cases could include query federation and caching of fetched studies

In our experience, once the client objects have been created, they behave similarly independent of their type (in the spirit of structural duct-typing).

So far, we have been annotating them as Union[DICOMwebClient, DICOMfileClient]. Moving forward, we could use DICOMwebProtocol. Wouldn't that be sufficient or do you think we would need a wrapper class?

ntenenz commented 3 years ago

I have considered and explored that initially, but found it messy because the web and file services/protocols are quite different (HTTP versus POSIX) and require different configuration (authentication, authorization, etc.).

To be clear, I was only referring to the pre-computed DICOMweb results in blobs that @pieper was referring to. My hunch is that there'd be a high degree of similarity between cloud vendors, though I certainly could be mistaken.

A logical next step, assuming the part 10 implementation is merged, may also be to consider creating a hybrid interface which accepts two or more clients. Use cases could include query federation and caching of fetched studies

In our experience, once the client objects have been created, they behave similarly independent of their type (in the spirit of structural duct-typing).

So far, we have been annotating them as Union[DICOMwebClient, DICOMfileClient]. Moving forward, we could use DICOMwebProtocol. Wouldn't that be sufficient or do you think we would need a wrapper class?

This would be more about ease of use and less about typing. Imagine I'd like to fetch a study, preferentially from a cache if I already have it on disk. It could be implemented as follows:

web_client = DICOMwebClient(...)
file_client = DICOMfileClient(...)

try:
    out = file_client.retrieve_study(...)
except FileNotFoundError:
    out = web_client.retrieve_study(...)

Doing this repeatedly could become quite verbose, whereas a thin wrapper could make this more ergonomic.

hackermd commented 3 years ago

The only part that's a bit tricky from a performance perspective is reading and indexing DICOM metadata

Can all this be pre-calculated? If the metadata is big perhaps the client can use range requests to implement overlapped requests for improved performance.

An earlier version of this client actually used JSON files (formatted according to the DICOM JSON model and structured according to the DICOMweb Query resource definitions), which were stored alongside the DICOM files (similar to DICOMDIR files).

This approach works fine for reading, but it doesn't work well for writing (store_instances()) in our experience, especially with concurrent writes.

For now, I prefer keeping the database, which contains the data pointers and indices, separate from the data and not assume the data to be organized in a specific way. The implementation of the database currently is fully abstracted via the DICOMfileClient Python API and implementation details are not exposed to the user.

The downside of this approach is that every user has to index the data. The upside is that we don't need to standardize the structure of the indices. That is something we could bring to the DICOM working groups, though.

hackermd commented 3 years ago

Doing this repeatedly could become quite verbose, whereas a thin wrapper could make this more ergonomic.

I like the idea. However, the exact data access logic may be highly application specific (e.g., try to fetch over web first or use a different client depending on the modality) and an abstraction layer may incur significant performance overhead. Therefore, I am not sure whether such wrappers should be provided by the library or rather implemented by applications.

CPBridge commented 3 years ago

In favour of including in this package. For me this seems like a case well suited to structural subtyping (protocols), unless parts of the implementation would also be shared in which case there may be a case for inheritance. However wouldn't use of typing.Protocol require Python >= 3.8 ?

No strong opinions on the cpython requirement, I can certainly see the case to avoid requiring it though I can't imagine too many uses cases where users would be using this package and never touching pixel data. Having numpy and pillow as dependencies may have other advantages too - such as when retrieving rendered frames.

ntenenz commented 3 years ago

Doing this repeatedly could become quite verbose, whereas a thin wrapper could make this more ergonomic.

I like the idea. However, the exact data access logic may be highly application specific (e.g., try to fetch over web first or use a different client depending on the modality) and an abstraction layer may incur significant performance overhead. Therefore, I am not sure whether such wrappers should be provided by the library or rather implemented by applications.

@hackermd , the way I was envisioning it, the user would provide an ordered sequence of clients, e.g.

file_client = DICOMfileClient(...)
web_client = DICOMwebClient(...)
web_client2 = DICOMwebClient(...)

# only go to the web if it's not present locally
caching_client = ComposableDICOMClient([file_client, web_client], sequential=True) 
# parallelize requests in a federated PACS setup
federated_client = ComposableDICOMClient([web_client, web_client2], sequential=False)
# and of course, they're composable themselves
caching_federated_client = ComposableDICOMClient([file_client, federated_client], sequential=True)

If we wanted to get especially crazy, we could add logic (or the ability for the user to insert logic) on how/when the client should write to the cache.

In favour of including in this package. For me this seems like a case well suited to structural subtyping (protocols), unless parts of the implementation would also be shared in which case there may be a case for inheritance. However wouldn't use of typing.Protocol require Python >= 3.8 ?

@CPBridge, typing-extensions backports Protocol to earlier versions of python.

hackermd commented 2 years ago

@ntenenz I've now implemented the DICOMwebProtocol class based on typing.Protocol and decorated the class with typing.runtime_checkable. This now allows

from dicomweb_client.api import DICOMwebClient, DICOMfileClient
from dicomweb_client.protocol import DICOMwebProtocol

web_client = DICOMwebClient(...)
assert isinstance(web_client, DICOMwebProtocol)

file_client = DICOMfileClient(...)
assert isinstance(file_client, DICOMwebProtocol)
hackermd commented 2 years ago

The following four methods of the DICOMfileClient class currently raise a NotImplementedError:

Access of bulkdata is a bit tricky. Currently, all elements other than Pixel Data, Float Pixel Data, and Double Float Pixel Data are included in metadata. We could create URLs (using a hash other another method), include the URL as BulkDataURI into the metadata, and allow the caller to get them via retrieve_bulkdata(). However, I am not sure that would be worth the effort. Will need to think about it. @pieper any thoughts?

Rendering series and instances will require more work and thought. It's currently not clear to me what the exact behavior should be. We can certainly support rendering single-frame image instances via retrieve_instance_rendered(), but I am not sure how a multi-frame image instance or a series of single-frame image instances would be rendered. @dclunie do you have some insight into that?

hackermd commented 2 years ago

+1 for an abstract base class.

Could also make sense to use this to provide a python implementation of precomputed dicomweb results in files/buckets as discussed with @chafey.

I created issue #57 to track that feature.

pieper commented 2 years ago

If I understand correctly, it seems to me that the BulkDataURI for the file client introduces a logical headache. It seems the standard would assume that you can access that data using web network transactions without the need of a client. But if your goal is to have the file client be a drop in replacement you would need to require that users of the client always resolve these URIs via the client. That's not a terrible requirement, but it's a slightly different paradigm.

ntenenz commented 2 years ago

The following four methods of the DICOMfileClient class currently raise a NotImplementedError:

  • retrieve_bulkdata()
  • iter_bulkdata()
  • retrieve_series_rendered()
  • retrieve_instance_rendered()

Access of bulkdata is a bit tricky. Currently, all elements other than Pixel Data, Float Pixel Data, and Double Float Pixel Data are included in metadata. We could create URLs (using a hash other another method), include the URL as BulkDataURI into the metadata, and allow the caller to get them via retrieve_bulkdata(). However, I am not sure that would be worth the effort. Will need to think about it. @pieper any thoughts?

Rendering series and instances will require more work and thought. It's currently not clear to me what the exact behavior should be. We can certainly support rendering single-frame image instances via retrieve_instance_rendered(), but I am not sure how a multi-frame image instance or a series of single-frame image instances would be rendered. @dclunie do you have some insight into that?

IMO, if a method will raise a NotImplementedError, the class doesn't truly implement the protocol. If user code accepts an object whose implementation adheres to the protocol and only leverages methods found within it, it would be a reasonable expectation for that code to successfully complete in the absence of other environmental/logic errors.

Perhaps consider narrowing the definition of the protocol and creating more expansive versions in the future?

Note: I'm replying from my phone, so I apologize if the file client indeed implements all the methods in the protocol.

hackermd commented 2 years ago

IMO, if a method will raise a NotImplementedError, the class doesn't truly implement the protocol. If user code accepts an object whose implementation adheres to the protocol and only leverages methods found within it, it would be a reasonable expectation for that code to successfully complete in the absence of other environmental/logic errors.

I agree with you and think we should implement the methods before merging.

However, I also think it's reasonable for the serverless implementation to raise exceptions that mimicstatus codes. For the retrieve_bulkdata(), we could always return the equivalent of a 404 Not Found.

We could provide an abstraction layer for HTTP errors (e.g., ClientError and ServerError) and make those (conceptually) part of the DICOMwebProtocol.

hackermd commented 2 years ago

Alternatively, the methods of the DICOMfileClient could raise requests.HTTPError.

ntenenz commented 2 years ago

IMO, if a method will raise a NotImplementedError, the class doesn't truly implement the protocol. If user code accepts an object whose implementation adheres to the protocol and only leverages methods found within it, it would be a reasonable expectation for that code to successfully complete in the absence of other environmental/logic errors.

I agree with you and think we should implement the methods before merging.

However, I also think it's reasonable for the serverless implementation to raise exceptions that mimicstatus codes. For the retrieve_bulkdata(), we could always return the equivalent of a 404 Not Found.

We could provide an abstraction layer for HTTP errors (e.g., ClientError and ServerError) and make those (conceptually) part of the DICOMwebProtocol.

It's fairly common to have hierarchies of Protocols which extend the functionality of each other. If one or more methods of an implementation will always raise an exception, regardless of input, it suggests (to me) that the class is likely implementing a more narrowly defined Protocol. Conversely, if you're suggesting that the exception will be raised for certain inputs based on the "server's" state, that to me feels very different.

hackermd commented 2 years ago

IMO, if a method will raise a NotImplementedError, the class doesn't truly implement the protocol. If user code accepts an object whose implementation adheres to the protocol and only leverages methods found within it, it would be a reasonable expectation for that code to successfully complete in the absence of other environmental/logic errors.

I agree with you and think we should implement the methods before merging. However, I also think it's reasonable for the serverless implementation to raise exceptions that mimicstatus codes. For the retrieve_bulkdata(), we could always return the equivalent of a 404 Not Found. We could provide an abstraction layer for HTTP errors (e.g., ClientError and ServerError) and make those (conceptually) part of the DICOMwebProtocol.

It's fairly common to have hierarchies of Protocols which extend the functionality of each other. If one or more methods of an implementation will always raise an exception, regardless of input, it suggests (to me) that the class is likely implementing a more narrowly defined Protocol. Conversely, if you're suggesting that the exception will be raised for certain inputs based on the "server's" state, that to me feels very different.

I have "implemented" the retrieve_bulkdata() method (7a8fef7cb96f04d81c0d046689b9695b95abdd6d), although it currently always raises an IOError stating that the requested resource was not found. That is the correct behavior, though, because the retrieve_instance_metadata() (and friends) include all bulkdata inline by value (using "InlineBinary") rather than by reference (using "BulkdataURI").

The retrieve_series_rendered() and retrieve_instance_rendered() are trickier. I have now implemented rendering of a single-frame image (7a8fef7cb96f04d81c0d046689b9695b95abdd6d). However, I have no idea what the behavior should even be for retrieval of a series of single-frame instances or a multi-frame instances would be. All the standard says in Part 18 Section 10.4.1.1.3 Rendered Resources is the following:

Rendered Resources (defined in Table 10.4.1-3) are used to retrieve representations of a DICOM Resource rendered as appropriate images, videos, text documents, or other representations. Its primary use case is to provide user agents with a simple means to display medical images and related documents, without requiring deep knowledge of DICOM data structures and encodings.

That's very broad and I imagine the expected behaviour would be dependent on the modality and the acceptable media types. Since the DICOMwebClient exposes those methods, I would like to keep them in the DICOMfileClient for completeness. They are currently only useful for a small subset of all potential requests, but we can further implement additional functionality moving forward. To be fair, most DICOMweb server implementations I've come across also just return a client or server error if something is not supported.