Support storage of binary data outside the relational database

MFAshby commented 3 years ago

Is your feature request related to a problem? Please describe.

I want to store lots of DocumentReference FHIR resources with corresponding Attachments which might be large PDFs.

I don't want to store large binary attachments inside my relational database, as it's not optimized for such a use case. Instead, I would like to use a dedicated object storage system (e.g. AWS S3 or Google Cloud Storage) for such data.

Describe the solution you'd like

I would like some pluggable configuration for large binary resource storage, separate from the relational database, with implementations for popular cloud infrastructure providers.

Describe alternatives you've considered

Some database backends like PostgreSQL support choosing particular storage devices for particular tables (tablespaces), so it might be possible to provision a different class of storage for such large binary objects, separate from the rest of the relational data. PostgreSQL also supports large object storage (BLOB).

In some circumstances it is not possible for the application developer to provision such tablespaces, e.g. when using managed databases or when the database administrators do not permit such configuration.

Acceptance Criteria

GIVEN an IBM FHIR server configured to store binary data in blob storage
WHEN a large attachment resource is received
THEN the binary data is stored in the blob storage
AND the attachment can be read back from the FHIR server with a GET request

Additional context

The FHIR Attachment data type provides two alternative mechanisms for transferring the binary data. The data can either be encoded inline in the data field or stored externally and referenced via the url field. Both present some challenges:

- Large attachments transmitted via the data field may take some time to serialize, deserialize, and transfer over a network connection. This can be undesirable when processing a large data feed.
- Attachments transmitted via the url field must be fetched separately via some non-FHIR mechanism, which must be authorized separately. One possible mechanism is the use of 'signed' URLs to allow time-limited access by the recipient of the URL.

Some alternative FHIR server implementations, e.g. Aidbox, already support this kind of mechanism.
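To make the two options concrete, here is a rough illustration of the same attachment expressed both ways. The payload and the storage URL are placeholders made up for this example; only the field names (contentType, data, url, size) come from the FHIR Attachment definition.

```python
import base64
import json

# Hypothetical stand-in for PDF bytes; a real document would be much larger.
pdf_bytes = b"%PDF-1.7 example payload" * 1000

# Option 1: inline. The bytes travel base64-encoded in the data field.
inline_attachment = {
    "contentType": "application/pdf",
    "data": base64.b64encode(pdf_bytes).decode("ascii"),
}

# Option 2: by reference. Only a URL travels; the bytes are fetched separately.
# The URL below is a placeholder, not a real endpoint.
linked_attachment = {
    "contentType": "application/pdf",
    "url": "https://storage.example.com/attachments/report.pdf",
    "size": len(pdf_bytes),
}

inline_size = len(json.dumps(inline_attachment))
linked_size = len(json.dumps(linked_attachment))
print(f"inline JSON: {inline_size} bytes, linked JSON: {linked_size} bytes")
```

The inline form is always larger than the raw bytes because base64 encodes every 3 bytes as 4 characters, while the linked form stays small regardless of document size.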

lmsurpre commented 3 years ago

Hi @MFAshby and thanks for the enhancement request. For this use case I recommend using Attachments that link to the data through the url field. But your point about that being difficult to authorize is a good one.

@punktilious This one might be good for us to consider when we go to tackle #1869

MFAshby commented 3 years ago

@lmsurpre thanks for the speedy response!

The scenario I have in mind is federating data from various upstream providers (who are sending FHIR resources) and then providing that data to our end users.

It's unlikely that upstream providers feeding into our system will be able to make attachment URLs visible to our own users, and I think it makes sense for us to be able to ingest that data into our own storage (in case the upstream can't keep it available forever).

MFAshby commented 3 years ago

Might it be possible to implement this using FHIRPersistenceInterceptor?

This could:

1. intercept the incoming event to create the attachment
2. create a blob in the relevant cloud storage
   - if the inbound attachment has a data element, simply deserialize and persist to the blob.
   - else if the inbound attachment has a url element, download the data from the provided URL and persist to the blob. This could be asynchronous / deferred if necessary for improved performance.
3. replace the original data or url element with a url for the blob.
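The steps above can be sketched roughly as follows. This is not the IBM FHIR interceptor API: InMemoryBlobStore, offload_attachment, and the fetch callback are hypothetical names invented for illustration, and the attachment is modelled as a plain dict rather than a FHIR model object.

```python
import base64
import uuid
from typing import Callable, Dict, Optional


class InMemoryBlobStore:
    """Hypothetical stand-in for S3, Google Cloud Storage, or similar."""

    def __init__(self, base_url: str = "https://blobs.example.com") -> None:
        self.base_url = base_url
        self.blobs: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store the bytes under a random key and return a plain URL for them."""
        key = uuid.uuid4().hex
        self.blobs[key] = content
        return f"{self.base_url}/{key}"


def offload_attachment(
    attachment: dict,
    store: InMemoryBlobStore,
    fetch: Optional[Callable[[str], bytes]] = None,
) -> dict:
    """Return a copy of the attachment whose bytes live in the blob store."""
    if "data" in attachment:
        # Inline case: decode the base64 payload.
        content = base64.b64decode(attachment["data"])
    elif "url" in attachment:
        # Linked case: download from the upstream URL (could be deferred).
        if fetch is None:
            raise ValueError("a fetch function is required for url attachments")
        content = fetch(attachment["url"])
    else:
        return dict(attachment)  # nothing to offload

    # Replace the original data or url element with a url for the blob.
    result = {k: v for k, v in attachment.items() if k not in ("data", "url")}
    result["url"] = store.put(content)
    result["size"] = len(content)
    return result
```

In a real interceptor this logic would run on the way in, before the resource is persisted, and the download step would need its own error handling and authorization.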

On the outbound side, it would be prudent to replace the plain url for the blob with a time-limited signed URL, but I can't see an appropriate pluggable interception point for this kind of transformation at the moment.

lmsurpre commented 3 years ago

I like this line of thinking. Interceptors definitely ARE able to modify the resources on the way in (via the beforeX methods in the interceptor), which I think was always the design but wasn't working right in our R4 server implementation until https://github.com/IBM/FHIR/issues/2369.

I believe that interceptors are also able to modify resources on the way out, so I think it should be possible to replace the URLs on the way out (using the afterX methods in the same interceptor) with temporary/presigned ones. Please let us know if you give this a try and it isn't working as expected.
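As a rough idea of what a temporary URL involves, here is a generic HMAC-based sketch. It is not the scheme used by any particular cloud provider or by our bulk data code; SECRET_KEY, sign_url, verify_url, and the expires/sig query parameters are all made up for illustration, and it assumes the plain blob URL has no query string of its own.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical shared secret; a real deployment would load this from config.
SECRET_KEY = b"demo-secret"


def sign_url(url: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Append an expiry timestamp and an HMAC signature to a plain URL."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    message = f"{url}|{expires}".encode("utf-8")
    signature = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"{url}?{urlencode({'expires': expires, 'sig': signature})}"


def verify_url(signed_url: str, now: Optional[float] = None) -> bool:
    """Check that the signature matches and the URL has not expired."""
    parts = urlsplit(signed_url)
    query = parse_qs(parts.query)
    try:
        expires = int(query["expires"][0])
        signature = query["sig"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    message = f"{base_url}|{expires}".encode("utf-8")
    expected = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    current = time.time() if now is None else now
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(signature, expected) and current < expires
```

An afterX method could call something like sign_url on each blob url before the resource is returned, so the stored resource keeps the plain url and only the response carries the time-limited one.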

Finally, as an aside, we do have some config+code for generating presigned temporary URLs in our bulk data export feature, for example in https://github.com/IBM/FHIR/blob/main/operation/fhir-operation-bulkdata/src/main/java/com/ibm/fhir/operation/bulkdata/model/url/DownloadUrl.java#L108. It might be possible to refactor some of that to be common and/or just use it as an example.

punktilious commented 3 years ago

Some of this has been prototyped already in one of my development branches, but that was to offload the storage of our entire data blob, whereas the requirements expressed above are a little more nuanced. It's certainly possible, but the question is what can be done within the constraints of the FHIR spec (and how to handle scenarios such as bulk export etc).

lmsurpre commented 2 years ago

potentially related: #2799

LinuxForHealth / FHIR

Support storage of binary data outside the relational database #2679