DigitalSlideArchive / digital_slide_archive

The official deployment of the Digital Slide Archive and HistomicsTK.
https://digitalslidearchive.github.io
Apache License 2.0

Feature request: Storing WSI references in database instead of images #254

Closed andreped closed 1 year ago

andreped commented 1 year ago

I have a local database with about 10 TB of WSIs. It is not practical to have duplicates of these inside a separate MongoDB database which DSA uses. The original database is also used for other stuff than DSA, and thus, relying fully on MongoDB only is not a viable option.

I had the impression that DSA supported two ways of storing data:

  1. Upload WSI and store it directly in the MongoDB database (stores a copy of the image)
  2. Upload WSI directly from filesystem where the server is launched (only stores a reference/path to the image)

For option 2 I assumed, perhaps naively, that it only stored the reference to the WSI inside the database and not the entire WSI. However, I just noticed that the collection I was working with had stored ~22 GB of data when the original test cohort was only ~16.3 GB. So it looks like it stores the full images (in addition to annotations and other stuff), which would mean my understanding is wrong.

If you do not support storing references in DSA, wouldn't that be a great idea? This is what QuPath does, and it enables the user to make "infinite" projects without running out of storage. I understand that it is convenient to have the images in the database itself, as the DSA solution would be more stable and less prone to external forces, such as the user suddenly moving the images on disk. The reference would then no longer be valid, resulting in corrupted IDs in the database and all that fun stuff.

For exactly this issue, QuPath has added a feature to enable users to fix broken image paths by enabling them to set the new path manually or asking for a directory of where the WSI should be where QuPath would then run a search to find the missing image. I find this extremely useful in a variety of situations and it is honestly the only practical solution for me.

What are your thoughts on this idea? I don't mean to replace storing images, but rather to offer a less storage-hungry alternative.

dgutman commented 1 year ago

So big big point.. We can either IMPORT a file into our local assetstore, OR what I do 99% of the time, is we create indexes/references to files that already exist on a NAS somewhere. In that case, you have to make sure the docker container has been configured so the volume is bind-mounted into the container.

I often have a docker-compose.override.yml file (similar to below) that will mount the data directories:

    version: '3'
    services:
      girder:
        volumes:
          # For local development, uncomment the set of mounts associated with
          # the local source files.  Adding the editable egg directories first
          # allows mounting source files from the host without breaking the
          # internal data.
          - /home/dagutman/devel/testData:/testData
          - /myLocalNAS:/myLocalNas
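
One practical consequence of bind mounts is that a file has a different path inside the container than on the host, and anything running inside the girder container only sees the container-side path. A minimal Python sketch of that translation, using the two mounts listed above; the function name `container_path` and the example slide path are hypothetical, not part of DSA:

```python
from pathlib import PurePosixPath

# Bind mounts as "host:container" pairs, copied from the override file.
MOUNTS = [
    "/home/dagutman/devel/testData:/testData",
    "/myLocalNAS:/myLocalNas",
]


def container_path(host_path, mounts=MOUNTS):
    """Return the path under which host_path appears inside the container.

    Raises ValueError if no bind mount covers host_path.
    """
    host = PurePosixPath(host_path)
    for spec in mounts:
        # Ignore an optional third field such as ":ro".
        host_root, container_root = spec.split(":")[:2]
        try:
            relative = host.relative_to(host_root)
        except ValueError:
            continue  # this mount does not cover the path
        return str(PurePosixPath(container_root) / relative)
    raise ValueError(f"{host_path} is not under any bind mount")


# Hypothetical slide on the NAS, as seen from the host:
print(container_path("/myLocalNAS/slides/a.svs"))
```

Note the case difference between `/myLocalNAS` (host) and `/myLocalNas` (container) in the mounts above; the container-side spelling is the one to use when pointing DSA at the data.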

So then under the existing admin console --> Assetstores, there's a magical IMPORT DATA button (in green). Once your file system / NAS / whatever is "visible" within the main girder docker container, you can then have it walk that file path and it will import the data into the filesystem. Again, in this mode you are INDEXING the files and grabbing metadata from the file header, but are NOT copying 10TB of files into a mongo database. As you mentioned, that would be a total disaster.
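
The same import can probably be scripted instead of clicked. The sketch below is a hedged Python example, not an official recipe: it assumes Girder exposes a `POST assetstore/{id}/import` endpoint taking `importPath`, `destinationId`, and `destinationType`, so please check the parameter names against the Swagger page of your own instance before relying on it. The helper name `import_by_reference` and all IDs are hypothetical.

```python
def import_by_reference(client, assetstore_id, import_path, folder_id):
    """Ask Girder to index the files under import_path without copying them.

    client:      any object with a post(path, parameters=...) method, for
                 example an authenticated girder_client.GirderClient.
    import_path: the path as seen INSIDE the girder container.
    folder_id:   ID of the destination folder for the created items.
    """
    params = {
        # Parameter names are assumptions based on Girder's REST API.
        "importPath": import_path,
        "destinationId": folder_id,
        "destinationType": "folder",
    }
    return client.post(f"assetstore/{assetstore_id}/import", parameters=params)


# Example wiring (hypothetical URL, API key, and IDs):
# from girder_client import GirderClient
# gc = GirderClient(apiUrl="http://localhost:8080/api/v1")
# gc.authenticate(apiKey="...")
# import_by_reference(gc, "<assetstore id>", "/myLocalNas", "<folder id>")
```

Passing the client in as an argument keeps the helper independent of how you authenticate, and makes it easy to dry-run against a stub before touching a real server.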

The only thing the actual MONGO database stores is the actual internal DSA metadata, annotation documents, etc.

The assetstore is a separate folder / resource where things people have "manually" uploaded, i.e. via drag & drop through the UI, would appear.



-- David A Gutman, M.D. Ph.D. Associate Professor of Neurology Emory University School of Medicine

andreped commented 1 year ago

@dgutman OK, but then option (2) is possible which is the ideal scenario! Great!

I followed the steps with the docker compose as you mentioned but maybe something is wrong. I could try to reset the DSA docker container tomorrow.

andreped commented 1 year ago

Just did a test now using reference-based imports, through the Assetstore import data feature. Went really fast. I still notice that the size of the collection has increased.

From the snapshots below you can see the file sizes and the storage required for the database. But if I understand correctly, this number does not represent true storage used within the database?

[Screenshot from 2023-02-26 11-18-47]

[Screenshot from 2023-02-26 11-20-05]

manthey commented 1 year ago

@andreped The WSI files are never stored in the database UNLESS you upload them to a GridFS based assetstore. If you upload them, they are stored in the assetstore (typically either a file system or an S3 bucket) under some hash-based naming scheme. If you IMPORT them, it just stores a path reference to the original file, plus some minor metadata (file size, file path or bucket path). The reported size is from that metadata.
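
To make the distinction concrete, here is a toy Python model of why the reported size grows even though nothing is copied. It is only an illustration: the file records are made up, and the `imported` and `path` field names are assumptions about how an imported file might be flagged, not a dump of real Girder documents.

```python
def reported_size(files):
    """Sum of the recorded 'size' fields, i.e. how big the data set is."""
    return sum(f["size"] for f in files)


def bytes_copied(files):
    """Bytes actually written into the assetstore (uploads, not imports)."""
    return sum(f["size"] for f in files if not f.get("imported", False))


# Made-up records: two imported slides and one file uploaded through the UI.
files = [
    {"name": "a.svs", "size": 2_000_000_000, "imported": True,
     "path": "/myLocalNas/a.svs"},
    {"name": "b.svs", "size": 1_500_000_000, "imported": True,
     "path": "/myLocalNas/b.svs"},
    {"name": "notes.txt", "size": 1_000},
]

print("reported:", reported_size(files))
print("copied:  ", bytes_copied(files))
```

In this model the imported slides contribute to the reported total through their size metadata, while only the uploaded file adds bytes to the assetstore.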

andreped commented 1 year ago

Oh, OK. That explains it. Cheers! :]

dgutman commented 1 year ago

Yes those numbers are telling you how big the DATA set is, but the actual database size is.. not that. :-)
