Booritas / slideio

BSD 3-Clause "New" or "Revised" License

Very slow performance when reading block from SVS created by GT 450 #9

Closed bnapora closed 1 year ago

bnapora commented 1 year ago

We're having an issue reading blocks from SVS images created by Leica GT 450 scanners. The "scene.read_block" method takes significantly longer than the "open_slide.read_region" method when reading random patches of the same width/height from an SVS created on a GT 450. However, when reading from an SVS file created by an older Aperio Image Library (e.g. v12.0.16), the speed is comparable or even slightly faster for SlideIO.

I've attached a quick comparison of the performance difference between OpenSlide and SlideIO (the OpenSlide and SlideIO columns show the elapsed time to retrieve 100 patches from the same image). [image: comparison table]

If there is any logging or something I could do to help track down the issue I'd be happy to do it.

If you are interested, here is the code used for the comparison:

# Imports are assumed; the original snippet omitted them.
import random
from timeit import default_timer as timer

import slideio
from openslide import OpenSlide
from tqdm import tqdm

# Assumed to be defined by the caller: svs_files (list of pathlib.Path),
# num_slides_to_test, num_examples, patch_width, patch_height, level.
results = []

random.shuffle(svs_files)
for image_path in tqdm(svs_files[:num_slides_to_test], total=num_slides_to_test):
    # Load OpenSlide
    open_slide = OpenSlide(str(image_path))

    # Load SlideIO
    io_slide = slideio.open_slide(str(image_path), "SVS")
    scene = io_slide.get_scene(0)

    num_levels = open_slide.level_count
    width, height = open_slide.level_dimensions[0]

    # Time num_examples random patches with OpenSlide
    start = timer()
    for i in range(num_examples):
        x, y = random.randint(0, width-patch_width-1), random.randint(0, height-patch_height-1)
        open_slide.read_region((x, y), level, (patch_width, patch_height))
    elapsed_open_slide = timer() - start

    # Time num_examples random patches with SlideIO
    start = timer()
    for i in range(num_examples):
        x, y = random.randint(0, width-patch_width-1), random.randint(0, height-patch_height-1)
        scene.read_block(rect=(x, y, patch_width, patch_height))
    elapsed_slide_io = timer() - start

    # The last field is the first line of the slide's raw metadata.
    results.append((image_path.parent, image_path.stem, num_levels, width, height,
                    elapsed_open_slide, elapsed_slide_io,
                    io_slide.raw_metadata.split("\r\n")[0]))
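One caveat with the loop above is that each library is timed on a different set of random coordinates. A minimal sketch of drawing the coordinates once and replaying them for both readers follows. The helper names `random_origins` and `time_reads` are hypothetical and belong to neither library.

```python
import random
import time


def random_origins(width, height, patch_w, patch_h, count, seed=0):
    """Return `count` random top-left corners that keep a patch inside the image."""
    rng = random.Random(seed)  # seeded, so both libraries see the same patches
    return [
        (rng.randint(0, width - patch_w), rng.randint(0, height - patch_h))
        for _ in range(count)
    ]


def time_reads(read_patch, origins):
    """Call read_patch(x, y) for every origin and return the elapsed seconds."""
    start = time.perf_counter()
    for x, y in origins:
        read_patch(x, y)
    return time.perf_counter() - start


# Usage with the objects from the snippet above (not executed here):
# origins = random_origins(width, height, patch_width, patch_height, num_examples)
# t_open = time_reads(
#     lambda x, y: open_slide.read_region((x, y), level, (patch_width, patch_height)),
#     origins)
# t_sio = time_reads(
#     lambda x, y: scene.read_block(rect=(x, y, patch_width, patch_height)),
#     origins)
```

Replaying identical coordinates removes one source of noise, although file-system caching can still favour whichever library runs second.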

Booritas commented 1 year ago

Thank you for your message. Unfortunately, I don't have any slides from a Leica GT 450 scanner and cannot say why the library performs poorly on them. Would it be possible to share such a file so I can investigate the problem? Best regards, Stanislav

bnapora commented 1 year ago

We can share some GT 450 images. Below are links to 2 images. If you need more, let me know.

https://gestaltdiag.sharepoint.com/:u:/s/GestaltAI/EVpvFyBgIGdKmtF_9K41K0cB9djbDZzR7nWxpqExOmUfLQ?e=9gNIQR https://gestaltdiag.sharepoint.com/:u:/s/GestaltAI/EQ2e_DKUjQBHrbhEVOIvy7oBxDmmZheU4dncXQ82a7eJcg?e=qVB4K9

Brian

Booritas commented 1 year ago

Thanks for the slides. I downloaded them, so you can remove them from the share. I confirm that the performance of slideio on these slides is low. I will investigate the issue and let you know as soon as I have some information or a fix. Best regards, Stanislav

bnapora commented 1 year ago

Great! Let me know if you need additional images.

Booritas commented 1 year ago

Hi Brian, I investigated the problem and found a solution. With it applied, both libraries deliver about the same performance. SlideIO is even a bit faster, but only by a few percent. I'll prepare a fix soon. I expect to publish a new version next week at the latest. I'll let you know when it is available. Best, Stanislav

bnapora commented 1 year ago

Great! I look forward to testing the modification. Out of curiosity...what about the GT 450 files was causing the issue? Also...have you had any complaints about initial image load time for DICOM files? In our tests it is taking significantly longer to do the initial load.

Booritas commented 1 year ago

I just published a new version (2.0.1) that should solve the performance problem. The problem was caused by a redundant call to the setCurrentDirectory libtiff function when reading the tiles. I did not expect such a call to cause a problem when the current directory is already set.

I'm not sure why it appeared on GT 450 slides and is not visible on others. Maybe the slides have a more complex TIFF structure. Anyway, the reading is faster now. I expect that the fix improves performance for other slides as well.
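The actual change in slideio's source is not shown in this thread. As a rough illustration of the general pattern (skip the directory switch when the requested directory is already active), here is a self-contained sketch. `FakeTiff` and `TileReader` are hypothetical stand-ins, not slideio or libtiff classes.

```python
class FakeTiff:
    """Stand-in for a TIFF handle; counts how often the directory is switched."""

    def __init__(self):
        self.switches = 0

    def set_directory(self, index):
        # In a real TIFF library this step may re-read directory data from disk.
        self.switches += 1


class TileReader:
    """Remembers the active directory so repeated reads skip the switch."""

    def __init__(self, tiff):
        self._tiff = tiff
        self._current = None

    def read_tile(self, directory, tile_index):
        if directory != self._current:
            self._tiff.set_directory(directory)
            self._current = directory
        return (directory, tile_index)  # placeholder for decoded pixel data


tiff = FakeTiff()
reader = TileReader(tiff)
for tile in range(100):
    reader.read_tile(0, tile)  # only the first read switches the directory
reader.read_tile(1, 0)         # a different directory forces one more switch
print(tiff.switches)  # 2
```

Without the guard, the same 101 reads would trigger 101 directory switches, which matters if each switch is expensive for a given file layout.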

I would be very interested in the results of your testing.

I will check what happens with DICOM files and let you know. My current assumption is that during the first reading, the library loads the DICOM dictionary.

If you like the library, please consider giving a star to the repository.

Best regards, Stanislav

bnapora commented 1 year ago

Excellent. Just tested the update on the GT 450 samples and access is significantly faster. I agree it's weird the redundant call only manifests on those slides. Thanks for finding and applying a fix so quickly.

Regarding DICOM: when you say "the first reading" is loading the DICOM dictionary, do you mean the DICOM metadata? If that's what you mean, then there may be an issue, because no metadata is being returned by the raw_metadata function. I've been experimenting with slideio for opening DICOM files, and I've had to use pydicom to parse the metadata out (which works fine as long as I tell it not to load pixels. =), and then open the pixels with slideio.
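The workaround described here could look roughly like the sketch below. It assumes pydicom's `dcmread` with `stop_before_pixels=True` for the header and the same slideio calls used earlier in this thread for the pixels. The function name `read_header_and_pixels` is hypothetical.

```python
def read_header_and_pixels(path, rect):
    """Read DICOM tags with pydicom and a pixel block with slideio."""
    import pydicom  # third-party: pip install pydicom
    import slideio  # third-party: pip install slideio

    # stop_before_pixels=True makes pydicom skip the (large) pixel data element.
    header = pydicom.dcmread(path, stop_before_pixels=True)

    # Pixels come from slideio's DICOM driver; rect is (x, y, width, height).
    slide = slideio.open_slide(path, "DCM")
    scene = slide.get_scene(0)
    block = scene.read_block(rect=rect)
    return header, block
```

The imports sit inside the function so the module can be loaded even where one of the two packages is missing.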

Booritas commented 1 year ago

Thanks for the info and the images. I will check them. What I meant by the dictionary is this (https://support.dcmtk.org/docs/file_datadict.html), not the metadata. The dictionary has to be loaded before working with the slides. It has to be done once.

Anyway, I just realized that metadata for DICOM files is implemented, but because of a small bug it is not accessible from Python.

Give me a couple of days and I'll fix it.

bnapora commented 1 year ago

Thanks for the info on the DataDict. Do you use it to open the image? If not, is it possible to make loading the DataDict optional? (It looks like DCMTK has an option to "--disable-default-dict").

Brian

Booritas commented 1 year ago

I did some investigation into the DICOM driver performance. It looks like dictionary loading does not affect the performance. I also noticed that checking whether the file is a DICOMDIR file takes too long, so I optimized the check. I expect that file opening is now twice as fast as before.

Additionally, the scene object now exposes a get_raw_metadata method, which returns a JSON representation of the DICOM metadata, including all tags. Do you think such a representation (JSON) is OK?

Please let me know if the performance of the driver is acceptable now. I would be interested to know which image formats you use in practice. Is there any format not supported by SlideIO? Would you be interested in formats like Leica lif or Ventana bif?

Thank you for your help! Version 2.0.2 is now available for download.

P.S. I already downloaded your DICOM file, you can remove the sharing.

bnapora commented 1 year ago

Nice...the DICOM file does appear to load faster now. Is there still an option to specify a DICOMDIR? (I imagine some users may find this useful.)

DICOM Metadata - I attempted to view the metadata on the first "scene" of the DICOM file I provided as a sample, using scene.get_raw_metadata. The method did return an object, but there was no metadata in it. Were you able to return metadata for that slide?

Additional File Types - I am interested in the Ventana bif format (although I don't have much sample data). Additionally, I'm interested in the 3DHISTECH MRXS format (MIRAX). OpenSlide does a decent job with this, but it would be nice if slideio handled it as well. The other format we see is NDPI, which slideio currently supports. However, in my limited testing of NDPI, the slideio read_block method is slower on NDPI than on DICOM or SVS. (Not sure if this is an NDPI thing or not.) The other format I'm starting to see a little more of in the US is OME-TIFF. My understanding is that this one is difficult to handle. I haven't spent much time working with it, but would be interested if slideio supported it.

FileType Method - Not sure if you've considered this, but it would be handy if slideio had a method to detect supported file types. OpenSlide has this method (detect_format, which returns None if the file type is not supported). This would enable slideio to cleanly handle formats it doesn't support (e.g. some tif formats), as well as remove the need for basic file type checking in implementation code. Just a thought. =)
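Until such a method exists, the "basic file type checking in implementation code" mentioned above might be sketched as below. The table and the `detect_driver` helper are hypothetical. Only the "SVS" and "DCM" driver names appear in this thread, and "NDPI" is an assumption. The check looks at the file extension only and never opens the file.

```python
from pathlib import Path

# Hypothetical extension-to-driver table; "NDPI" is an assumed driver name.
DRIVER_BY_EXTENSION = {
    ".svs": "SVS",
    ".dcm": "DCM",
    ".ndpi": "NDPI",
}


def detect_driver(path):
    """Return a driver name for the file, or None if the extension is unknown."""
    return DRIVER_BY_EXTENSION.get(Path(path).suffix.lower())


print(detect_driver("slides/case_01.SVS"))   # SVS
print(detect_driver("slides/case_02.mrxs"))  # None
```

Returning None for unknown extensions mirrors the behaviour described for detect_format, so calling code can branch on the result before trying to open the slide.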

Slideio is a very nice library and your responsiveness in improving it is impressive.

Booritas commented 1 year ago

Sorry for the delay with the answer.

DICOMDIR

Yes, it is still possible to load a DICOMDIR. Just use the path to the DICOMDIR file instead of the DICOM file. BTW, if you specify the path to a directory with DICOM files, the driver will create a set of 3D/4D scenes sorted by series (one series is a single scene with multiple files that represent z slices and/or time frames).
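To see how a directory or DICOMDIR is split into scenes, something like the following sketch may help. It assumes the slideio Python properties `num_scenes`, `name`, `num_z_slices` and `num_t_frames`. The helper name `describe_scenes` is hypothetical.

```python
def describe_scenes(path):
    """List (name, z slices, time frames) for every scene the DCM driver finds."""
    import slideio  # third-party: pip install slideio

    # path may be a single DICOM file, a DICOMDIR file, or a directory.
    slide = slideio.open_slide(path, "DCM")
    summary = []
    for index in range(slide.num_scenes):
        scene = slide.get_scene(index)
        summary.append((scene.name, scene.num_z_slices, scene.num_t_frames))
    return summary
```

Each returned tuple corresponds to one scene, which per the description above should map to one DICOM series.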

Dicom Metadata

It is strange that you cannot retrieve the metadata. Do you use the latest version (2.0.2)? I tried it again on your sample and it works. Here is the code I ran:

import slideio
import json

# Open the sample with the DICOM driver and take the first scene
slide = slideio.open_slide('D:\\Projects\\slideio\\images\\dcm\\private\\H01EBB49P-24900\\H01EBB49P-24900_label.dcm', 'DCM')
scene = slide.get_scene(0)
# get_raw_metadata() returns a JSON string; parse it into a dict
raw_metadata = scene.get_raw_metadata()
metadata = json.loads(raw_metadata)
metadata

Here is the output:

{
 '00080008': {'vr': 'CS', 'Value': ['ORIGINAL', 'PRIMARY', 'LABEL', 'NONE']},
 '00080016': {'vr': 'UI', 'Value': ['1.2.840.10008.5.1.4.1.1.77.1.6']},
 '00080018': {'vr': 'UI',
  'Value': ['1.2.826.0.1.3680043.10.559.3908940780013874899074392838242720451']},
 '00080020': {'vr': 'DA'},
 '00080021': {'vr': 'DA'},
 '00080023': {'vr': 'DA', 'Value': ['20221213']},
 '0008002A': {'vr': 'DT', 'Value': ['20221213094124.706974']},
 '00080030': {'vr': 'TM'},
 '00080031': {'vr': 'TM'},
 '00080033': {'vr': 'TM', 'Value': ['094124.706974']},
 '00080050': {'vr': 'SH'},
 '00080060': {'vr': 'CS', 'Value': ['SM']},
 '00080070': {'vr': 'LO', 'Value': ['Pramana']},
 '00080080': {'vr': 'LO'},
 '00080090': {'vr': 'PN'},
 '00081030': {'vr': 'LO'},
 '0008103E': {'vr': 'LO'},
 '00081040': {'vr': 'LO'},
...
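In this layout each key is an eight-digit hexadecimal tag, and the "Value" list is missing when a tag is present but empty. A small sketch of reading single values safely follows. The helper `tag_value` is hypothetical, and the sample entries are copied from the output above.

```python
import json


def tag_value(metadata, tag):
    """Return the first value stored under a DICOM tag, or None if absent/empty."""
    values = metadata.get(tag, {}).get("Value")
    return values[0] if values else None


# A few entries copied from the output above, written as a JSON string.
raw = """{
  "00080060": {"vr": "CS", "Value": ["SM"]},
  "00080070": {"vr": "LO", "Value": ["Pramana"]},
  "00080020": {"vr": "DA"}
}"""
metadata = json.loads(raw)

print(tag_value(metadata, "00080060"))  # SM
print(tag_value(metadata, "00080070"))  # Pramana
print(tag_value(metadata, "00080020"))  # None (tag present, no value)
```

Here 00080060 is Modality, 00080070 is Manufacturer and 00080020 is Study Date.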

Mirax slides

Unfortunately, 3DHISTECH has not opened its file format. OpenSlide has done some reverse engineering of the slides and can open some versions of them. However, in my experience, it does not work with the latest versions of the images. 3DHISTECH offers a Windows SDK, which can be installed separately and used in software. Do you think it would be useful to have a Windows-only driver?

NDPI driver

I know the NDPI driver has some performance problems. This is partly because of the way Hamamatsu saves tiles in its TIFF-like files (they have their own version of the TIFF format, which deviates from the standard). I will look at the performance and see what I can do.

FileType method

Yes, I thought about it. It should be easy to add.

Thanks a lot for your help!

bnapora commented 1 year ago

Thanks for your excellent work on this project.