DigitalSlideArchive / DSA-WSI-DeID

A workflow for redacting PHI from whole slide images (WSI) based on the Digital Slide Archive.
https://digitalslidearchive.github.io/DSA-WSI-DeID/
Apache License 2.0

Redaction failing on .ndpi images. Failed to redact item 'bytes' object has no attribute 'encode' #400

Closed Sravani-K closed 6 months ago

Sravani-K commented 7 months ago

Hi, redaction is failing on .ndpi files. Here is the log taken from error.log in the girder docker container.

[2024-02-16 08:18:54,097] ERROR: Failed to redact item
Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/wsi_deid/rest.py", line 156, in process_item
    filepath, info = process.redact_item(item, tempdir)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/wsi_deid/process.py", line 422, in redact_item
    file, mimetype = func(item, tempdir, redactList, newTitle, labelImage, macroImage)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/wsi_deid/process.py", line 1027, in redact_format_hamamatsu
    tifftools.write_tiff(ifds, outputPath)
  File "/venv/lib/python3.11/site-packages/tifftools/tifftools.py", line 320, in write_tiff
    ifdPtr = write_ifd(dest, bom, bigtiff, ifd, ifdPtr, ifdsFirst=ifdsFirst)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/tifftools/tifftools.py", line 533, in write_ifd
    data = data.encode() + b'\x00'
           ^^^^^^^^^^^
AttributeError: 'bytes' object has no attribute 'encode'
Additional info:
  Request URL: PUT http://localhost:8080/api/v1/wsi_deid/item/65cf122a073531c8a8e858b9/action/process
manthey commented 7 months ago

Do you know if it is failing on all ndpi files, or just some? We test against some ndpi files as part of the CI, so it might be something specific about your ndpi file. And, is there anything you can share about the specific file that failed? How large is it (I don't think we've ever tested with ndpi files that are more than 4GB in size)?

Sravani-K commented 7 months ago

It is failing on all our ndpi files. Yes, we tried the test ndpi images that the CI pipeline points to, and those work fine. We noticed differences between the test ndpi images and our images, particularly a large gap in when they were created and in the structure of the header. Maybe some upgrade or modification to the ndpi format is causing the failure?

header of our ndpi images: II*▒▒▒ HamamatsuC13220NZAcquire 2.0.4.402022:08:11 09:25:56.Version=1..0 Created=2021/02/14 Updated=2022/07/28 Property.Version=3.0 Calibration.Version=500 Product=C13220 NDP.S/N=000280 Macro.S/N=000284 Objective.Lens.Magnificant=20

header of test ndpi images II*▒▒)HamamatsuC12000-02NDP.scan 3.0.42012:10:30 18:02:27477130870003Created=2012/07/24 Updated=2012/10/29 Product=C12000-02 system.version=1.0 NDP.S/N=870003 Macro.S/N=9Y0437 roi.slide.macro=117,278,1277,674 roi.barcode.macro=988,297,1245,655

dgutman commented 7 months ago

This is one of the issues with all of these formats.. we have a test set of images but of course when stuff in the black box changes (i.e. the NDPI file format), we don't always know about it. Are there any failing slides that don't have PHI, or at least other slides scanned with the same scanner without PHI that you could run through and see if they also crash?

manthey commented 7 months ago

Thanks, @Sravani-K . I often use a specific python tool for analyzing files so I can see the very specific structure. If you have any python environment, can you do pip install tifftools and then tifftools dump <path to your ndpi file> and post the output. If there is PHI in the internal metadata, it could be visible in this dump, so please redact as needed before sharing.

Sravani-K commented 7 months ago

@manthey PFA the output from tifftools dump. anon_ndpi_dump.txt

manthey commented 7 months ago

I've hardened the tifftools writer (https://github.com/DigitalSlideArchive/tifftools/pull/90). It will take a little time to percolate through CI and for new docker images for WSI DeID to be published.
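For readers following along, the traceback above fails because `.encode()` is called on a value that is already `bytes`. The snippet below is only a minimal sketch of that kind of hardening, not the code from the linked pull request; the helper name `to_nul_terminated_bytes` is hypothetical, and leaving an existing trailing NUL alone is an assumption.

```python
def to_nul_terminated_bytes(data):
    """Return data as NUL-terminated bytes, accepting str or bytes.

    Hypothetical helper for illustration; not part of tifftools.
    """
    if isinstance(data, str):
        # Text values need encoding before they can be written.
        data = data.encode()
    elif isinstance(data, (bytes, bytearray)):
        # Already binary: calling .encode() here is what raised
        # AttributeError in the traceback above.
        data = bytes(data)
    else:
        raise TypeError('expected str or bytes, got %s' % type(data).__name__)
    # Assumption: avoid doubling a terminator that is already present.
    if not data.endswith(b'\x00'):
        data += b'\x00'
    return data


if __name__ == '__main__':
    print(to_nul_terminated_bytes('Hamamatsu'))
    print(to_nul_terminated_bytes(b'Hamamatsu'))
```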

Once you confirm this works on your ndpi files, can you close this issue?

manthey commented 7 months ago

CI has published a new dsarchive/wsi_deid docker image.

Sravani-K commented 7 months ago

@manthey thanks for the fix. It solved the error that we faced earlier. Now, we are getting another error as below.

[2024-03-04 08:23:23,292] ERROR: 500 Error
Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 655, in endpointDecorator
    val = fun(self, path, params)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 1233, in GET
    return self.handleRoute('GET', path, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    result = self.imageItemModel.getAssociatedImage(item, image, **params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder_large_image/models/image_item.py", line 645, in getAssociatedImage
    return self._getAndCacheImageOrData(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder_large_image/models/image_item.py", line 416, in _getAndCacheImageOrData
    return self.getAndCacheImageOrDataRun(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder_large_image/models/image_item.py", line 432, in getAndCacheImageOrDataRun
    result = getattr(tileSource, imageFunc)(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/large_image/tilesource/base.py", line 2365, in getAssociatedImage
    image = self._getAssociatedImage(imageKey)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/large_image_source_openslide/__init__.py", line 392, in _getAssociatedImage
    return PIL.Image.open(tiff_buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/PIL/Image.py", line 3309, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fca1b7393a0>
Additional info:
  Request URL: GET http://<xxxx.xxx.xxx.org>/api/v1/item/65cef4fdbd53c0e78daf6ff5/tiles/images/nonempty
Sravani-K commented 7 months ago

@dgutman there are no failing slides that don't have PHI data. Tried redacting an already redacted image using anonymize-slide. It does not crash!

dgutman commented 7 months ago

So most of the NDPI files we have tested with are likely much older than 2021... and I am sure that something likely changed in the header. We would probably need some slides that have no PHI but still have label/macro images shared with us somehow to even begin to debug.

manthey commented 7 months ago

Fundamentally, ndpi isn't tiff, but it is almost tiff. Based on the tifftools dump that you sent, the issue is that the "nonempty" tissue mask image is stored in a tiff-like IFD that claims it is 1 sample per pixel AND is RGB (these are contradictory). We ask PIL to parse this, and PIL refuses. I can add a workaround just for hamamatsu. It'd be nice if they actually adhered to an imaging standard. I may have a workaround soon.
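To illustrate the contradiction described above, here is a small sketch that flags an IFD whose tags claim RGB with fewer than three samples per pixel. It uses a plain dict of tag number to value rather than the tifftools data structures, and it is not the workaround that went into large_image. The tag numbers (262 PhotometricInterpretation, 277 SamplesPerPixel) and the RGB value 2 come from the TIFF 6.0 specification; the function names and the choice to reinterpret such an image as grayscale are assumptions.

```python
PHOTOMETRIC_INTERPRETATION = 262  # TIFF tag number
SAMPLES_PER_PIXEL = 277           # TIFF tag number
PHOTOMETRIC_MIN_IS_BLACK = 1
PHOTOMETRIC_RGB = 2


def is_contradictory(tags):
    """True if the tags claim RGB but have fewer than 3 samples per pixel.

    tags: hypothetical plain mapping of TIFF tag number -> value.
    SamplesPerPixel defaults to 1 when absent, per the TIFF spec.
    """
    photometric = tags.get(PHOTOMETRIC_INTERPRETATION)
    samples = tags.get(SAMPLES_PER_PIXEL, 1)
    return photometric == PHOTOMETRIC_RGB and samples < 3


def as_grayscale(tags):
    """Return a copy with the photometric tag reset to grayscale.

    One possible repair (an assumption, not the thread's actual fix).
    """
    fixed = dict(tags)
    if is_contradictory(fixed):
        fixed[PHOTOMETRIC_INTERPRETATION] = PHOTOMETRIC_MIN_IS_BLACK
    return fixed


if __name__ == '__main__':
    mask_ifd = {PHOTOMETRIC_INTERPRETATION: 2, SAMPLES_PER_PIXEL: 1}
    print(is_contradictory(mask_ifd), as_grayscale(mask_ifd))
```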

manthey commented 7 months ago

See https://github.com/girder/large_image/pull/1472.

manthey commented 7 months ago

@Sravani-K Thank you for your patience. Can you try again?

easwarpalvai commented 6 months ago

Thanks @manthey. We're now able to redact the Macro section of our images. However, we're currently unable to redact hamamatsu.Reference under Large Image Metadata (header). It is failing with the error below:

[2024-03-12 05:53:21,457] ERROR: Failed to redact item
Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/wsi_deid/rest.py", line 156, in process_item
    filepath, info = process.redact_item(item, tempdir)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/wsi_deid/process.py", line 396, in redact_item
    labelImage = add_title_to_image(labelImage, newTitle, previouslyRedacted, item=item)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/wsi_deid/process.py", line 1554, in add_title_to_image
    fontSize = fontSize * targetW * 0.9 / textW
ZeroDivisionError: float division by zero

On investigating the error, I've noticed the following:

In the add_title_to_image method within process.py, the variable **title** is initialized using the expression **title = title or ''**. Despite this, it consistently evaluates to an empty string. Consequently, textW is calculated as zero, leading to a failure in the redaction process. We would appreciate assistance in resolving this issue.
manthey commented 6 months ago

Please try again. I added a guard.
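The guard itself isn't shown in this thread. As a rough sketch of what guarding that division could look like, assuming the scaling is font size times target width times 0.9 divided by the measured text width, the function below skips the rescale when the text width is zero (for example, when the title is empty). The function name and the fallback of returning the font size unchanged are hypothetical.

```python
def scaled_font_size(font_size, target_width, text_width):
    """Scale font_size so the text spans about 90% of target_width.

    Hypothetical illustration of a zero-width guard; not the wsi_deid code.
    """
    if text_width <= 0:
        # An empty title measures zero pixels wide; dividing by it is what
        # raised ZeroDivisionError. Assumption: keep the size unchanged.
        return font_size
    return font_size * target_width * 0.9 / text_width


if __name__ == '__main__':
    print(scaled_font_size(10, 100, 50))   # normal case
    print(scaled_font_size(0.15, 1000, 0))  # empty title, no exception
```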

easwarpalvai commented 6 months ago

Thanks for the swift response @manthey 👍. This fixed most of our issues. If any new issue occurs, we'll create a new issue or re-open this one.

manthey commented 6 months ago

Great. I'll close this issue.