MIT-LCP / mimic-code

MIMIC Code Repository: Code shared by the research community for the MIMIC family of databases
https://mimic.mit.edu
MIT License

MIMIC-CXR-JPG sha256 sum mismatch #1763

Open amitsaha opened 1 month ago

amitsaha commented 1 month ago

Description

I downloaded the MIMIC-CXR-JPG dataset from Google Cloud Storage.

When I verify the sha256 sums, I find the following mismatches:

$ sha256sum -c SHA256SUMS.txt  --quiet
files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg: FAILED
files/p11/p11131026/s59741822/08c22db9-5bef7d06-d904ec15-7bbfe57f-416dbdc1.jpg: FAILED
files/p11/p11607063/s58298420/235c7af4-ef2ba0dc-7dc251ea-a2571f33-d37c8185.jpg: FAILED
files/p11/p11785297/s58022353/3b64bf5a-021ff5ae-137c22d1-5529364f-1415c640.jpg: FAILED
files/p11/p11920643/s55676416/4d70ff33-43ad77af-22ff047c-19f6ceb1-aae49eea.jpg: FAILED
files/p13/p13283178/s55081421/026de108-3310a177-7c01791c-7eb32cff-b076122f.jpg: FAILED
files/p13/p13628037/s54872639/f845ad66-716c76dd-da718912-8b0ff596-b30d25cb.jpg: FAILED
files/p13/p13694166/s55805720/df57d48e-566984d2-fbe39e6e-0c68fc55-380f1217.jpg: FAILED
files/p14/p14656449/s56499991/67a4e5cd-50d441d3-42294f94-363ac071-17cfc342.jpg: FAILED
files/p14/p14690121/s50057475/34ad06d4-475863f1-f3712cec-783c3b99-308cf886.jpg: FAILED
files/p17/p17405329/s55291678/283084bb-0f4994a7-d7622b32-d7f18f75-d8dde41b.jpg: FAILED
files/p17/p17490145/s55463370/803fcbd8-2e38a5c7-cca96a50-ce5660cb-83ecc3a1.jpg: FAILED
files/p18/p18459824/s52186356/2eb68b2f-0742cb3d-b8c9db5b-9c9d74f9-69e31cc1.jpg: FAILED
files/p18/p18690742/s56844948/f4f63777-6a8a6b60-d6cb0718-9256537a-2ca41831.jpg: FAILED
sha256sum: WARNING: 14 computed checksums did NOT match

I redownloaded the above files and still got the same result.

For example, if I take the first one:

$ sha256sum files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg
8a95fb444bdfec8087c49f5fb0742e6674568dd7aca839a30310a6fdb4ff427c  files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg

$ cat SHA256SUMS.txt | grep files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg
ed9a93b1fd0c9ff7c0601a79c8f6ae91c49b524a1b9a34315e065a830829df1b files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg

Any ideas on how to further verify what may be causing this?
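
One way to narrow this down is to repeat the check in Python and print the expected digest, the computed digest, and the file size side by side for every failing file. The following is only a minimal sketch, not anything provided by PhysioNet: the helper names (`sha256_of`, `load_manifest`, `find_mismatches`) are hypothetical, and it assumes SHA256SUMS.txt uses the usual `sha256sum` layout of a digest followed by whitespace and a relative path.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_manifest(manifest_path):
    """Parse 'digest path' lines into a {path: digest} dict (assumed layout)."""
    expected = {}
    for line in Path(manifest_path).read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # sha256sum marks binary mode with a leading '*' on the path
            expected[name.lstrip("*")] = digest.lower()
    return expected


def find_mismatches(manifest_path, root="."):
    """Return (path, expected, actual, size) for each file whose digest differs."""
    mismatches = []
    for name, want in load_manifest(manifest_path).items():
        target = Path(root) / name
        if not target.is_file():
            continue  # files that are not present locally are skipped
        got = sha256_of(target)
        if got != want:
            mismatches.append((name, want, got, target.stat().st_size))
    return mismatches
```

Calling `find_mismatches("SHA256SUMS.txt")` from the dataset root should list the same files that `sha256sum -c` flags. Comparing the reported sizes with the sizes of the published files may hint at whether a file was truncated or is a different version altogether.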

amitsaha commented 1 month ago

I downloaded the above files once again from GCS and ran a diff against my original copies. The files are identical, so the downloads themselves look fine. I guess we just need to update the SHA256SUMS.txt file.

My script used for the above:

import os
import subprocess

# The sha256 checksums of these images don't match the ones reported in SHA256SUMS.txt,
# so we download them again and diff them against the local copies to confirm they are
# the same images, which likely means the sha256 sums need updating
image_paths = [
    "files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg",
    "files/p11/p11131026/s59741822/08c22db9-5bef7d06-d904ec15-7bbfe57f-416dbdc1.jpg",
    "files/p11/p11607063/s58298420/235c7af4-ef2ba0dc-7dc251ea-a2571f33-d37c8185.jpg",
    "files/p11/p11785297/s58022353/3b64bf5a-021ff5ae-137c22d1-5529364f-1415c640.jpg",
    "files/p11/p11920643/s55676416/4d70ff33-43ad77af-22ff047c-19f6ceb1-aae49eea.jpg",
    "files/p13/p13283178/s55081421/026de108-3310a177-7c01791c-7eb32cff-b076122f.jpg",
    "files/p13/p13628037/s54872639/f845ad66-716c76dd-da718912-8b0ff596-b30d25cb.jpg",
    "files/p13/p13694166/s55805720/df57d48e-566984d2-fbe39e6e-0c68fc55-380f1217.jpg",
    "files/p14/p14656449/s56499991/67a4e5cd-50d441d3-42294f94-363ac071-17cfc342.jpg",
    "files/p14/p14690121/s50057475/34ad06d4-475863f1-f3712cec-783c3b99-308cf886.jpg",
    "files/p17/p17405329/s55291678/283084bb-0f4994a7-d7622b32-d7f18f75-d8dde41b.jpg",
    "files/p17/p17490145/s55463370/803fcbd8-2e38a5c7-cca96a50-ce5660cb-83ecc3a1.jpg",
    "files/p18/p18459824/s52186356/2eb68b2f-0742cb3d-b8c9db5b-9c9d74f9-69e31cc1.jpg",
    "files/p18/p18690742/s56844948/f4f63777-6a8a6b60-d6cb0718-9256537a-2ca41831.jpg"
]

# make sure the temporary download directory exists
os.makedirs("tmp-check-diff", exist_ok=True)

for image in image_paths:
    local_copy = f"tmp-check-diff/{os.path.basename(image)}"
    # download to a temporary directory
    subprocess.check_output([
        "gcloud", "storage", "--billing-project", "<project-name>", "cp",
        f"gs://mimic-cxr-jpg-2.1.0.physionet.org/{image}", local_copy,
    ])
    # check the downloaded version against the one stored locally already;
    # diff exits non-zero when the files differ, so check_output raises
    subprocess.check_output(["diff", local_copy, image])

alistairewj commented 2 weeks ago

It's odd, because the SHA256SUMs are calculated automatically by PhysioNet when the files are published, so I'm not sure why they would be wrong. It could be because we had some custom workarounds for MIMIC-CXR. I will raise this with the PhysioNet team, thanks!

alistairewj commented 2 weeks ago

OK, I think something went wrong with our GCP upload, because the files in that bucket were simply different from the published ones. Can you redownload them from the GCP bucket and check again? It should be fixed now.
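
To re-check only the affected files after redownloading, rather than the whole dataset, one option is to write a reduced checksum file and pass that to `sha256sum -c`. This is a rough sketch under assumptions: `filter_manifest` and `write_subset` are hypothetical helper names, the output filename is arbitrary, and manifest lines are copied verbatim so their format stays whatever SHA256SUMS.txt already uses.

```python
def filter_manifest(manifest_lines, wanted_paths):
    """Keep only the manifest lines whose path is in wanted_paths."""
    wanted = set(wanted_paths)
    kept = []
    for line in manifest_lines:
        parts = line.split(maxsplit=1)
        # parts[1] is the path, possibly with a trailing newline or '*' prefix
        if len(parts) == 2 and parts[1].strip().lstrip("*") in wanted:
            kept.append(line)
    return kept


def write_subset(manifest_path, wanted_paths, out_path):
    """Write the matching manifest lines to out_path and return how many matched."""
    with open(manifest_path) as src:
        kept = filter_manifest(src, wanted_paths)
    with open(out_path, "w") as dst:
        dst.writelines(kept)
    return len(kept)
```

For example, `write_subset("SHA256SUMS.txt", image_paths, "SHA256SUMS.subset.txt")` with the 14 paths listed earlier in this thread, followed by `sha256sum -c SHA256SUMS.subset.txt`, would verify just those files. A return value lower than 14 would mean some paths were not found in the manifest.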