PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

Rare HCMI Omics Issue resulting in missing data during build #223

Closed jjacobson95 closed 1 month ago

jjacobson95 commented 1 month ago

When the files listed in the manifest are downloaded, some of them can fail to download and are then excluded from the build process. We need to make this process more robust so that it either fails and exits when this happens or, better, retries the failed files until they download correctly.

This possibility is present in all HCMI omics builds.
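
A minimal sketch of the retry behavior requested above, not the code that ended up in the repository. The function name and both callables are hypothetical: `download` stands in for whatever invokes the gdc tool on a list of file IDs, and `downloaded_ids` stands in for whatever reports which IDs are actually present afterwards.

```python
from typing import Callable, Iterable, List, Set


def download_with_retries(
    file_ids: Iterable[str],
    download: Callable[[List[str]], None],
    downloaded_ids: Callable[[], Set[str]],
    max_attempts: int = 5,
) -> None:
    """Download every ID, retry the missing ones, and raise if any remain."""
    expected = set(file_ids)

    # Initial pass over the whole manifest.
    download(sorted(expected))
    missing = expected - downloaded_ids()

    attempt = 0
    while missing and attempt < max_attempts:
        attempt += 1
        print(
            f"Retrying download for {len(missing)} files "
            f"(Attempt {attempt}/{max_attempts})"
        )
        # Only the files that are still missing are requested again.
        download(sorted(missing))
        missing = expected - downloaded_ids()

    if missing:
        # Fail loudly rather than continue the build with incomplete data.
        raise RuntimeError(
            f"{len(missing)} files still missing after {max_attempts} "
            f"retries: {sorted(missing)}"
        )
```

Passing the two operations in as callables keeps the loop independent of how the download is performed, so it can be exercised with a fake downloader before being wired to the real tool.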

100% [############################################] Time:  0:00:02   1.2 MiB/s 
100% [############################################] Time:  0:00:02   1.1 MiB/s 
100% [############################################] Time:  0:00:02   1.2 MiB/s 
2024-10-08 16:33:04 ERROR: 74c5ab0d-cbff-4b5d-a864-63935a0de1d0: 500 Server Error: INTERNAL SERVER ERROR for url: https://api.gdc.cancer.gov/data/74c5ab0d-cbff-4b5d-a864-63935a0de1d0: {"message":"internal server error"}
2024-10-08 16:33:04 
2024-10-08 16:33:04 ERROR: 5e293a32-9ba4-423c-8057-55d4fe52b45c: Unable to connect to API: (HTTPSConnectionPool(host='api.gdc.cancer.gov', port=443): Read timed out. (read timeout=60)). Is this url correct: 'https://api.gdc.cancer.gov/data/5e293a32-9ba4-423c-8057-55d4fe52b45c'? Is there a connection to the API? Is the server running?
2024-10-08 16:33:04 ERROR: d5f0b966-aad9-48a6-be40-fb286e6a7dd2: 500 Server Error: INTERNAL SERVER ERROR for url: https://api.gdc.cancer.gov/data/d5f0b966-aad9-48a6-be40-fb286e6a7dd2: {"message":"internal server error"}
2024-10-08 16:33:04 
100% [############################################] Time:  0:00:02   1.2 MiB/s 
100% [############################################] Time:  0:00:02   1.3 MiB/s 
100% [############################################] Time:  0:00:02   1.3 MiB/s 
2024-10-08 16:33:29 Successfully downloaded: 1136

jjacobson95 commented 1 month ago

Just a note: this error is more common than previously thought and occurs in almost every run. A couple of variations of the error are appearing, including the following:

bb81-8abe098d889f: 410 Client Error: Gone for url: https://api.gdc.cancer.gov/legacy/data?compress

This error normally indicates that the file has been removed or is permanently unavailable for download. However, the error is inconsistent: it appears for different files (or for none) each time the gdc tool is run.
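
Because the affected files change from run to run, it can help to collect the failing IDs from each run's log. The sketch below assumes the `ERROR: <file ID>: <message>` shape seen in the log excerpts in this thread; the function name is hypothetical, and the pattern only matches full 36-character UUIDs.

```python
import re
from typing import List

# Matches lines such as:
#   2024-10-08 16:33:04 ERROR: 74c5ab0d-...-63935a0de1d0: 500 Server Error: ...
_ERROR_LINE = re.compile(
    r"ERROR: (?P<file_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}): "
)


def failed_ids(log_text: str) -> List[str]:
    """Return the file IDs reported in ERROR lines, in order, without repeats."""
    seen: List[str] = []
    for line in log_text.splitlines():
        match = _ERROR_LINE.search(line)
        if match and match.group("file_id") not in seen:
            seen.append(match.group("file_id"))
    return seen
```

Parsing the log is only a diagnostic aid here. Comparing the manifest against what is actually on disk is a more reliable way to decide what to retry, since it does not depend on the log format.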

jjacobson95 commented 1 month ago

This will be resolved in the build_all_updates branch. Still doing some tests, but it seems to be working.

Output of the fixed code (the logs print out of order, but you can still tell what is happening):

...
100% [############################################] Time:  0:00:02   1.1 MiB/s 
100% [############################################] Time:  0:00:02   1.3 MiB/s 
Successfully downloaded: 1137
Failed downloads: 2
100% [############################################] Time:  0:00:02   1.2 MiB/s 
100% [############################################] Time:  0:00:04 794.0 KiB/s 
Successfully downloaded: 2
gdc-client already installed
Using provided manifest and downloading data...
Using gdc tool and retrieving get metadata...
Total files to download: 1139
Starting initial download...
Initial download complete.

Retrying download for 2 files (Attempt 1/5):
  Missing files: 2
    File IDs: 70efab6a-c0d3-403a-8708-880136723d1f, 4b362fa9-4031-4404-8522-cf19308dea49
Starting retry 1 download...
Retry 1 complete.

All files downloaded and verified successfully.

All files downloaded and verified successfully after retries.
Extracting UUIDs from manifest...
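
The "downloaded and verified" step in the output above could be approximated with a check like the following. This is a sketch under stated assumptions, not the code from the build_all_updates branch: it assumes the manifest is a tab-separated file with an `id` column and that each file lands in a sub-directory of the download directory named after its ID. The function name is hypothetical.

```python
import csv
from pathlib import Path
from typing import List


def missing_from_manifest(manifest_path: str, download_dir: str) -> List[str]:
    """Return manifest IDs that have no non-empty directory under download_dir."""
    missing: List[str] = []
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            target = Path(download_dir) / row["id"]
            # Assumption: one sub-directory per file ID, containing the file.
            if not target.is_dir() or not any(target.iterdir()):
                missing.append(row["id"])
    return missing
```

An empty return value corresponds to the "All files downloaded and verified successfully" case; a non-empty list is what a retry pass would request again.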