binli123 / dsmil-wsi

DSMIL: Dual-stream multiple instance learning networks for tumor detection in Whole Slide Image
MIT License

TCGA data download #16

Open LITTLEKKKK opened 3 years ago

LITTLEKKKK commented 3 years ago

When I visit the website, it says: “All slide and diagnostic images from the TCGA program are currently unavailable for download”. Could you share the lung datasets via a Google Cloud link? : )

binli123 commented 3 years ago

I believe the Google Drive link is posted in the readme. I have emphasized the link and updated the readme file. Could you check if the link in the section Processing raw WSI data->Download WSIs->From Google Drive works for you?

GeorgeBatch commented 2 years ago

Hi Bin,

Do you have any advice on how to download the Google Drive folder with the TCGA files from a terminal? I tried using gdown, but it can only download folders containing at most 50 files.

Best wishes, George

binli123 commented 2 years ago

I have never tried it from a terminal, but I think one appropriate way to download a large number of files is to use the Google Drive desktop app and select the folder to sync to your local device.

GeorgeBatch commented 2 years ago

Thanks! How large is the Google Drive folder that you provided?

binli123 commented 2 years ago

I think it is around 800GB.

GeorgeBatch commented 2 years ago

Thanks, I'll try using the desktop app, but I do not have 800GB of free disk space on my machine.

binli123 commented 2 years ago

You could also just use the cropped patches I uploaded; they are less than 100GB.
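
Before starting either download, it may help to check how much free disk space the target drive has. Below is a minimal Python sketch using only the standard library. The function name `has_room_for` and the 10% safety margin are illustrative assumptions, not part of this repository.

```python
import shutil

GB = 10**9  # decimal gigabytes, which is how drive and download sizes are usually quoted


def has_room_for(download_gb: float, target_dir: str = ".", margin: float = 1.1) -> bool:
    """Return True if the filesystem holding target_dir has enough free space.

    download_gb is the expected download size in GB; margin is a
    hypothetical safety factor (10% extra by default).
    """
    free_bytes = shutil.disk_usage(target_dir).free
    return free_bytes >= download_gb * GB * margin
```

For example, `has_room_for(800)` would correspond to the full WSI folder and `has_room_for(100)` to the cropped patches, using the approximate sizes mentioned in this thread. Unzipping an archive needs additional space on top of the download itself.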

GeorgeBatch commented 2 years ago

Thank you!

GeorgeBatch commented 2 years ago

Hi Bin,

I am trying to understand which of the files from the Google Drive folder I actually need.

In the TCGA-lung-WSI folder, all the .svs files are enclosed in folders, e.g. the ffa686dc-0f3c-4fb8-af3b-ee82a940752a folder for the ffa686dc-0f3c-4fb8-af3b-ee82a940752a.svs WSI. Each of them also seems to have a corresponding logs folder. Can you please explain what is in there and why it is needed?

A similar thing is true of the TCGA-lung-WSI-corrupt folder, but here each of the WSI subfolders also has an annotations.txt file. Can you please also explain why the corrupted WSIs have annotations, while all the other WSIs don't?

Many thanks, George

binli123 commented 2 years ago

Those are just download logs that are automatically generated when you download something from the NCI data portal. A small portion of the WSIs have coarse annotations that come with the slide, and the low-quality ones (also scanned at a lower magnification) just happen to have them. I guess those were uploaded by a specific facility that also annotated the slides.
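
Given the layout described in this thread (one sub-folder per slide, a logs folder next to each slide, and an annotations.txt file for some slides), presumably only the .svs files matter for further processing. Below is a minimal Python sketch for collecting them. The function name `find_slides` is hypothetical, and the folder layout is assumed from the description above rather than checked against the actual Google Drive folder.

```python
from pathlib import Path


def find_slides(root: str) -> list[Path]:
    """Return every .svs file under root, sorted, skipping download logs.

    Assumed layout: <root>/<slide-id>/<name>.svs, with an optional
    <root>/<slide-id>/logs/ folder and annotations.txt next to the slide.
    """
    slides = []
    for path in Path(root).rglob("*.svs"):
        # Ignore anything that sits inside a "logs" directory.
        if "logs" in path.relative_to(root).parts:
            continue
        if path.is_file():
            slides.append(path)
    return sorted(slides)
```

Calling `find_slides("TCGA-lung-WSI")` would then return only the slide files, leaving the logs and annotations.txt files out.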

GeorgeBatch commented 2 years ago

Makes sense, thank you!

LITTLEKKKK commented 2 years ago

I didn't find the cropped patches in the Google Drive folder. Where is the link? Thanks.

binli123 commented 2 years ago

https://drive.google.com/file/d/17zCn-WRNzxxxh8kkdBTbDLDZy0XZ3RIu/view?usp=sharing

LITTLEKKKK commented 2 years ago

Thanks a lot. The download of the cropped patches zip file often breaks off and is not stable. Did you upload unzipped files of the cropped patches before? : (
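
When a large download keeps breaking off, it can be useful to check whether the zip file that ended up on disk is complete before trying to extract it. Below is a minimal Python sketch using the standard `zipfile` module. The function name `check_zip` and the returned messages are illustrative assumptions.

```python
import zipfile


def check_zip(path: str) -> str:
    """Return 'ok', or a short description of what is wrong with the archive."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the name of the first
            # one with a bad CRC or header, or None if all members are fine.
            bad_member = archive.testzip()
    except zipfile.BadZipFile as exc:
        # Typical for a truncated download: the central directory stored at
        # the end of the file is missing.
        return f"not a valid zip: {exc}"
    if bad_member is not None:
        return f"corrupt member: {bad_member}"
    return "ok"
```

Note that `testzip()` reads the whole archive, so this can take a while on a file of this size.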

GeorgeBatch commented 2 years ago

Also, it looks like the command should include the download subcommand:

  $ cd tcga-download
  $ gdc-client download -m gdc_manifest.2020-09-06-TCGA-LUAD.txt --config config-LUAD.dtt
  $ gdc-client download -m gdc_manifest.2020-09-06-TCGA-LUSC.txt --config config-LUSC.dtt

instead of

  $ cd tcga-download
  $ gdc-client -m gdc_manifest.2020-09-06-TCGA-LUAD.txt --config config-LUAD.dtt
  $ gdc-client -m gdc_manifest.2020-09-06-TCGA-LUSC.txt --config config-LUSC.dtt

binli123 commented 2 years ago

They also updated the download client; I might just remove this part from the readme.

binli123 commented 2 years ago

> Thanks a lot. The cropped patches zip file is often broken off and not stable. Did you upload unzip files of cropped patches before? : (

Which operating system do you use?

LITTLEKKKK commented 2 years ago

Windows. I use IDM to download the file.

GeorgeBatch commented 2 years ago

TCGA slides are back online, but I needed to generate the manifest files from scratch. I originally wanted to use yours, but some of the file names were not found; maybe they changed them.
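
After downloading with a regenerated manifest, one way to see which slides actually arrived is to compare the manifest against the download folder. Below is a minimal Python sketch. It assumes the manifest is a tab-separated file with `id`, `filename` and `md5` columns and that each file is stored as `<download_dir>/<id>/<filename>`, which matches the per-slide folders described earlier in this thread. The function names are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_path: str, download_dir: str, verify_md5: bool = False) -> dict[str, str]:
    """Map each manifest file name to 'ok', 'missing', or 'md5 mismatch'."""
    report = {}
    with open(manifest_path, newline="") as handle:
        # Assumed manifest columns: id, filename, md5 (tab-separated).
        for row in csv.DictReader(handle, delimiter="\t"):
            local = Path(download_dir) / row["id"] / row["filename"]
            if not local.is_file():
                report[row["filename"]] = "missing"
            elif verify_md5 and md5_of(local) != row["md5"]:
                report[row["filename"]] = "md5 mismatch"
            else:
                report[row["filename"]] = "ok"
    return report
```

Checksum verification is off by default because hashing hundreds of gigabytes of slides is slow; pass `verify_md5=True` to enable it. This only checks the local copy; it does not tell you whether a file name in an old manifest still exists on the portal.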

Are these TCGA-LUAD (541 slides) and TCGA-LUSC (512 slides) the links you used to get the manifest files?

I ended up there by clicking on "diagnostic slides" from the main links:

LITTLEKKKK commented 2 years ago

Is there a Google Drive link for Camelyon 16 cropped patches? Thanks.

Raymvp commented 7 months ago

> You could also just use the cropped patches I uploaded, they are less than 100GB

What is the magnification of these patches, 20x or 5x? The images look blurry.