AdalbertoCq / Histomorphological-Phenotype-Learning

Corresponding code of 'Quiros A.C.+, Coudray N.+, Yeaton A., Yang X., Chiriboga L., Karimkhan A., Narula N., Pass H., Moreira A.L., Le Quesne J.*, Tsirigos A.*, and Yuan K.* Mapping the landscape of histomorphological cancer phenotypes using self-supervised learning on unlabeled, unannotated pathology slides. 2024'

Tile vector representations question #1

Closed wuyu-z closed 2 years ago

wuyu-z commented 2 years ago

Hello @AdalbertoCq, my goal is to obtain tile vector representations for a single .svs file. If I understand correctly, real_hdf5 is the H5 file we get as output from the DeepPath preprocessing, and the checkpoint is the weights file we get from step 1 (I used your provided weights here). When I try to run the script run_representationspathology_projection.py, it gives the following error:

File "/models/selfsupervised/BarlowTwins.py", line 85, in __init__
    self.num_samples = data.training.images.shape[0]
AttributeError: 'NoneType' object has no attribute 'images'

If I'm tracing through the code correctly, this error essentially happens because the dataset argument is set incorrectly.

Can you explain more about the dataset argument of run_representationspathology_projection.py? Which dataset does it refer to? And more importantly, why do I need another dataset when vectorising tiled images, given that I already have the image H5 file (real_hdf5) and a pre-trained model (the .ckt checkpoint file)?

Thank you in advance

AdalbertoCq commented 2 years ago

Hi @wuyu-z,

Yeah, it seems like that's the case.

The dataset argument is the same one used in the training script for the self-supervised model. Its purpose is to keep track of the dataset used to train the SSL model: the script places the tile vector representation H5 into a directory named after that dataset.
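A minimal sketch of how that output path could look (the directory layout and names here are assumptions for illustration, not the script's verified conventions):

```python
import os

# Hypothetical example: the projection output is grouped under a directory
# named after the dataset used to train the self-supervised model.
# "results" and "BarlowTwins_3" are assumed names for illustration only.
dataset = "TCGAFFPE_LUADLUSC_5x_60pc_250K"
out_dir = os.path.join("results", "BarlowTwins_3", dataset)
print(out_dir)
```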

In this case, the real_hdf5 file refers to an external cohort, but the self-supervised model was trained on a subsample of 250K tiles from TCGA WSIs.

It also uses this file to check the format of the images in the H5 file (height, width, # of channels) and instantiate the model.
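As a rough illustration of that shape check (the H5 key name and tile size below are assumptions, not the repo's actual schema):

```python
import h5py
import numpy as np

# Write a tiny stand-in H5 with the assumed key "train_img", then read
# back the image dimensions (height, width, channels) the way the
# projection step would need them to instantiate the model.
with h5py.File("toy_train.h5", "w") as f:
    f.create_dataset("train_img", data=np.zeros((4, 224, 224, 3), dtype=np.uint8))

with h5py.File("toy_train.h5", "r") as f:
    num_samples, height, width, channels = f["train_img"].shape

print(num_samples, height, width, channels)
```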

If you just want to use the pre-trained model to find the tile vector representations, there's a workaround: take the real_hdf5 file and create a dataset directory containing just that file as the training H5. You should then be able to provide it as the dataset argument and run the projections.
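A hedged sketch of that workaround (the datasets/ layout and the hdf5_&lt;dataset&gt;_he_train.h5 naming pattern are assumptions inferred from the repo's conventions; check the README for the exact paths):

```python
import os
import shutil

# Empty stand-in so the sketch runs end to end; in practice this is the
# real_hdf5 file produced by the preprocessing step.
open("real_hdf5.h5", "wb").close()

# Assumed layout: datasets/<name>/he/patches_h224_w224/hdf5_<name>_he_train.h5
dataset = "external_cohort"  # hypothetical name to pass as the dataset argument
dst_dir = os.path.join("datasets", dataset, "he", "patches_h224_w224")
os.makedirs(dst_dir, exist_ok=True)
dst = os.path.join(dst_dir, "hdf5_%s_he_train.h5" % dataset)
shutil.copy("real_hdf5.h5", dst)
print(os.path.exists(dst))
```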

Otherwise, you can find the dataset with the LUAD & LUSC 250K tiles here. You can set up the directory with it and then run the tile vector representations.

I hope this helps. Thanks, Adal