Closed BenPashley closed 1 month ago
Thank you for your interest.
The DINOv2 (CLAM) extractor provides the initial feature representations, which are saved to your storage as .pt files. Each .pt file has shape (sequence_length, hidden_dim), where sequence_length is the number of patches in the WSI and hidden_dim is 1024 for DINOv2.
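To make the expected layout concrete, here is a minimal sketch for checking the shape of one feature file. It assumes each .pt file holds a single 2-D tensor. The file name and patch count are hypothetical, and a random tensor stands in for real extracted features.

```python
import os
import tempfile

import torch

HIDDEN_DIM = 1024  # DINOv2 hidden size, as stated in this thread
num_patches = 37   # hypothetical patch count for one WSI

# Stand-in for an extracted feature file; real files come from the CLAM pipeline.
fake_features = torch.randn(num_patches, HIDDEN_DIM)

with tempfile.TemporaryDirectory() as tmp_dir:
    pt_path = os.path.join(tmp_dir, "example_slide.pt")  # hypothetical file name
    torch.save(fake_features, pt_path)
    # Load the tensor back the same way a saved feature file would be read.
    features = torch.load(pt_path, map_location="cpu")

sequence_length, hidden_dim = features.shape
print(sequence_length, hidden_dim)
```

For a real slide, replace pt_path with the path to one of your own .pt files and confirm that the second dimension is 1024.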
Using these pre-extracted features as input, the hierarchical visual encoder module further processes them to obtain better representations.
As indicated in the paper, the final output of the hierarchical visual encoder module is a tensor with shape (region_number, hidden_dim), which pools the input sequence into several region representations. This is not implemented in CLAM, because feature extraction generally refers to pretrained feature extractors like DINOv2, whereas the hierarchical visual encoder should be seen as a multi-instance learning method.
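To illustrate only the shape change described above, here is a rough sketch that reduces (sequence_length, hidden_dim) features to (region_number, hidden_dim). It is not the hierarchical visual encoder from histgen_modules.py: plain mean pooling over contiguous chunks of patches is used as a stand-in, and the function name pool_into_regions and the value region_number=8 are hypothetical.

```python
import torch


def pool_into_regions(features: torch.Tensor, region_number: int) -> torch.Tensor:
    """Mean-pool (sequence_length, hidden_dim) into (region_number, hidden_dim).

    Stand-in only: the actual module learns its region representations.
    """
    if features.dim() != 2:
        raise ValueError("expected a 2-D tensor of patch features")
    if features.shape[0] < region_number:
        raise ValueError("need at least one patch per region")
    # Split the patch sequence into region_number contiguous, near-equal chunks.
    chunks = torch.tensor_split(features, region_number, dim=0)
    # Average the patches inside each chunk to get one vector per region.
    return torch.stack([chunk.mean(dim=0) for chunk in chunks])


features = torch.randn(1000, 1024)  # hypothetical WSI with 1000 patches
regions = pool_into_regions(features, region_number=8)
print(regions.shape)
```

The learned module would replace the mean over each chunk. The sketch is meant only to show how the patch dimension collapses into a small number of region vectors while hidden_dim stays the same.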
If you want to include the hierarchical visual encoder in the feature extraction procedure, you would need to take this module out of histgen_modules.py and implement it in CLAM.
Many thanks. I was a little confused by the visual extractor class that is initialised in HistGenModel but not used. I thought it was involved.
So, I've trained the model using your weights and extracted features, and the results are not quite what I was expecting. I appreciate that the wording will not be that concise, as it's based on abbreviated text from your pre-processed ground truth reports, but it still doesn't appear accurate and is quite mixed (I've had a senior histopathologist review it). Would you mind checking the attached files and letting me know if these are similar to your own results?
Hi, for your first question: the visual extractor class is a legacy class and is not used. We may refactor the code in the future and remove it to improve readability.
For your second question: the generated reports seem inaccurate and mixed because the ground truth is not good enough. The provided ground truth should be processed further to improve its readability. However, at the time we completed this work, we only used the provided version of the ground truth. One suggestion, therefore, is to use language model APIs to further preprocess the ground truth reports (for TCGA). You could also train your own model if you have private paired WSI-report data. In short, the performance of the report generation model relies heavily on the given ground truth.
Hello again,
You previously mentioned:
'we assume that DINOv2 provides good enough initial features. Yet for report generation, more delicately designed visual encoding procedure is needed. That's why we proposed the hierarchical visual encoding module to learn the features of WSI in both fine- and coarse-grained ways.'
Could you explain how I can generate embeddings using the visual encoder module and CLAM (or otherwise)? I don't see any example of this in your code. I'm very interested in leveraging your framework with my own dataset for further evaluation.