VinAIResearch / 3D-UCaps

3D-UCaps: 3D Capsules Unet for Volumetric Image Segmentation (MICCAI 2021)
Apache License 2.0

How to compute the metrics between testset predictions and true labels? #3

Open sindhura234 opened 2 years ago

sindhura234 commented 2 years ago

I am using my custom data. After training, how can I compute the metrics between the test set predictions and the true labels? I am using the Hippocampus data loader provided by you, but I have imagesTr and labelsTr for training, and imagesTs and labelsTs for testing. I want to compute metrics for the test set.

sindhura234 commented 2 years ago

Understood a bit. Cross-validation can be done with the --fold arg within the training set imagesTr. What about the final evaluation? I have ground truth for the evaluation data as well. How do I get the results?

hoangtan96dl commented 2 years ago

Hello @cndu234, have you taken a look at the evaluate.py file, which contains the evaluation process that produces the Dice coefficient, precision, and recall metrics? Currently I have not written the part that reads samples from the test folders (imagesTs, labelsTs), so I think you can add that part and return it in test_dataloader of the LightningDataModule. After that, to use the test dataloader in evaluate.py, you can change these lines from

data_module.setup("validate")
val_loader = data_module.val_dataloader()

to

data_module.setup("test")
test_loader = data_module.test_dataloader()
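
A rough sketch of what that added test part could look like in the LightningDataModule (the helper _load_test_data_dicts and the attribute self.testset are illustrative names, not existing code; CacheDataset and DataLoader as already used in the datamodule):

def setup(self, stage=None):
    ...
    if stage == "test":
        # illustrative: build {"image": ..., "label": ...} pairs from imagesTs/labelsTs
        test_data_dicts = self._load_test_data_dicts()
        self.testset = CacheDataset(
            data=test_data_dicts,
            transform=self.val_transforms,   # reuse the validation transforms as a starting point
            cache_rate=self.cache_rate,
            num_workers=self.num_workers,
        )

def test_dataloader(self):
    return DataLoader(self.testset, batch_size=1, num_workers=self.num_workers)
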
noushinha commented 2 years ago

Hello @hoangtan96dl and @cndu234, I am also developing a piece of code to do the evaluation on the test dataset. I am not very familiar with PyTorch Lightning, so please ignore this completely if it does not make sense.

As far as I understood, there is a data module that can be set up for the fit, validation, and test stages. Setting the stage of the data module to test is a bit strange to me, because if you set the stage to test then you should call trainer.test(model) and not trainer.predict(model, datamodule/dataloader). You can check the examples here. This will just calculate the test loss/accuracy, and the following will be printed:

--------------------------------------------------------------
TEST RESULTS
{'test_accuracy': 0.7894, 'test_loss': 1.1703}
--------------------------------------------------------------
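
In code, that path looks roughly like this (illustrative only; the metric names depend on what the LightningModule logs in its test_step):

# requires the LightningModule to define test_step
# and the LightningDataModule to define test_dataloader
results = trainer.test(model, datamodule=data_module)
# Trainer.test returns a list with one dict of logged metrics per test dataloader
print(results[0])   # e.g. {'test_accuracy': ..., 'test_loss': ...}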

What I was looking for was predicting on the test set to collect the predicted label maps (image outputs) and the class-based Dice coefficients, precision, and recall. As far as I understood, this is not possible by setting the stage to "test". So I set the stage to "validate". But then the problem arises that the data module reads from the "training" key of the dataset.json file and not from the "test" key. So I generated a new dataset.json file where the "training" key lists only the images in the imagesTs folder and the labels in the labelsTs folder as pairs. So the JSON file will look like this:

...
"training": [  # <- you should not have training data here; put the test data here to be able to predict
    {"image": "./imagesTs/first_test_image.nii.gz", "label": "./labelsTs/first_test_label.nii.gz"},
    {"image": "./imagesTs/second_test_image.nii.gz", "label": "./labelsTs/second_test_label.nii.gz"},
    ...
],
"test": []  # <- you can have anything here; I could not find a way to use this key for prediction

Then in the evaluate file I call data_module.setup("validate", flag="test"), and I modified the setup function in the data module file as follows:

def setup(self, stage=None, flag="validate"):
    ...
    elif stage == "validate":
        if flag == "validate":
            _, val_data_dicts = self._load_data_dicts(train=True, flag="validate")
        else:
            val_data_dicts = self._load_data_dicts(train=True, flag="test")
        self.valset = CacheDataset(
            data=val_data_dicts,
            transform=self.val_transforms,
            cache_rate=self.cache_rate,
            num_workers=self.num_workers,
        )

I could have had a self.test_transforms and so on, but I preferred to keep things simple until I understand the script better. You need to change _load_data_dicts accordingly:

        if train:
            labels = sorted(glob.glob(os.path.join(self.root_dir, "segs", "*.mhd")))
            data_dicts = [{"image": img_name, "label": label_name} for img_name, label_name in zip(images, labels)]
            if flag == "validate":
                data_dicts_list = partition_dataset(data_dicts, num_partitions=4, shuffle=True, seed=0)
                train_dicts, val_dicts = [], []
                for i, data_dict in enumerate(data_dicts_list):
                    if i == self.fold:
                        val_dicts.extend(data_dict)
                    else:
                        train_dicts.extend(data_dict)
                return train_dicts, val_dicts
            elif flag == "test":
                # no fold split for the test data: keep everything in a single partition, in order
                data_dicts_list = partition_dataset(data_dicts, num_partitions=1, shuffle=False, seed=0)
                val_dicts = []
                for i, data_dict in enumerate(data_dicts_list):
                    val_dicts.extend(data_dict)
                return val_dicts

I turned off shuffle because in my dataset I am predicting over small subvolumes of a large volume that have to be stitched back together later, so when attaching the subvolumes I want to make sure the sequence of the outputs matches that of the inputs.

My test_dataloader looks like this:

def test_dataloader(self):
    return DataLoader(self.valset, batch_size=1, num_workers=self.num_workers)

You can use the one for val_dataloader; I just added a separate one to keep things clear. So in the evaluate file it will look like this:

data_module.setup("validate", flag="test")
val_loader = data_module.test_dataloader()

I know it is not logical to have test data under the training key of the JSON, but I have worked on this for days and it seems there is a lack of support from the framework, or I am a newbie and don't know how to do it the right way. This worked for me, though: while prediction on the validation set is perfect, it fails on the test set, suggesting that the model overfits on my data. Please update me if you find a straightforward approach.

hoangtan96dl commented 2 years ago

Thank you @noushinha for your comment. Let me clarify a little bit about the pytorch-lightning framework. If you want to strictly follow the PyTorch Lightning way, then you must follow their rules (which means you should take the time to learn their framework).

For example, as you pointed out, if I want to let Trainer handle the testing phase for me with Trainer.test, then I have to implement the corresponding hooks (test_step in the LightningModule and test_dataloader in the LightningDataModule).

The benefit of this is that my main function will be very short and clear, as you can see in the train.py file. If I have a new dataset or a new model, the logic for loading that data and for how the model runs is self-contained in the module, and I don't need to change much in the train.py file.
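
As an illustration of that structure (the class names below are placeholders, not the actual ones from this repo):

import pytorch_lightning as pl

# placeholder names: substitute the real LightningModule / LightningDataModule classes
model = MySegmentationModule()
data_module = MyDataModule(root_dir="path/to/data", fold=0)

trainer = pl.Trainer(max_epochs=100)
trainer.fit(model, datamodule=data_module)      # training + validation loops
trainer.test(model, datamodule=data_module)     # test loop via test_step/test_dataloader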

However, another option is to use Trainer.predict for more flexibility. This is the case in my evaluate.py file.
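
For completeness, since the original question was how to get the metrics between test predictions and true labels, here is a rough sketch of that comparison done with a plain loop over the test loader (not a copy of evaluate.py; num_classes, roi_size, model, and test_loader are placeholders, and it assumes a recent MONAI version where AsDiscrete takes to_onehot as an integer):

import torch
from monai.inferers import sliding_window_inference
from monai.metrics import DiceMetric
from monai.transforms import AsDiscrete

num_classes = 3                      # placeholder: number of classes in your dataset
roi_size = (32, 32, 32)              # placeholder: the patch size used during training
dice_metric = DiceMetric(include_background=False, reduction="mean")
post_pred = AsDiscrete(argmax=True, to_onehot=num_classes)
post_label = AsDiscrete(to_onehot=num_classes)

model.eval()
with torch.no_grad():
    for batch in test_loader:
        images, labels = batch["image"], batch["label"]
        # patch-based inference over the whole volume
        outputs = sliding_window_inference(images, roi_size=roi_size, sw_batch_size=1, predictor=model)
        dice_metric(
            y_pred=[post_pred(o) for o in outputs],   # one-hot predictions, one tensor per sample
            y=[post_label(l) for l in labels],        # one-hot ground truth
        )

print("mean Dice over the test set:", dice_metric.aggregate().item())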

Another suggestion: if you want to use a custom dataset, you should look at the datamodule of the iSeg or LUNA dataset. The hippocampus and cardiac datasets use the helper function load_decathlon_datalist from the MONAI library, which is why they require the json file in the correct format.

I know it is pretty hard to understand and modify if you are not familiar with these frameworks, but there are two reasons I want to use them:

noushinha commented 2 years ago

Thanks a lot @hoangtan96dl. As I mentioned, I was looking for a quick solution to see whether I should continue working on the model for my custom dataset. That is why I chose a naive approach based on the limited knowledge I had of the frameworks involved. I want to repeat that my solution is neither a general, straightforward solution nor the best one. From my point of view, with the whole storm of new frameworks released on a monthly basis, it is not wise to sit down and learn all of them. I have been developing in PyTorch for a while and could relate PyTorch Lightning to it a bit, which is also how it is presented on their website. What you explained is quite helpful for coming up with a smart solution for test evaluation. Now that I am sure further development might be useful on my data, I started by writing test_step(self, batch, batch_idx), which follows pseudocode like this:

test_outs = []
for test_batch in test_data:
    out = test_step(test_batch)
    test_outs.append(out)
test_epoch_end(test_outs)
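
For what it is worth, a hypothetical test_step/test_epoch_end pair along those lines could look like this (the attributes self.dice_metric, self.post_pred, self.post_label, and self.val_patch_size are illustrative names, set up e.g. in __init__ as in the MONAI-based sketch above):

def test_step(self, batch, batch_idx):
    images, labels = batch["image"], batch["label"]
    outputs = sliding_window_inference(
        images, roi_size=self.val_patch_size, sw_batch_size=1, predictor=self.forward
    )
    # accumulate the per-batch Dice inside the metric object
    self.dice_metric(
        y_pred=[self.post_pred(o) for o in outputs],
        y=[self.post_label(l) for l in labels],
    )

def test_epoch_end(self, outputs):
    # aggregate everything accumulated in test_step and log it
    self.log("test_dice", self.dice_metric.aggregate().item())
    self.dice_metric.reset()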

Thanks again for the repository. I always learn from others.

sindhura234 commented 2 years ago

Thank you @noushinha @hoangtan96dl, I will try these out.