facebookresearch / fastMRI

A large-scale dataset of both raw MRI measurements and clinical MRI images.
https://fastmri.org
MIT License

Results do not quite match the manuscript's NMSE and PSNR scores #129

Closed · lpzhang closed this issue 3 years ago

lpzhang commented 3 years ago

Great work on accelerating MRI, and thanks for releasing the first large-scale raw MRI dataset and benchmarks with source code. We tried to reproduce the benchmark results using the provided code. Compared to the results reported in Table 8 for the single-coil U-Net baseline on the knee validation data, our experimental results differ noticeably in the NMSE and PSNR metrics. The details are listed as follows:

[image: table comparing our reproduced NMSE, PSNR, and SSIM scores with the Table 8 benchmarks]

Our experiments achieved the same SSIM score but disagreed in the NMSE and PSNR scores. I am very curious about this divergence and would kindly ask for your help in solving the puzzle. Thank you very much.

anuroopsriram commented 3 years ago

@lpzhang Thanks for flagging this. The code has changed a lot in the last few months and there may have been a regression. We haven't trained single-coil models in a long time. Can you try using an old commit (say, https://github.com/facebookresearch/fastMRI/tree/31d6e737efcd9115cb0f028e36feb9e4422669c2)?

mmuckley commented 3 years ago

I'm skeptical of a regression. The current code generates better reconstructions on all metrics on the test data, as can be seen from my submission for the single-coil knee on the public leaderboard.

mmuckley commented 3 years ago

@lpzhang In addition to trying Anuroop's suggestion, could you let me know the exact code you ran from the current repository (commit hash, link to file, hyperparameters, etc.)?

lpzhang commented 3 years ago

@anuroopsriram Thanks for your reply and advice. I will try running the single-coil models from the old commit and report back.

lpzhang commented 3 years ago

@mmuckley Thanks for your reply. The code for training the single-coil models is based on the latest commit: https://github.com/facebookresearch/fastMRI/tree/4254a6aacd920fd0c30e1b8876614e24fb51fb82

The training setup was as follows:

Config: instead of using 32 GPUs with a total batch size of 32 as in the original scripts, we trained the models with far fewer GPUs but the same total batch size (e.g., 2 GPUs with a total batch size of 32). Such a config slows down training but should not affect accuracy, since instance normalization is used. Results with different GPU counts, batch sizes, and epochs are listed below; the batch-size arithmetic is sketched after the table.

[image: table of results for different GPU counts, batch sizes, and epochs]
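
To make the batch-size arithmetic explicit, a minimal sketch (variable names are illustrative, not from the repo):

```python
# Keeping the *total* batch size at 32 while using fewer GPUs. With
# data-parallel training, effective batch = per-GPU batch * num GPUs, so
# 2 GPUs need 16 samples each to match the original 32-GPU x 1-sample setup.
num_gpus = 2
total_batch_size = 32
per_gpu_batch_size = total_batch_size // num_gpus  # 16
assert per_gpu_batch_size * num_gpus == total_batch_size
```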

mmuckley commented 3 years ago

@lpzhang Are you using the validation loop in the Lightning module to calculate these metrics, or are you generating HDF5 files for two datasets and using evaluate.py?
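
(By the second path I mean roughly the following sketch; the per-file HDF5 layout with a "reconstruction" dataset follows the fastMRI convention as I recall it, so treat the details as an assumption and verify against your checkout.)

```python
# Hedged sketch of the evaluate.py path: write each volume's reconstruction
# to its own HDF5 file, then compare against the ground-truth files offline.
from pathlib import Path

import h5py


def save_reconstructions(reconstructions: dict, out_dir: Path) -> None:
    """reconstructions maps file name -> (num_slices, height, width) array."""
    out_dir.mkdir(exist_ok=True, parents=True)
    for fname, recons in reconstructions.items():
        with h5py.File(out_dir / fname, "w") as hf:
            # "reconstruction" is the dataset key evaluate.py expects,
            # as far as I recall -- check your version.
            hf.create_dataset("reconstruction", data=recons)
```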

lpzhang commented 3 years ago

@mmuckley All the above results are from the validation loop during training. Is there any difference between the two methods?

mmuckley commented 3 years ago

Yeah, so after your issue I checked it, and I think there are some bugs with respect to aggregation. Both metrics are computed on a slice basis. NMSE normalizes by the norm of the ground truth, but right now it uses the slice norm rather than the volume norm. PSNR requires a maximum value, and it currently uses the maximum of the slice rather than the volume.
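
Concretely, the volume-level quantities we want look roughly like this (a minimal NumPy sketch of the intent, not the repo code):

```python
import numpy as np


def nmse(gt: np.ndarray, pred: np.ndarray) -> float:
    # Normalize by the norm of the *whole volume*, not each slice.
    return np.linalg.norm(gt - pred) ** 2 / np.linalg.norm(gt) ** 2


def psnr(gt: np.ndarray, pred: np.ndarray) -> float:
    # Use the maximum over the *whole volume* as the peak value.
    mse = np.mean((gt - pred) ** 2)
    return 10 * np.log10(gt.max() ** 2 / mse)
```

The validation loop has effectively been computing these per slice and then averaging, which is a different quantity.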

We mostly use SSIM, so SSIM is pretty well-debugged at this point for slice aggregation (thanks to @z-fabian), but it looks like we still have issues with NMSE and PSNR. I'm sorry about this.

I have a fix in the works and will put in a pull request if I can verify NMSE/PSNR numbers better than the paper's.

evaluate.py should be bug-free, as all of its inputs are image volumes, so the maximum value and the volume norm are calculated correctly.
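
As a toy illustration of how far the two aggregations can drift (synthetic data, illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(0)
gt = rng.uniform(0.0, 1.0, (10, 64, 64))  # a 10-slice "volume"
gt[3] *= 5.0                              # one bright slice sets the volume max
pred = gt + 0.01 * rng.standard_normal(gt.shape)

# Volume-level PSNR vs. the mean of per-slice PSNRs.
vol_psnr = 10 * np.log10(gt.max() ** 2 / np.mean((gt - pred) ** 2))
slice_psnr = np.mean([
    10 * np.log10(s.max() ** 2 / np.mean((s - p) ** 2))
    for s, p in zip(gt, pred)
])
print(vol_psnr, slice_psnr)  # the two disagree whenever slice maxima differ
```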

mmuckley commented 3 years ago

Okay, I think this is fixed in PR #130. In my TensorBoard logs, the metrics from that PR are much closer to those on the public leaderboard. Let me know if it works for you, @lpzhang.

lpzhang commented 3 years ago

@mmuckley Thanks for your reply. It makes sense now that the maximum value was not taken over the volume. I hadn't noticed that before, since the validation data uses the VolumeSampler. I am sorry about this. Please keep me posted. Thanks again.
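
(For context, VolumeSampler is fastMRI's distributed sampler that keeps all slices of a volume on the same process, which is what makes volume-level aggregation possible. A rough sketch of the idea, not the actual implementation:)

```python
from collections import defaultdict


def assign_volumes(slice_indices, volume_id_of, world_size: int, rank: int):
    # Group slice indices by their parent volume, then hand out whole
    # volumes round-robin so each rank sees complete volumes and can
    # compute volume-level statistics (max value, norm) locally.
    volumes = defaultdict(list)
    for idx in slice_indices:
        volumes[volume_id_of(idx)].append(idx)
    mine = []
    for i, vol in enumerate(sorted(volumes)):
        if i % world_size == rank:
            mine.extend(volumes[vol])
    return mine
```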

lpzhang commented 3 years ago

@mmuckley Thank you very much for solving the puzzle. I will let you know if there are any updates.

lpzhang commented 3 years ago

@anuroopsriram @mmuckley Thank you very much. The updated aggregation method in the validation loop (#130) works fine. The generated reconstructions are even slightly better than those in the original manuscript across all metrics on the single-coil knee task.

mmuckley commented 3 years ago

@lpzhang That's great news. Let us know if you encounter further issues.

For those wondering why performance is better, it could be PR #123.