Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

MONAI's CRF takes too long on CPU #2250

Closed masadcv closed 10 months ago

masadcv commented 3 years ago

Describe the bug: I am running MONAI's CRF implementation on CPU on a 3D volume of size (120, 150, 100). It takes 177.5296 sec to run on CPU using MONAI's implementation. The same can be achieved in 5.8184 sec using SimpleCRF's implementation from: https://github.com/HiLab-git/SimpleCRF

I have set up a test script to replicate this here: https://gist.github.com/masadcv/84f1bc9f505056ea8f4290d14a002d2a

It also seems that MONAI's implementation takes significantly more memory on CPU compared to SimpleCRF. Not sure if that is expected, but it may be worth investigating if possible.
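For reference, a minimal sketch of the MONAI side of the timing comparison (the keyword argument `iterations` and the tensor layouts are my assumptions about `monai.networks.blocks.CRF` in the version reported below, not code copied from the gist):

```python
import time
import torch
from monai.networks.blocks import CRF  # requires MONAI built with BUILD_MONAI=1

# Synthetic stand-ins for the gist's data: a 2-class unary term and a
# single-channel reference volume of the size quoted above.
unary = torch.rand(1, 2, 120, 150, 100)      # [B, C, D, H, W] class logits
reference = torch.rand(1, 1, 120, 150, 100)  # [B, 1, D, H, W] image

crf = CRF(iterations=5)  # kwarg name assumed from monai.networks.blocks.crf

start = time.time()
output = crf(unary, reference)  # runs on CPU since both tensors live on CPU
print(f"MONAI CRF (CPU): {time.time() - start:.4f} sec")
```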

To Reproduce: Steps to reproduce the behavior:

  1. Download test script from: https://gist.github.com/masadcv/84f1bc9f505056ea8f4290d14a002d2a
  2. Install MONAI with `BUILD_MONAI=1 pip -q install git+https://github.com/Project-MONAI/MONAI#egg=monai`
  3. Install other required packages: `pip install simplecrf nibabel wget`
  4. Run the test script: `python testscript.py`

Expected behavior: I expect the two implementations (MONAI CRF vs SimpleCRF) to be in the same/similar ballpark in terms of execution time. At the moment, MONAI's implementation seems orders of magnitude slower.

Environment

Ensuring you use the relevant python executable, please paste the output of:

python -c 'import monai; monai.config.print_debug_info()'
================================
Printing MONAI config...
================================
MONAI version: 0.5.2+67.g013186d
Numpy version: 1.20.3
Pytorch version: 1.8.1+cu102
MONAI flags: HAS_EXT = True, USE_COMPILED = False
MONAI rev id: 013186dd9d0408026c38b4c7a75ee34e031b13d1

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.2.1
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 8.2.0
Tensorboard version: NOT INSTALLED or UNKNOWN VERSION.
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.9.1+cu102
ITK version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: NOT INSTALLED or UNKNOWN VERSION.
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...
================================
`psutil` required for `print_system_info`

================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 10.2
cuDNN enabled: True
cuDNN version: 7605
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70']
GPU 0 Name: Quadro RTX 3000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 30
GPU 0 Total memory (GB): 5.8
GPU 0 CUDA capability (maj.min): 7.5

cc: @charliebudd @tvercaut

charliebudd commented 3 years ago

There's a quick fix I've been meaning to commit, and then there's the optimisation of the PHL message passing, which is a longer job I'm working on in the background. Optimisation has mainly been focused on the GPU implementation, with the CPU as a fallback, but I think it is reasonable to expect both to perform well.

tvercaut commented 3 years ago

Naive first question, but is the C++ code compiled with optimisation on? I can't see anything like -O2 or -O3 in setup.py, but I guess it may come from elsewhere.
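For reference, optimisation flags can be set explicitly when building a torch C++ extension; a minimal sketch with placeholder module and source names (this is not MONAI's actual setup.py):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="example_ext",  # placeholder package name
    ext_modules=[
        CppExtension(
            name="example_ext._C",
            sources=["csrc/ext.cpp"],    # placeholder source file
            extra_compile_args=["-O3"],  # explicit optimisation level
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```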

charliebudd commented 3 years ago

We compile the C++ extension with torch's setup tools wrapper. I believe this handles these things; off the top of my head I think it's -O2. This PR #2261 implements the quick fix I alluded to earlier. While I have not tested it against SimpleCRF, it does now run at the same order of magnitude as the CRF-as-RNN implementation, and produces identical (by eye) results. When the JIT system is in, I'll move the PHL over to there and make my optimisations. The main one I've planned is to separate the construction of the lattice from the application of it. As the CRF iterates over the same PHL filter with the same features, this means we only need to construct it once.
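A minimal sketch of that construct-once, apply-many pattern (hypothetical names, with a separable Gaussian standing in for the real permutohedral filter; this is not MONAI code):

```python
import torch
import torch.nn.functional as F

class CachedFilter:
    """Hypothetical stand-in for a PHL filter: everything that depends only
    on the (fixed) features is built once, so each CRF iteration pays only
    for the cheap application step."""

    def __init__(self, sigma: float = 5.0):
        # "Construction": a real PHL would build its splat/blur/slice tables
        # from the feature vectors here; we just precompute a 1D kernel.
        radius = int(3 * sigma)
        x = torch.arange(-radius, radius + 1, dtype=torch.float32)
        kernel = torch.exp(-0.5 * (x / sigma) ** 2)
        self.kernel = (kernel / kernel.sum()).view(1, 1, -1)

    def apply(self, values: torch.Tensor) -> torch.Tensor:
        # "Application": reused unchanged on every CRF iteration.
        return F.conv1d(values, self.kernel, padding=self.kernel.shape[-1] // 2)

phl = CachedFilter()       # constructed once
x = torch.rand(1, 1, 256)
for _ in range(5):         # CRF iterations reuse the cached filter
    x = phl.apply(x)
```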

tvercaut commented 3 years ago

Nice. I guess your comparison against crf-as-rnn is in 2D, and I guess crf-as-rnn only works in 2D out of the box, right? It would be worth checking against SimpleCRF in 3D especially.

I expect the runtime of SimpleCRF in 2D to be similar to crf-as-rnn.

The crf-as-rnn implementation uses the code from Philipp Krähenbühl for the PHL: https://github.com/sadeepj/crfasrnn_pytorch/blob/master/crfasrnn/permutohedral.h but does the outer loop in Python.

In 2D, SimpleCRF wraps the entire CRF code from Philipp Krähenbühl, which includes the same PHL code: https://github.com/HiLab-git/SimpleCRF/tree/master/dependency/densecrf

In 3D, SimpleCRF wraps Kostas Kamnitsas's extension of Philipp Krähenbühl's CRF code: https://github.com/HiLab-git/SimpleCRF/tree/master/dependency/densecrf3d
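For reference, SimpleCRF's 3D entry point looks roughly like this (parameter names and array layouts follow my reading of SimpleCRF's demo scripts and should be treated as assumptions):

```python
import numpy as np
import denseCRF3D  # installed via `pip install simplecrf`

# I: uint8 image [D, H, W, C]; P: float32 class probabilities [D, H, W, L]
I = np.zeros((100, 120, 150, 1), dtype=np.uint8)
P = np.full((100, 120, 150, 2), 0.5, dtype=np.float32)

param = {
    "MaxIterations": 5.0,
    "PosW": 2.0, "PosRStd": 5, "PosCStd": 5, "PosZStd": 5,
    "BilateralW": 3.0, "BilateralRStd": 5.0, "BilateralCStd": 5.0,
    "BilateralZStd": 5.0, "ModalityNum": 1, "BilateralModsStds": (5.0,),
}
labels = denseCRF3D.densecrf3d(I, P, param)  # returns a [D, H, W] label map
```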

vikashg commented 10 months ago

Closing because of inactivity.