Lightning-AI / torchmetrics

Torchmetrics - Machine learning metrics for distributed, scalable PyTorch applications.
https://lightning.ai/docs/torchmetrics/
Apache License 2.0

Calculating nDCG on GPU is 2x slower than on CPU #2287

Open donglihe-hub opened 8 months ago

donglihe-hub commented 8 months ago

🐛 Bug

Hi TorchMetrics Team,

In the following example, nDCG calculation using GPU tensors takes about twice as long as using CPU tensors or a NumPy array.

To Reproduce

The code was tested on both Google Colab and a Slurm cluster.

Code sample:

```python
import timeit

import numpy as np
import torch
from sklearn.metrics import ndcg_score
from torchmetrics.functional.retrieval import retrieval_normalized_dcg

# p and t are examples given by both sklearn and torchmetrics
p = [.1, .2, .3, 4, 70] * 100
t = [10, 0, 0, 1, 5] * 100

number = int(1e4)

# 1. BENCHMARK: numpy array
preds = np.asarray([p])
target = np.asarray([t])

def a():
    return ndcg_score(target, preds)

print(f'numpy array: {timeit.timeit("a()", setup="from __main__ import a", number=number):.4f}')

# 2. CPU tensor
preds_cpu = torch.tensor(p)
target_cpu = torch.tensor(t)
assert preds_cpu.device == torch.device("cpu")

def b():
    return retrieval_normalized_dcg(preds_cpu, target_cpu)

print(f'CPU tensor: {timeit.timeit("b()", setup="from __main__ import b", number=number):.4f}')

# 3. GPU tensor
preds_gpu = torch.tensor(p, device="cuda")
target_gpu = torch.tensor(t, device="cuda")
assert preds_gpu.device == torch.device("cuda:0")

def c():
    return retrieval_normalized_dcg(preds_gpu, target_gpu)

print(f'GPU tensor: {timeit.timeit("c()", setup="from __main__ import c", number=number):.4f}')
```

Results (Tesla T4):

```
numpy array: 6.4896
CPU tensor: 5.8501
GPU tensor: 10.4120
```

I also tested the code on the Slurm cluster I'm currently using, where the GPU is an A100:

```
numpy array: 3.8700
CPU tensor: 2.9305
GPU tensor: 7.7575
```

Expected behavior

The performance of the calculation using GPU tensors should be, if not superior, at least close to that of CPU tensors.

Environment

Additional context

Borda commented 8 months ago

nDCG calculation using GPU tensors takes about twice as long as using CPU tensors or a NumPy array

Thank you for bringing this up. Have you also observed it with other metrics than NDCG?

donglihe-hub commented 8 months ago

Thank you for bringing this up. Have you also observed it with other metrics than NDCG?

I had only tested NDCG at the time of submitting the issue, but now I understand the cause of the issue.

The inferior GPU performance results from the fact that the current implementation of NDCG does not utilize the parallel computation provided by the GPU: TorchMetrics NDCG only accepts 1D tensors as input.

To test this observation, I tried another metric, multilabel_precision. The results show that calculation on GPU is faster than on CPU when there are hundreds of instances; however, when there is only one instance, calculation on CPU is faster than on GPU.

Scripts for multilabel_precision performance test

```python
import timeit

import torch
from torchmetrics.functional.classification import multilabel_precision

number = int(1e3)

# change 400 to 1 for comparison experiments
y_true = torch.randint(2, (400, 300))
y_pred = torch.randint(2, (400, 300))

# CPU tensor
target_cpu = y_true.clone().detach()
preds_cpu = y_pred.clone().detach()

assert target_cpu.device == torch.device("cpu")

def cpu():
    return multilabel_precision(preds_cpu, target_cpu, num_labels=300)

print(f'CPU tensor: {timeit.timeit("cpu()", setup="from __main__ import cpu", number=number):.4f}')

# GPU tensor
target_gpu = y_true.clone().detach().to(device="cuda")
preds_gpu = y_pred.clone().detach().to(device="cuda")

assert target_gpu.device == torch.device("cuda:0")

def gpu():
    return multilabel_precision(preds_gpu, target_gpu, num_labels=300)

print(f'GPU tensor: {timeit.timeit("gpu()", setup="from __main__ import gpu", number=number):.4f}')
```

Results:

```
# 400 instances
CPU tensor: 3.6518
GPU tensor: 0.8089

# 1 instance
CPU tensor: 0.1848
GPU tensor: 0.6217
```

Is there any particular reason that torchmetrics NDCG only accepts a single instance instead of a batch? If not, I suggest that NDCG should accept batched inputs.
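
For illustration, here is a rough sketch of what a batched nDCG could look like in pure PyTorch; the helper name `batched_ndcg` and the linear-gain formulation are my own assumptions, not the TorchMetrics API:

```python
import torch


def batched_ndcg(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Hypothetical batched nDCG over (num_queries, list_size) tensors, linear gain."""
    # discount 1 / log2(rank + 1) for ranks 1..list_size, shared across all rows
    discount = 1.0 / torch.log2(
        torch.arange(preds.shape[-1], device=preds.device, dtype=torch.float64) + 2.0
    )
    # DCG: true relevance gathered in the order induced by the predicted scores
    order = preds.argsort(dim=-1, descending=True)
    dcg = (target.gather(-1, order).to(torch.float64) * discount).sum(dim=-1)
    # ideal DCG: relevance sorted by the true relevance itself
    ideal = target.sort(dim=-1, descending=True).values.to(torch.float64)
    idcg = (ideal * discount).sum(dim=-1)
    return torch.where(idcg > 0, dcg / idcg, torch.zeros_like(dcg))


# one score per query row, computed without any Python-level loop
preds = torch.rand(8000, 40)
target = torch.randint(5, (8000, 40))
print(batched_ndcg(preds, target).shape)  # torch.Size([8000])
```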

SkafteNicki commented 7 months ago

@donglihe-hub thanks for reporting this issue. Sorry for the long reply time from my side. I have been looking at the implementation of our metric for a bit of time now, and it is not correct that the implementation does not use parallel computations on GPU. Just because the input is 1D does not mean that the computations cannot be parallelized. For example, doing a simple sum is equally fast regardless of whether the input is a 1D or a 2D tensor.
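
(The benchmark screenshot that accompanied this comment is not reproduced here; a rough sketch of that kind of 1D vs. 2D sum comparison, with shapes and repeat counts chosen by me, could look like the following.)

```python
import timeit

import torch

# rough sketch: time a reduction over the same data viewed as 1D and as 2D
x1d = torch.rand(1_000_000, device="cuda")
x2d = x1d.reshape(1000, 1000)

def time_sum(x, number=1000):
    def run():
        x.sum()
        torch.cuda.synchronize()  # make sure the kernel time is included
    return timeit.timeit(run, number=number)

print(f"1D sum: {time_sum(x1d):.4f}s  2D sum: {time_sum(x2d):.4f}s")
```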

Looking at the code, it seems that the operation that takes up most of the computational time is the torch.unique call used here. From small experiments, it seems that this operation alone is the bottleneck: the torch GPU implementation is ~15 times slower for large arrays.
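
(Again, the attached benchmark image is omitted; a rough sketch of a comparable torch.unique micro-benchmark, with the array size and repeat count as my own assumptions:)

```python
import timeit

import torch

# rough sketch of a torch.unique CPU vs. GPU comparison on a "large" array
x_cpu = torch.randint(1000, (1_000_000,))
x_gpu = x_cpu.to("cuda")

def unique_cpu():
    torch.unique(x_cpu)

def unique_gpu():
    torch.unique(x_gpu)
    torch.cuda.synchronize()  # include the kernel time, not just the launch

print(f"CPU torch.unique: {timeit.timeit(unique_cpu, number=1000):.4f}")
print(f"GPU torch.unique: {timeit.timeit(unique_gpu, number=1000):.4f}")
```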

I am not sure whether we can actually optimize the code, or whether the operations used in the nDCG metric simply do not parallelize that well on GPU. I will try to investigate further.

hengdashi commented 4 months ago

Hi!

I'm running into the same issue: the nDCG metric calculation takes too long and becomes impractical to use while training. Calculating the nDCG metric at every step with a tensor of size around (8000, 40) [batch_size, list_size] takes about 2 s to complete, which is far longer than the model's forward pass.

After looking into the metric class implementation, I believe it is not because of the torch.unique function but a fundamental design flaw of RetrievalMetric: the RetrievalMetric class splits the input tensor by the indexes into a list of tensors and iterates over that list sequentially, which is very slow when the number of query groups is high.
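
For illustration, the pattern I mean looks roughly like the following; this is a simplified sketch, not the actual RetrievalMetric code, and the sizes are made up:

```python
import torch
from torchmetrics.functional.retrieval import retrieval_normalized_dcg

# sketch of the per-query-group evaluation pattern: `indexes` maps each
# prediction/target pair to its query group, as in the retrieval metrics
num_queries, list_size = 1000, 40
preds = torch.rand(num_queries * list_size)
target = torch.randint(2, (num_queries * list_size,))
indexes = torch.arange(num_queries).repeat_interleave(list_size)

# split into one tensor of positions per query group ...
groups = [torch.where(indexes == i)[0] for i in indexes.unique()]
# ... then iterate sequentially in Python; with thousands of query groups
# this Python-level loop dominates the runtime
scores = torch.stack([retrieval_normalized_dcg(preds[g], target[g]) for g in groups])
print(scores.mean())
```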

The TensorFlow Ranking implementation of the nDCG metric takes only about 50 ms to complete with the same inputs.

Borda commented 1 month ago

~15 times slower for large arrays

that is a huge difference

I believe it is not because of the torch.unique function but a fundamental design flaw of RetrievalMetric: the RetrievalMetric class splits the input tensor by the indexes into a list of tensors and iterates over that list sequentially, which is very slow when the number of query groups is high.

@hengdashi would you be interested in suggesting a more efficient solution and sending a PR?