Bad matches when running on GPU (related to the non_blocking parameter)

Hi, first off, thanks for your work and for releasing it in such a "nicely packaged" format!

While I managed to resolve my issue (described below), I figured it would be useful to document it in case others encounter it as well. In addition, if you have any further insight into why this might be happening (perhaps with my machine configuration specifically), that would be appreciated as well.

Encountered issue

I was getting strange/incorrect outputs when running LightGlue on GPU. Using the two images below with the match_pair function gives the following output (image: cup_bad_matches).

When running on CPU instead, I get the following output (image: cup_good_matches).

The code used for this minimal example is the following:

```
import matplotlib.pyplot as plt
import torch

from lightglue import LightGlue, SuperPoint, viz2d
from lightglue.utils import load_image, match_pair

torch.set_grad_enabled(False)

# load models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # or just "cpu" for the second example
extractor = SuperPoint(max_num_keypoints=2048).eval().to(device)
matcher = LightGlue(features="superpoint").eval().to(device)

print(f"Using device: {device}")

# load images
image0 = load_image("cup_image_0.jpg")
image1 = load_image("cup_image_1.jpg")

# extract features + correspondences
feats0, feats1, matches01 = match_pair(
    extractor, matcher, image0.to(device), image1.to(device), non_blocking=True
)
kpts0, kpts1, matches = feats0["keypoints"], feats1["keypoints"], matches01["matches"]
m_kpts0, m_kpts1 = kpts0[matches[..., 0]], kpts1[matches[..., 1]]

# visualize results
viz2d.plot_images([image0, image1])
viz2d.plot_matches(m_kpts0, m_kpts1, color="lime", lw=0.2)
viz2d.add_text(0, f'Stop after {matches01["stop"]} layers')
plt.show()
```

Possible solutions

I eventually figured out that this was caused by the batch_to_device function called by match_pair, or more specifically by its non_blocking=True argument. The three solutions I found are:

1. Not using match_pair (as is done e.g. in the demo notebook), and moving the outputs I wanted to use to the CPU "manually"
2. Setting non_blocking=False (the default)
3. Adding the following two lines after the call to match_pair (in the minimal example code above):
```
stream = torch.cuda.current_stream()
stream.synchronize()
```
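
For anyone who wants to go the "manual" route (solutions 1 and 2), here is a minimal sketch of what I mean by moving the outputs to the CPU myself. My understanding, which may be incomplete, is that a GPU-to-CPU copy with non_blocking=True can return before the data has actually arrived, so reading the result right away may give garbage. The helper name `to_cpu_blocking` is hypothetical and not part of LightGlue. It assumes the outputs are nested dicts/lists/tuples whose leaves have a `.to` method, as `torch.Tensor` does.

```python
from collections.abc import Mapping


def to_cpu_blocking(batch):
    """Recursively copy every tensor-like leaf of `batch` to the CPU.

    Hypothetical helper (not part of LightGlue). A leaf counts as
    tensor-like if it has a `.to` method, as torch.Tensor does.
    non_blocking=False asks for a copy that has finished by the time
    `.to` returns, so the values are safe to read immediately.
    """
    if hasattr(batch, "to"):
        return batch.to("cpu", non_blocking=False)
    if isinstance(batch, Mapping):
        return {key: to_cpu_blocking(value) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(to_cpu_blocking(item) for item in batch)
    # Anything else (ints, strings, None, ...) is returned unchanged.
    return batch
```

With this, one would call the extractor and matcher directly (as the demo notebook does) and then pass each output dict through `to_cpu_blocking` before indexing into it. The blocking copy is slightly slower than an asynchronous one, but it removes the need for an explicit stream synchronization.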

Input data

`cup_image_0.jpg`	`cup_image_1.jpg`

Environment info

Ubuntu 22.04.3 (WSL)
conda environment with Python 3.10, torch==2.0.1, torchvision==0.15.2
CUDA version 12.2, NVIDIA driver version 537.13
NVIDIA Quadro P620

cvg / LightGlue