DengpanFu / LUPerson

Unsupervised Pre-training for Person Re-identification (LUPerson)

model cannot discriminate two images of very different people #28

Open cookieclicker123 opened 2 months ago

cookieclicker123 commented 2 months ago

Hello, I'm using the ResNet50 MGN MSMT17 fine-tuned model (I'm not sure what pre-training it has undergone, i.e. whether it's LUPerson with depth 50, 101, or 152).
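To at least pin down the backbone depth, the checkpoint itself can be inspected. This is just a sketch: it assumes fastreid's usual ResNet key naming (e.g. `backbone.layer3.5.conv1.weight`) and that the weights sit under a `"model"` key; the path is a placeholder.

```python
import torch

# Infer the backbone depth from the checkpoint itself.
# Assumes fastreid-style key naming ("backbone.layer3.<block>.conv1.weight")
# and weights nested under a "model" key; the path is hypothetical.
ckpt = torch.load("path/to/mgn_msmt17.pth", map_location="cpu")
state = ckpt.get("model", ckpt)
blocks = {k.split(".")[2] for k in state if k.startswith("backbone.layer3.")}
# layer3 has 6 blocks for ResNet50, 23 for ResNet101, 36 for ResNet152
print(f"layer3 blocks: {len(blocks)}")
```

If the keys look different (MGN rearranges the backbone into branches), printing a handful of them usually makes the depth obvious anyway.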

And here is my test.py to test its performance:

```python
import argparse
import torch
from PIL import Image
from fastreid.config import get_cfg
from fastreid.engine import DefaultPredictor
from torchvision import transforms
import cv2
import numpy as np

parser = argparse.ArgumentParser(description="Reid feature extractor CLI")
parser.add_argument(
    "--model",
    required=True,
    help="Path to the fastreid model file",
)
parser.add_argument(
    "--image1",
    required=True,
    help="Path to the cropped file containing the first person's image",
)
parser.add_argument(
    "--image2",
    required=True,
    help="Path to the cropped file containing the second person's image",
)

args = parser.parse_args()

# Load configuration and set device to CPU or MPS/GPU as required
cfg = get_cfg()
cfg.merge_from_file("./bagtricks_R50.yml")
cfg.MODEL.WEIGHTS = args.model
cfg.MODEL.DEVICE = "cpu"  # Use "mps" for Apple M1/M2 or "cuda" for NVIDIA GPUs

predictor = DefaultPredictor(cfg)
predictor.model.eval()

# Load and preprocess the image
person1_image = args.image1
person2_image = args.image2


def process_cv2(file):
    image = cv2.imread(file)
    image = cv2.resize(image, (256, 128))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = image.astype(np.float32) / 255.0
    image = torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        vector = predictor(image)
    vector = torch.nn.functional.normalize(vector, p=2, dim=1).squeeze().cpu().numpy()
    return vector


person1 = process_cv2(person1_image)
person2 = process_cv2(person2_image)

similarity_score = np.dot(person1, person2) / (np.linalg.norm(person1) * np.linalg.norm(person2))
print(f"Similarity score opencv: {similarity_score}")


def process_pil(file):
    image = Image.open(file).convert("RGB")
    transform = transforms.Compose(
        [
            transforms.Resize((256, 128)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )
    image = transform(image).unsqueeze(0)
    with torch.no_grad():
        vector = predictor(image)
    vector = torch.nn.functional.normalize(vector, p=2, dim=1).squeeze().cpu().numpy()
    return vector


person1 = process_pil(person1_image)
person2 = process_pil(person2_image)

similarity_score = np.dot(person1, person2) / (np.linalg.norm(person1) * np.linalg.norm(person2))
print(f"Similarity score PIL: {similarity_score}")
```

And my results:

```
warnings.warn(msg, RuntimeWarning)
Similarity score opencv: 1.0000001192092896
Similarity score PIL: 0.9999998211860657
```

It almost seems like the model is simply classifying "person vs. not person" rather than discriminating between identities within the person class.
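One thing I can't rule out is double normalization: as far as I can tell, fastreid's Baseline meta-architecture normalizes internally using `cfg.MODEL.PIXEL_MEAN` / `cfg.MODEL.PIXEL_STD`, which default to 0-255-scale values (~[123.675, 116.28, 103.53]). If `DefaultPredictor` expects raw 0-255 RGB pixels, then my [0, 1]-scaled (and, in the PIL path, ImageNet-normalized) inputs would all end up nearly identical after the internal mean subtraction, which would explain every pair scoring ~1.0. A minimal preprocessing sketch without the extra scaling, assuming those defaults (note that `cv2.resize` takes (width, height)):

```python
import cv2
import numpy as np
import torch

def preprocess_raw(file):
    # Keep raw 0-255 pixel values: the Baseline meta-architecture appears to
    # subtract cfg.MODEL.PIXEL_MEAN (~[123.675, 116.28, 103.53]) and divide by
    # cfg.MODEL.PIXEL_STD internally, so no /255 and no ImageNet normalize here.
    image = cv2.imread(file)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # cv2.resize takes (width, height): 128 wide x 256 tall, matching SIZE_TEST [256, 128]
    image = cv2.resize(image, (128, 256), interpolation=cv2.INTER_CUBIC)
    return torch.from_numpy(image.astype(np.float32)).permute(2, 0, 1).unsqueeze(0)
```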

These are the two configs I'm using, in case they're helpful for context:

bagtricks_R50.yml:

```yaml
_BASE_: ./Base-bagtricks.yml

DATASETS:
  NAMES: ("MSMT17",)
  TESTS: ("MSMT17",)

OUTPUT_DIR: logs/MSMT17/bagtricks_R50
```

Base-bagtricks.yml:

```yaml
MODEL:
  META_ARCHITECTURE: Baseline

  BACKBONE:
    NAME: build_resnet_backbone
    NORM: BN
    DEPTH: 50x
    LAST_STRIDE: 1
    FEAT_DIM: 2048
    WITH_IBN: False
    PRETRAIN: True

  HEADS:
    NAME: EmbeddingHead
    NORM: BN
    WITH_BNNECK: True
    POOL_LAYER: GlobalAvgPool
    NECK_FEAT: before
    CLS_LAYER: Linear

  LOSSES:
    NAME: ("CrossEntropyLoss", "TripletLoss",)

    CE:
      EPSILON: 0.1
      SCALE: 1.

    TRI:
      MARGIN: 0.3
      HARD_MINING: True
      NORM_FEAT: False
      SCALE: 1.

INPUT:
  SIZE_TRAIN: [256, 128]
  SIZE_TEST: [256, 128]

  REA:
    ENABLED: True
    PROB: 0.5

  FLIP:
    ENABLED: True

  PADDING:
    ENABLED: True

DATALOADER:
  SAMPLER_TRAIN: NaiveIdentitySampler
  NUM_INSTANCE: 4
  NUM_WORKERS: 8

SOLVER:
  AMP:
    ENABLED: True
  OPT: Adam
  MAX_EPOCH: 120
  BASE_LR: 0.00035
  WEIGHT_DECAY: 0.0005
  WEIGHT_DECAY_NORM: 0.0005
  IMS_PER_BATCH: 64

  SCHED: MultiStepLR
  STEPS: [40, 90]
  GAMMA: 0.1

  WARMUP_FACTOR: 0.1
  WARMUP_ITERS: 2000

  CHECKPOINT_PERIOD: 30

TEST:
  EVAL_PERIOD: 30
  IMS_PER_BATCH: 128

CUDNN_BENCHMARK: True
```
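For what it's worth, the merged config can be printed to confirm what the model expects at inference time (a quick sanity check using the same config file as above):

```python
from fastreid.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("./bagtricks_R50.yml")

# Input size and the normalization the model applies internally
print("SIZE_TEST:", cfg.INPUT.SIZE_TEST)    # [height, width]
print("PIXEL_MEAN:", cfg.MODEL.PIXEL_MEAN)  # 0-255 scale by default
print("PIXEL_STD:", cfg.MODEL.PIXEL_STD)
```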

Can anyone clear up whether this is expected and the model's ability to generalise is simply poor? I doubt that, given it achieved such strong results on a large ReID dataset like MSMT17.

More likely I'm doing something wrong at inference, and I'd appreciate help figuring out what, so that these scores come out much lower. The model must have learned to separate the identities in MSMT17 to achieve those benchmark results, and even torchreid's OSNet, an objectively inferior model, attains 73%, so something must be going wrong on my end. I appreciate anyone who can shed light on this, thanks.
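Edit: one sanity check I plan to try, reusing `predictor` and the L2-normalized `person1` from the script above: embed pure noise the same way as a real crop. If even noise scores ~1.0 against a person, every input is collapsing to nearly the same embedding, which would point at preprocessing rather than the weights.

```python
import numpy as np
import torch

# Embed pure noise exactly like a real crop (same shape/range as my [0, 1] inputs).
noise = torch.rand(1, 3, 256, 128)
with torch.no_grad():
    noise_vec = predictor(noise)
noise_vec = torch.nn.functional.normalize(noise_vec, p=2, dim=1).squeeze().cpu().numpy()

# person1 is already L2-normalized, so the dot product is the cosine similarity.
print(f"Similarity vs noise: {np.dot(person1, noise_vec)}")
```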

sickMaro commented 2 months ago

Hi, have you solved the problem?

cookieclicker123 commented 2 months ago

> Hi, have you solved the problem?

No, I haven't, unfortunately. I have the same issues with SOLIDER, yet I get great performance with OSNet and with a vision transformer pre-trained on ImageNet-21k and fine-tuned on ImageNet. It seems strange that LUPerson and SOLIDER, which were designed for ReID, would perform so poorly in practice while attaining very high metrics on the benchmarks. Poor generalisation would be the simple answer, but I doubt that's it. Hopefully someone knows.