FlappyPeggy / DMAD

Official code for "Diversity-Measurable Anomaly Detection", CVPR 2023 by Wenrui Liu, Hong Chang, Bingpeng Ma, Shiguang Shan, Xilin Chen.

Question on training on a different dataset #10

Closed: kennyvoo closed this issue 9 months ago

kennyvoo commented 10 months ago

I attempted to train a model on the IITB-Corridor dataset, following an approach similar to the one used for the Avenue dataset. However, the training results are suboptimal, reaching only around 61-62% accuracy. Could you give me some advice on how to improve it? Are there any parameters I need to pay attention to? Could the cause be the larger image size (i.e., distortion from resizing directly to 256)? I'm currently running another experiment (cropping one region of the image) to verify this.

  1. For training, I followed the same training configuration as the Avenue dataset, changing only the background image at L384 of final_future_prediction_avenue.py. The line below reads the grayscale background, replicates it to 3 channels, and normalizes it to [-1, 1]:

            self.bkg = nn.Parameter(torch.from_numpy(np.repeat(cv2.imread('./bkg_iitb.jpg', 0).copy()[:, :, None], 3, axis=-1).transpose((2, 0, 1)).astype(np.float32)/127.5 - 1))
  2. For evaluation, I removed the following lines (they dump per-frame difference images and collect the offset maps used for post-processing):

            dif = ((imgs[:, -3:]-outputs[0]).abs().squeeze(0).cpu().numpy().transpose((1,2,0))*127.5).astype(np.uint8)
            cv2.imwrite("./exp/"+str(k)+'_0.jpg', dif)
            grad_list_x.append(outputs[6][:,:,:,1].cpu().numpy().astype(np.float16)) # 8bit is ok
    
    # if not ifTraining:
    #     exp_offset = np.concatenate(grad_list_x, axis=0)
    #     exp_offset = exp_offset / np.abs(exp_offset).max()
    #     np.save("./exp/offset8.npy", (exp_offset * 127).astype(np.int8))
    #     print("please use post-processing to remove static novel instances and evaluate the final auc")
    #     return

    and added the following lines, following the ped2 evaluation code:

    def conf_avg(x, size=11, n_conf=5):
        # Confidence-weighted moving average: in each window of `size` frames,
        # keep only the n_conf scores closest to the window mean and average
        # them with a center-heavy weight, suppressing outlier frames.
        a = x.copy()
        b = []
        weight = np.array([1, 1, 1, 1, 1.2, 1.6, 1.2, 1, 1, 1, 1])  # center-heavy, length == size

        for i in range(x.shape[0] - size + 1):
            a_ = a[i:i + size].copy()
            u = a_.mean()
            dif = abs(a_ - u)
            sot = np.argsort(dif)[:n_conf]  # the n_conf scores nearest the window mean
            mask = np.zeros_like(dif)
            mask[sot] = 1
            weight_ = weight * mask
            b.append(np.sum(a_ * weight_) / weight_.sum())
        for _ in range(size // 2):  # pad both ends back to the input length
            b.append(b[-1])
            b.insert(0, 1)
        return b
    
    anomaly_list1 = conf_avg(np.array(anomaly_list1))
    anomaly_list2 = conf_avg(np.array(anomaly_list2))
    anomaly_list3 = conf_avg(np.array(anomaly_list3))
kennyvoo commented 10 months ago

I used a subset of the train and test videos to verify my hypothesis (the images are too large, so too much information is lost when resized). For another run, I cropped the main region (h=600, w=800) for training and testing, and the accuracy did increase to 83+. Besides this method, are there other ways to improve the accuracy? (A rough sketch of the cropping is shown after the list below.)

  1. Setting a larger input size (more than 256)?
  2. Also, how should msize (the number of memory slots) be set?
  3. Increasing the frame interval (from 4 to 8)?
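
For reference, the cropping described above looks roughly like this (a minimal sketch; the crop origin and exact frame layout are assumptions, not code from this repo):

    import cv2
    import numpy as np

    # Hypothetical crop-then-resize preprocessing: keep only the main activity
    # region (h=600, w=800) so the 256x256 input loses less detail. The crop
    # origin (y0, x0) is a placeholder; adjust it to the actual scene.
    def preprocess(frame, y0=0, x0=0, h=600, w=800, out_size=256):
        roi = frame[y0:y0 + h, x0:x0 + w]
        roi = cv2.resize(roi, (out_size, out_size))
        return roi.astype(np.float32) / 127.5 - 1  # same [-1, 1] scaling as the repo code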
FlappyPeggy commented 10 months ago
  1. Large-size inputs may lead to unstable prototypical learning and suppress its representation capacity. This may be caused by VQ-VAE (see the observation and solution in FSQ, https://arxiv.org/abs/2309.15505, sketched after this list).
  2. The receptive field size and the max_offset limit in $\psi$ may not be enough, making it hard to generate large deformations.
  3. The background / lighting condition is not fixed. Try the background-selection training code used for ShanghaiTech.
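
A minimal sketch of the FSQ idea referenced in point 1 (adapted from the linked paper, not code from this repo): each latent channel is bounded and rounded onto a small fixed set of levels, which sidesteps the codebook instabilities of VQ-VAE.

    import torch

    # FSQ-style quantization sketch: bound each channel, then round it onto
    # `levels` uniformly spaced values in [-1, 1].
    def fsq(z: torch.Tensor, levels: int = 8) -> torch.Tensor:
        z = torch.tanh(z)                              # bound to (-1, 1)
        z_q = torch.round((z + 1) / 2 * (levels - 1))  # snap to integer levels
        z_q = z_q / (levels - 1) * 2 - 1               # map back to [-1, 1]
        return z + (z_q - z).detach()                  # straight-through gradient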
kennyvoo commented 10 months ago

Thank you for your prompt response. I adjusted the code based on the ShanghaiTech dataset, where each video has its own background. However, the performance is not as good as the version modified from the Avenue dataset.

Regarding your second piece of advice, I am seeking guidance on choosing appropriate values. Specifically, what receptive field size and offset would be suitable in a scene where a person occupies approximately 1/4 of a (256, 256) image? Should the receptive field cover the entire person?

FlappyPeggy commented 10 months ago
  1. The suggestion about ShanghaiTech means using more memory slots, e.g. 1000; you could also try the extra loss used in the ShanghaiTech configuration.
  2. Large-scale images can make it hard for the existing grad_loss to constrain smoothness. This is why I use a small $maxoffset$.
    • There is no single recommended receptive field size, but $\psi$ should be deeper as the image size increases.
    • It is recommended that $maxoffset_1 > \mathbb{E}[person\ movement]$ (which depends on the maximum movement of the same person between two frames) and $maxoffset_2 > \max[\Vert person\ movement\Vert] - \mathbb{E}[person\ movement]$ (which depends on the maximum deformation between a standard/reference person and its diverse actions, e.g. hands or feet).
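
As a rough, hypothetical illustration of these bounds (the displacement values and the 1.2 safety margin below are assumptions, e.g. measurements from a simple tracker over the training videos):

    import numpy as np

    # Hypothetical per-person pixel displacements between consecutive frames,
    # e.g. collected with a simple tracker over the training videos.
    displacements = np.array([2.0, 3.5, 4.1, 6.8, 1.2, 5.3])

    margin = 1.2  # assumed safety factor
    max_offset_1 = margin * displacements.mean()                          # > E[person movement]
    max_offset_2 = margin * (displacements.max() - displacements.mean())  # > max - E[person movement]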