cvlab-stonybrook / Scanpath_Prediction

Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning (CVPR2020)

Help with Sequence Score #24

Closed quangdaist01 closed 2 years ago

quangdaist01 commented 2 years ago

Hello, I am interested in your work, and I want to replicate the reported results first before performing further experiments (for a class project). metrics.py contains functions to compute sequence scores, but as mentioned in #3, some clustering work must be done first. I have read the Sequence Score algorithm, but I am not sure how to implement it. Could you provide some more material on computing the metric? Thank you for reading!

ouyangzhibo commented 2 years ago
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def scanpath2clusters(meanshift, scanpath):
    # Convert one scanpath (dict with 'X'/'Y' fixation lists) into a string of
    # MeanShift cluster labels, one label per fixation.
    string = []
    xs = scanpath['X']
    ys = scanpath['Y']
    for i in range(len(xs)):
        symbol = meanshift.predict([[xs[i], ys[i]]])[0]
        string.append(symbol)
    return string

def improved_rate(meanshift, scanpaths):
    # Bandwidth-selection criterion: count transitions between different clusters
    # (Nb) and within the same cluster (Nw) along each scanpath, normalized by
    # the number of clusters (Nc).
    Nc = len(meanshift.cluster_centers_)
    Nb, Nw = 0, 0
    for scanpath in scanpaths:
        string = scanpath2clusters(meanshift, scanpath)
        for i in range(len(string)-1):
            if string[i]==string[i+1]:
                Nw += 1
            else:
                Nb += 1
    return (Nb-Nw)/Nc

# 'scanpaths' is the list of ground-truth scanpaths (one per subject) on a single image.
xs, ys = [], []
for scanpath in scanpaths:
    xs += list(scanpath['X'])
    ys += list(scanpath['Y'])

# Stack all fixations into an (N, 2) array and estimate an initial bandwidth.
gt_gaze = np.concatenate((np.vstack(xs), np.vstack(ys)), axis=1)
bandwidth = estimate_bandwidth(gt_gaze)
# Try several scaled bandwidths and keep the one with the highest improved rate.
rates = []
factors = [0.25, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
for factor in factors:
    bd = bandwidth*factor
    ms = MeanShift(bandwidth=bd)
    ms.fit(gt_gaze)
    rate = improved_rate(ms, scanpaths)
    rates.append(rate)
rates = np.vstack(rates)

# Refit MeanShift with the best-scoring bandwidth.
best_bd = factors[np.argmax(rates)]*bandwidth
best_ms = MeanShift(bandwidth=best_bd)
best_ms.fit(gt_gaze)

# save best_ms for evaluation
# Convert every ground-truth scanpath into its cluster string.
gt_strings = []
for gt_scanpath in scanpaths:
    gt_string = scanpath2clusters(best_ms, gt_scanpath)
    gt_strings.append(gt_string)

Sequence score with interaction rate: https://www.cv-foundation.org/openaccess/content_iccv_2013/papers/Borji_Analysis_of_Scores_2013_ICCV_paper.pdf
Sequence score with improved interaction rate: https://www-users.cs.umn.edu/~qzhao/publications/pdf/jiang_tnnls16.pdf

In practice, I use the bandwidth b_estimated estimated by sklearn (can be found in the example), then I try b = b_estimated*scale_i, scale_i = 0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 1.8, and select the one with the highest improved interaction rate. Check out the example: https://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html#sphx-glr-auto-examples-cluster-plot-mean-shift-py

Input: sequences of fixations on one image

  1. You need the following functions (a sketch of the string comparison used in step d follows this list):
     a. all_fixations = scanpaths2fixations(sequences of fixations): just expand all fixations.
     b. clusters = meanshift(all_fixations, bandwidth): can be found in the example.
     c. strings = scanpaths2strings(sequences of fixations, clusters): can be found in the example.
     d. score = strings2score(strings): you already know how to do this for one subject, just do it for all subjects.
  2. For each b in b_estimated*scales: run functions b, c, and d to get a score, and select the b with the highest score.
  3. Save the clusters with the selected bandwidth b* and the ground-truth strings so you can evaluate a new scanpath easily using function c and string comparison algorithms.
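
The string comparison in step d is a pairwise similarity between cluster strings; the official implementation is in metrics.py of this repo. Below is only a minimal sketch, using a Needleman-Wunsch-style global alignment normalized by the longer string, with assumed match/mismatch/gap scores that may differ from the ones used in metrics.py:

import numpy as np

def nw_similarity(s1, s2, match=1, mismatch=-1, gap=-1):
    # Global alignment (Needleman-Wunsch) score between two cluster strings,
    # normalized by the length of the longer string.
    n, m = len(s1), len(s2)
    D = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        D[i, 0] = D[i - 1, 0] + gap
    for j in range(1, m + 1):
        D[0, j] = D[0, j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            D[i, j] = max(D[i - 1, j - 1] + sub,  # substitution
                          D[i - 1, j] + gap,      # gap in s2
                          D[i, j - 1] + gap)      # gap in s1
    return D[n, m] / max(n, m)
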
quangdaist01 commented 2 years ago

Thank you very much! Have a great day!

StoyanVenDimitrov commented 2 years ago

Hi, I tried to verify the human oracle sequence score you report in the paper (0.490), but got a much higher score of 0.678. I was able to reproduce your MultiMatch score, so the problem should not be in the data I use. I do the following:


def compute_clusters(gt_scanpaths):
    xs, ys = [], []
    for scanpath in gt_scanpaths:
        xs += list(scanpath['X'])
        ys += list(scanpath['Y'])

    gt_gaze = np.concatenate((np.vstack(xs), np.vstack(ys)), axis=1)
    bandwidth = estimate_bandwidth(gt_gaze)
    rates = []
    factors = [0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 1.8] #[0.25, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
    for factor in factors:
        # bandwidth=None lets sklearn's MeanShift estimate its own bandwidth.
        bd = bandwidth*factor if bandwidth > 0.0 else None
        ms = MeanShift(bandwidth=bd)
        ms.fit(gt_gaze)
        rate = improved_rate(ms, gt_scanpaths)
        rates.append(rate)
    rates = np.vstack(rates)

    best_bd = factors[np.argmax(rates)]*bandwidth if bandwidth > 0.0 else None
    best_ms = MeanShift(bandwidth=best_bd)
    best_ms.fit(gt_gaze)

    gt_strings = []
    subjects = []
    for gt_scanpath in gt_scanpaths:
        gt_string = scanpath2clusters(best_ms, gt_scanpath)
        gt_strings.append(gt_string)
        subjects.append(gt_scanpath['subject'])

    return best_ms, gt_strings, subjects

Could it be that you changed something after publishing the paper, so that 0.490 is not the score you get with the currently provided code? Or am I doing something wrong with the clusters? Thank you!

ouyangzhibo commented 2 years ago

Is it because of the clusters you computed for the sequence score? Can you verify by using the provided clusters?

StoyanVenDimitrov commented 1 year ago

Where can I find them? I don't see any clusters.npy as mentioned here, even in older commits.

ouyangzhibo commented 1 year ago

> Where can I find them? I don't see any clusters.npy as mentioned here, even in older commits.

Please find it at https://drive.google.com/file/d/1_NDKSb2JbqbDkL3RHh24MOhrroBjkIyK/view?usp=sharing. Note that it also contains target-absent fixation clusters.
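
If you load the file with NumPy you will need allow_pickle. A quick way to inspect its layout (assuming the file is a Python dict saved with np.save, which is not spelled out here, so check the keys after loading):

import numpy as np

# Assumed: a pickled dict mapping image/task identifiers to fitted clusters.
clusters = np.load('clusters.npy', allow_pickle=True).item()
print(len(clusters))
for key in list(clusters)[:5]:   # peek at a few entries to see the structure
    print(key, clusters[key])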

StoyanVenDimitrov commented 1 year ago

Thank you. I found two things. First, I only got the score of 0.490 for the human oracle when I did not skip the evaluation of a trajectory against itself, i.e., against the same subject, which I guess should not happen. Second, the clusters are really different from mine, and I don't understand why they look like this. E.g., for 'test-present-bottle-000000547875' I get the string [0, 3, 1, 1] for the subject-2 scanpath, while your string is [2, 13, 5, 0, 3, 3, 0, 0, 1]; but the scanpath for 000000547875.jpg, subject 2, in the COCO-Search18 test data is "X": [834.2, 817.3, 1181.0, 1329.5], "Y": [531.0, 180.6, 160.8, 264.4]. The duration list T, however, has 9 elements. How do you then get a string of length 9?

ouyangzhibo commented 1 year ago
  1. Yes, you shouldn't compare a scanpath against itself.
  2. Thanks for bringing this up. It turned out that in the raw fixation test.json file we mistakenly removed the fixations after the target bounding box was fixated for the first time (this was done for the sake of training, though). I've updated the test.json file, so it is correct that there are 9 fixations in total for 'test-present-bottle-000000547875'.
StoyanVenDimitrov commented 1 year ago

The only way I got your reported result of 0.490 was with the old test.json file, which you indicated was corrupted, and when allowing a scanpath to be compared against itself. With the new file and your clusters I get 0.527 when allowing a scanpath to be compared against itself, which is wrong, and 0.476 if I don't. Neither of these matches the reported score. Apart from that, the clusters I compute with the above script differ from the ones you provide. The MultiMatch score using the new test.json is now only roughly the same as the one you reported: [0.92444455 0.7370559 0.89802225 0.921154].

Can you provide an evaluation script with which we can reproduce your scores and be sure we are doing everything right when using the new data? Thank you!

ouyangzhibo commented 1 year ago

The original json file is not corrupted, and we used it to compute the human consistency. We removed the fixations after the viewer first fixated on the target in order to implement a manual stopping criterion (i.e., stop searching once a fixation hits the target). For the data release, we wanted to also include the fixations after hitting the target, which might be interesting for other researchers.

I think you are doing the right thing. In the original paper, we included the cases of comparing a scanpath against itself, which is wrong, and this led to the human consistency of 0.490 in sequence score. Thank you so much for pointing that out!
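
For reference, this is roughly what the corrected per-image averaging looks like once self-comparisons are excluded; nw_similarity stands for whatever string similarity is used (e.g., the sketch earlier in this thread or the corresponding function in metrics.py):

def human_consistency(gt_strings):
    # Average each subject's string against every *other* subject's string on
    # the same image; a string is never compared against itself.
    scores = []
    for i in range(len(gt_strings)):
        for j in range(len(gt_strings)):
            if i == j:
                continue  # skip comparing a scanpath against itself
            scores.append(nw_similarity(gt_strings[i], gt_strings[j]))
    return sum(scores) / len(scores)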

StoyanVenDimitrov commented 1 year ago

Thank you, but I still cannot reproduce your clustering. E.g., for subject 2 on image 000000547875.jpg the string I compute is [3, 19, 7, 0, 10, 4, 0, 2, 1] while the provided one is [2, 13, 5, 0, 3, 3, 0, 0, 1]; my clustering apparently assigns different clusters where yours assigns the same one.

Can you verify that the script I posted above does what you did to get the clusters, including the list of factors [0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 1.8]?

ouyangzhibo commented 1 year ago

> Can you verify that the script I posted above does what you did to get the clusters, including the list of factors [0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 1.8]?

Yes, but please note that (a) we used the new .json file, which includes the fixations after fixating on the target, to do the clustering; (b) as you can see in the provided clusters.npy, target-absent fixations are also included when performing the clustering.

StoyanVenDimitrov commented 1 year ago

Thank you. I also use the new .json, so the string lengths are the same as in the pre-computed clusters. But there are still some small differences in the obtained strings. In addition, you also compute strings for the subjects with "fixOnTarget": false and "correct": 0, which I excluded from evaluation. Also, why should computing clusters for target-absent fixations make any difference? The clusters are computed per image, so they are independent of each other, right?