Closed: Rotem-BZ closed this issue 1 year ago
Hi @Rotem-BZ,
True, good catch! Here it should also be `self.meaning_distance_fn`
https://github.com/facebookresearch/EGG/blob/72549d8f0b033b7905cc0a93b045ddfcbc9b4e07/egg/core/language_analysis.py#L185
and also in line 190. Do you want to open a PR?
Sure. I also have an idea for a useful addition: a parameter `N_interactions` that limits the number of interactions used to calculate the Spearman correlation (using the entire train/validation set is almost always infeasible). Should I add it within the same PR?
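To make the idea concrete, here is a rough sketch of the subsampling I have in mind (the function and variable names are illustrative, not EGG's actual API):

```python
import numpy as np

def subsample_interactions(sender_input, messages, n_interactions):
    """Keep only the last n_interactions rows before computing topsim.
    Illustrative helper, not part of EGG."""
    if n_interactions is None or len(messages) <= n_interactions:
        return sender_input, messages
    return sender_input[-n_interactions:], messages[-n_interactions:]

# With N_interactions=500, pdist sees 500 rows instead of 10,000,
# i.e. ~125k pairwise distance calls instead of ~50 million.
inputs = np.random.rand(10_000, 32)
msgs = np.random.randint(0, 5, size=(10_000, 6))
small_in, small_msgs = subsample_interactions(inputs, msgs, 500)
print(small_in.shape, small_msgs.shape)  # (500, 32) (500, 6)
```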
I've seen it mostly used as a metric for the validation/test set. How would you select the epochs, though? We only expose the epoch number; would you just randomly select some? I think the only real problem is storage: keeping all the train data stored often leads to CPU OOM in my experience.
My idea is to take the last N interactions from each epoch / validation set and calculate topsim on those. Also, I agree that holding the entire train set in memory might be a problem, but isn't this already the case with the Interaction class?
Even when used on a relatively small validation set with just a few hundred samples, if a computationally demanding distance metric is chosen (e.g. comparing images with a deep image encoder), the `scipy.spatial.distance.pdist` call takes too long. To get a sense of the topsim progression throughout training, it may be sufficient to use fewer interactions each time, so that's the functionality I want to add.
Yes, but assuming you set `LoggingStrategy` right, you don't have a problem with the train set: https://github.com/facebookresearch/EGG/blob/main/egg/core/interaction.py#L37-L46. In general, this and similar things are left for the user to decide.
Do you have some numbers on how slow the computation is? I've computed topsim using cosine similarity for a few thousand image embeddings (extracted from a deep net) and it was reasonably fast.
Oh, you're right, so computing topsim over the train set will only be possible when the logging strategy allows it.
SciPy's `pdist` is fast with its predefined metrics (such as cosine); the problem arises when you pass a user-defined function. Here is an example with just 100 samples and a metric that takes 0.01 seconds to compute:

```python
import time

import numpy as np
import scipy.spatial.distance

def metric(x1, x2):
    time.sleep(0.01)  # simulate an expensive user-defined distance
    return 1.0

X = np.random.rand(100, 1)

t0 = time.perf_counter()
_ = scipy.spatial.distance.pdist(X, metric)
t1 = time.perf_counter()
print(f"pdist computed in {t1 - t0} sec")

t0 = time.perf_counter()
_ = scipy.spatial.distance.pdist(X, 'euclidean')
t1 = time.perf_counter()
print(f"euclidean pdist computed in {t1 - t0} sec")
```

The results are:

```
pdist computed in 82.5547 sec
euclidean pdist computed in 0.002850 sec
```
Perhaps a better solution here is to have a `transform` function that is applied to `logs.sender_input` before calling `self.compute_topsim`. This way we avoid calling `pdist` with a user-defined function.
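A sketch of what I mean (the normalization below is just a stand-in for a hypothetical per-sample encoder): transform each input once, then let `pdist` use a fast built-in metric.

```python
import numpy as np
from scipy.spatial.distance import pdist

def transform(x):
    # Stand-in for a per-sample encoder (e.g. a deep image encoder):
    # called once per row, so O(n) encoder calls instead of O(n^2)
    # Python callbacks inside pdist.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-9)

X = np.random.rand(100, 64)
features = transform(X)
dists = pdist(features, "cosine")  # fast C implementation over n*(n-1)/2 pairs
print(dists.shape)  # (4950,)
```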
I'm not sure what you would use `transform` for. Additionally, it seems `pdist`'s default distance is the Euclidean one.
In general, I think the tradeoff is between a clean topsim callback interface and leaving it to the user to do things like inheriting from the Topsim class and implementing custom behavior. What do you think?
Agreed. I will submit a PR with just the initial implementation fix 👍
Closed with #251. Feel free to open a new issue for other things to add, and thanks for fixing this!
https://github.com/facebookresearch/EGG/blob/72549d8f0b033b7905cc0a93b045ddfcbc9b4e07/egg/core/language_analysis.py#L209
I believe the arguments `self.sender_input_distance_fn` and `self.message_distance_fn` should be passed to the `compute_topsim` function in the attached line 209. Currently, when used as a Callback (by instantiating the class), this code uses the default distance metrics instead of the ones given in `__init__`.