Closed: Rotem-BZ closed this issue 1 year ago
Hi @Rotem-BZ,
True, good catch! Here it should also be `self.meaning_distance_fn`
https://github.com/facebookresearch/EGG/blob/72549d8f0b033b7905cc0a93b045ddfcbc9b4e07/egg/core/language_analysis.py#L185
and also in line 190. Do you want to open a PR?
Sure. I also have an idea for a useful addition: a parameter `N_interactions` that limits the number of interactions used to calculate the Spearman correlation (using the entire train/validation set is almost always infeasible). Should I add it within the same PR?
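To make the idea concrete, here is a rough sketch of the subsampling I have in mind (the function and variable names are illustrative, not EGG's actual API):

```python
import numpy as np

def subsample_interactions(sender_input, messages, n_interactions):
    """Keep only the last n_interactions rows before computing topsim.
    Illustrative helper, not part of EGG."""
    if n_interactions is None or len(messages) <= n_interactions:
        return sender_input, messages
    return sender_input[-n_interactions:], messages[-n_interactions:]

# With N_interactions=500, pdist sees 500 rows instead of 10,000,
# i.e. ~125k pairwise distance calls instead of ~50 million.
inputs = np.random.rand(10_000, 32)
msgs = np.random.randint(0, 5, size=(10_000, 6))
small_in, small_msgs = subsample_interactions(inputs, msgs, 500)
print(small_in.shape, small_msgs.shape)  # (500, 32) (500, 6)
```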
I've seen it mostly used as a metric for the validation/test set. How would you select the epochs, though? We only expose the epoch number; would you just randomly select some? I think the only real problem is storage: keeping all the train data stored often leads to CPU OOM in my experience.
My idea is to take the last N interactions from each epoch / validation set and calculate topsim on those. Also, I agree that holding the entire train set in memory might be a problem, but isn't this already the case with the Interaction class?
Even when used on a relatively small validation set with just a few hundred samples, if a computationally demanding distance metric is chosen (e.g. comparing images with a deep image encoder), the `scipy.spatial.distance.pdist` call takes too long. To get a sense of the topsim progression throughout training, it may be sufficient to use fewer interactions each time, so that's the functionality I want to add.
Yes, but assuming you set `LoggingStrategy` right, you don't have a problem with the train set: https://github.com/facebookresearch/EGG/blob/main/egg/core/interaction.py#L37-L46. In general, this and similar things are left for the user to decide.
Do you have some numbers on how slow the computation is? I've computed topsim using cosine similarity for a few thousand image embeddings (extracted from a deep net) and it was reasonably fast.
Oh, you're right, so computing topsim over the train set will only be possible when the logging strategy allows it.
SciPy's `pdist` is fast with its predefined metrics (such as cosine); the problem arises when you pass a user-defined function. Here is an example with just 100 samples and a metric that takes 0.01 seconds to compute:

```python
import time

import numpy as np
import scipy.spatial.distance

def metric(x1, x2):
    time.sleep(0.01)  # simulate an expensive user-defined distance
    return 1.0

X = np.random.rand(100, 1)

t0 = time.perf_counter()
_ = scipy.spatial.distance.pdist(X, metric)
t1 = time.perf_counter()
print(f"pdist computed in {t1 - t0} sec")

t0 = time.perf_counter()
_ = scipy.spatial.distance.pdist(X, 'euclidean')
t1 = time.perf_counter()
print(f"euclidean pdist computed in {t1 - t0} sec")
```

The results are:

```
pdist computed in 82.5547 sec
euclidean pdist computed in 0.002850 sec
```
Perhaps a better solution here is to have a `transform` function that is applied to `logs.sender_input` before calling `self.compute_topsim`. This way we avoid calling `pdist` with a user-defined function.
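A sketch of what I mean (the normalization below is just a stand-in for a hypothetical per-sample encoder): transform each input once, then let `pdist` use a fast built-in metric.

```python
import numpy as np
from scipy.spatial.distance import pdist

def transform(x):
    # Stand-in for a per-sample encoder (e.g. a deep image encoder):
    # called once per row, so O(n) encoder calls instead of O(n^2)
    # Python callbacks inside pdist.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-9)

X = np.random.rand(100, 64)
features = transform(X)
dists = pdist(features, "cosine")  # fast C implementation over n*(n-1)/2 pairs
print(dists.shape)  # (4950,)
```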
I'm not sure what you would use `transform` for. Additionally, it seems `pdist`'s default distance is the Euclidean one.
In general, I think the tradeoff is between a clean topsim callback interface and leaving it to the user to do things like inheriting from the Topsim class and implementing custom behavior. What do you think?
Agreed. I will submit a PR with just the initial implementation fix 👍
Closed with #251. Feel free to open a new issue for other things to add, and thanks for fixing this!
https://github.com/facebookresearch/EGG/blob/72549d8f0b033b7905cc0a93b045ddfcbc9b4e07/egg/core/language_analysis.py#L209
I believe the arguments `self.sender_input_distance_fn` and `self.message_distance_fn` should be passed to the `compute_topsim` function in the attached line 209. Currently, when used as a Callback (by instantiating the class), this code uses the default distance metrics instead of the ones given in `__init__`.