Buzz-Beater / EgoTaskQA

Code for NeurIPS 2022 Datasets and Benchmarks paper - EgoTaskQA: Understanding Human Tasks in Egocentric Videos.
https://sites.google.com/view/egotaskqa

PSAC Indirect result is not reproduced. #3

Open ShramanPramanick opened 1 year ago

ShramanPramanick commented 1 year ago

Hi, I evaluated PSAC in the indirect setting. I trained the model for 70 epochs with a batch size of 1024; all other parameters are the same as in train_psac.py. However, my results on the test set are not close to the reported score (42.25 overall).

I am attaching my training and evaluation logs here. It would be very helpful if you could kindly look into this and provide the best model that produced the reported SOTA numbers.

Buzz-Beater commented 1 year ago

Hi, I have fixed the broken website link in README.md, and there should now be an additional checkpoints folder in the data download Google Drive link (w/o the ClipBERT checkpoints). Let me know if there are still other questions.

ShramanPramanick commented 1 year ago

Thanks for providing the checkpoints. I tried to evaluate the PSAC indirect checkpoint (MODELS/splits/ori/indirect/psac_model_44000.tar) using this repository, and I got the following error:

RuntimeError: Error(s) in loading state_dict for FrameQAModel: size mismatch for ques_encoder.wordemb.weight: copying a param with shape torch.Size([257, 300]) from checkpoint, the shape in current model is torch.Size([283, 300]).

I guess you used different embeddings. Could you please help me resolve this? Thanks.
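For anyone hitting the same error: a size mismatch on a word-embedding weight usually means the checkpoint was trained with a different vocabulary than the one the current split builds. A minimal sketch of how to diagnose this before calling `load_state_dict` (the `make_encoder` helper and the 257/283 vocab sizes here are illustrative stand-ins, not the actual FrameQAModel code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the question encoder's word embedding:
# its first dimension depends on the vocabulary built from the QA split.
def make_encoder(vocab_size, emb_dim=300):
    return nn.Embedding(vocab_size, emb_dim)

# Checkpoint saved from a model trained on the original split (257 words)...
ckpt_model = make_encoder(257)
checkpoint = {"wordemb.weight": ckpt_model.weight.detach().clone()}

# ...being loaded into a model built from a newer split (283 words).
current_model = make_encoder(283)
current_state = {"wordemb.weight": current_model.weight}

# Compare tensor shapes key by key to pinpoint the mismatch.
mismatches = []
for name, saved in checkpoint.items():
    if name in current_state and saved.shape != current_state[name].shape:
        mismatches.append(
            (name, tuple(saved.shape), tuple(current_state[name].shape))
        )

for name, saved_shape, model_shape in mismatches:
    print(f"size mismatch for {name}: checkpoint {saved_shape} vs model {model_shape}")
```

In this thread the fix is not to pad or truncate the embedding but to rebuild the model from the split the checkpoint was trained on, so the vocabularies match.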

Buzz-Beater commented 1 year ago

Thanks for spotting the issue, and sorry for the delay. We found a random-seed problem that caused the current version (post-rebuttal refinement) of the indirect QAs to differ slightly from the original split used at submission. We have uploaded the original indirect split to the Google Drive under data/previous_version. This version should match the provided checkpoints.

Meanwhile, to make sure your experiments are correct, we re-ran all PSAC experiments on the provided version; the results are as follows:

| all | open | binary | action | object | state | change |
|-------|-------|-------|-------|------|-------|-------|
| 32.72 | 15.31 | 57.75 | 29.00 | 9.22 | 29.01 | 49.65 |

| descriptive | predictive | counterfactual | explanatory | world | intent | multiagent |
|-------|-------|-------|-------|-------|-------|-------|
| 32.03 | 25.63 | 35.90 | 33.78 | 36.30 | 25.22 | 21.61 |

We will update the other models' performances on the post-rebuttal version accordingly after the experiments finish. Feel free to use the pre-review version with the original checkpoints, or your own reproduced results on the post-rebuttal version if the scores listed above are close to what you obtained. Hope this helps, and feel free to email me at baoxiongjia@g.ucla.edu for a faster response.

ShramanPramanick commented 1 year ago

Thanks for your detailed response, and sorry for the delayed reply. The clarification about the indirect split certainly helps. I have retrained PSAC on the "original indirect split" from data/previous_version, and the best overall accuracy is ~38%, which is still significantly lower than the reported score (42.25%). Moreover, having two different versions of the indirect split may confuse future researchers using this dataset. It would therefore be helpful to clearly document the existence of both indirect split versions and report the performance of all baselines on each (at least in the repository, if modifying the paper is not possible).