Sid2697 / EgoProceL-egocentric-procedure-learning

Code implementation for our ECCV, 2022 paper titled "My View is the Best View: Procedure Learning from Egocentric Videos"

questions about codes and data #4

Open ziyuwwang opened 1 year ago

ziyuwwang commented 1 year ago

Hi Sid, thanks for your wonderful work! When running the code, I ran into a few issues:

  1. Some annotations are missing. For example, "Head_9.csv" in pc_assembly; a similar situation also appears in MECCANO.
  2. For CrossTask and ProceL, some videos cannot be downloaded because their URLs are no longer valid.
  3. The function "generate_unique_video_steps" in "RepLearn/TCC/utils.py" seems incorrect; running it raises a tensor dimension mismatch error.

Can you check these issues and provide the complete datasets so that I can follow your work smoothly? Thanks a million!

Sid2697 commented 1 year ago

Hello @ziyuwwang,

Thanks for your interest in our work! Also, thanks for your patience, I was on leave for the past 3 weeks.

Here are the answers to your questions:

  1. Thanks for bringing this to my attention. I have added the missing file for pc_assembly; let me know if you're still unable to access it. For MECCANO, however, there were 2-3 videos in which the subject was unable to follow the provided steps and could not finish assembling the toy, so we did not annotate those exceptional videos.
  2. Can you please share some URLs that didn't work for you?
  3. Can you please share the exact error that you get along with the config that you were using? It will help me reproduce the error and debug.

Let me know if there are any other concerns!

ziyuwwang commented 1 year ago

Thanks for your response! Apart from the previous questions, I have some other problems:

  1. Do you use the average metrics or the overall metrics in your paper? I found these two are quite different.
  2. Do you set the hyperparameter "KMEANS_NUM_CLUSTERS" to 7 for all the datasets? Is this hyperparameter the same as the hyperparameter "K" in your paper? It would be better if you could provide your configuration file for each dataset.

As for the data problem, I will share some invalid URLs later. Hope to get your help!

ziyuwwang commented 1 year ago

Sorry, another question: I find the data transformation used when evaluating models confusing. When training a model, the input data is read from the h5 files, divided by 127.5, and then 1 is subtracted, which puts the input values in the range (-1, 1). But when performing procedure learning (evaluation), the input data is read from the h5 files and converted into PIL Image format; a normalization follows, and the values are then still divided by 127.5 with 1 subtracted. These operations turn the values into negative numbers with very small absolute values. I am not sure whether this is right. Can you check this for me?
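
For concreteness, here is a rough sketch of the double scaling I mean (my own illustration with a made-up frame, not the repo's actual code; I'm assuming the PIL/tensor conversion already rescales pixels to [0, 1] before the same /127.5 - 1 scaling is applied again):

```python
import numpy as np

# Made-up uint8 frame standing in for one image read from the h5 file.
frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)

# Training path: scaled once, giving values spread over (-1, 1).
train_input = frame.astype(np.float32) / 127.5 - 1.0

# Evaluation path as I understand it: the frame is first rescaled to [0, 1]
# during the PIL/tensor conversion, and then the same /127.5 - 1 scaling is
# applied on top of it, collapsing the dynamic range.
eval_input = (frame.astype(np.float32) / 255.0) / 127.5 - 1.0

print(train_input.min(), train_input.max())  # roughly -1.0 and 1.0
print(eval_input.min(), eval_input.max())    # all negative and nearly identical
```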

Sid2697 commented 1 year ago

@ziyuwwang Thanks for your questions! Here are the answers:

> Do you use the average metrics or the overall metrics in your paper? I found these two are quite different.

We use the average metrics for evaluating the models when we are not comparing with the previous works (Table 3 onwards). However, to make sure there is a fair comparison with the previous works, as mentioned in Section 5.2 in the paper, we use the overall metric when comparing in Table 2.

> Do you set the hyperparameter "KMEANS_NUM_CLUSTERS" to be 7 for all the datasets? Is this hyperparameter the same as the hyperparameter "K" in your paper?

Yes, you're correct. KMEANS_NUM_CLUSTERS is the same as K and is set to 7 for all the datasets (as determined from Table 6 in the paper). We only change this value when performing the hyper-parameter tuning reported in Table 6.
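
For intuition, this is roughly what the hyper-parameter controls (a rough illustration with placeholder embeddings, not the repo's actual code; I'm assuming the K clusters correspond to the discovered key-steps):

```python
import numpy as np
from sklearn.cluster import KMeans

KMEANS_NUM_CLUSTERS = 7  # the paper's K, kept at 7 for every dataset

# Placeholder frame embeddings standing in for the learned features of one task.
embeddings = np.random.rand(500, 128)

# Cluster the frames into K groups that act as candidate key-steps.
kmeans = KMeans(n_clusters=KMEANS_NUM_CLUSTERS, n_init=10, random_state=0)
key_step_labels = kmeans.fit_predict(embeddings)  # one cluster id per frame
```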

> It would be better if you can provide your configuration file for each dataset.

Thanks for the suggestion, I'll upload these files and notify you.

> Doubts on data transformation.

Thanks for pointing this out, and apologies for the confusion here. In our experiments, we have used transforms = None here: https://github.com/Sid2697/EgoProceL-egocentric-procedure-learning/blob/01fa3df0bb2af718030ca22bfcea7bfa26110080/RepLearn/TCC/procedure_learning.py#L95

This makes the data pre-processing similar to what is performed during the training. The transformations are only applied by this method: https://github.com/Sid2697/EgoProceL-egocentric-procedure-learning/blob/01fa3df0bb2af718030ca22bfcea7bfa26110080/RepLearn/TCC/utils.py#L44
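
Put differently, the intended behaviour is roughly the following (a simplified sketch with a hypothetical `preprocess` helper, not the actual code in the repo):

```python
import numpy as np
from PIL import Image

def preprocess(frame, transforms=None):
    # With transforms=None (our setting), evaluation uses the same scaling as
    # training: raw [0, 255] pixels are mapped to [-1, 1] exactly once.
    if transforms is None:
        return frame.astype(np.float32) / 127.5 - 1.0
    # Only when a torchvision-style pipeline is passed in is the frame
    # converted to a PIL image and handed to that pipeline instead.
    return transforms(Image.fromarray(frame))
```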

Let me know if there are further questions!

ziyuwwang commented 1 year ago

@Sid2697 Thank you for taking the time to answer my questions. However, I still have several questions:

  1. Do you train and evaluate the model on the same dataset? For example, train and evaluate the model both on PC_Assembly.
  2. Following on from the first question: if so, I cannot reproduce the performance reported in the paper. Can you provide a guide on how to reproduce the results?
  3. I found a possible mistake in the code. In line 67 of "RepLearn/TCC/models.py", why is the tensor reshaped directly? I think the correct code should be "x = x.view(-1, self.num_context_steps, c, h, w).permute(0, 2, 1, 3, 4)" instead of "x = x.contiguous().view(-1, c, self.num_context_steps, h, w)". The current code splits the tensor incorrectly (see the sketch below). I noticed that you took this code from an unofficial repository; can you check it? I don't think it is a correct implementation of the original TCC method.
  4. Maybe because of the third question, I found that setting "num_context_steps" to 1 gives a much better result than setting it to values larger than 1. I also found that using adaptive average pooling instead of global 3D max pooling leads to a much better result.
  5. About the context frame sampling: you sample the context frames from future frames during training but from past frames during evaluation. Is this right? In the original TCC method, I found they sample the context frames only from past frames, both in training and in evaluation.

Hope you can resolve my questions. Thanks!
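
To make point 3 concrete, here is a small self-contained example (my own sketch with made-up shapes, not the repo's code) showing that the direct view mixes the channel and context-step axes, while view followed by permute keeps them separate:

```python
import torch

num_context_steps, c, h, w = 2, 3, 4, 4

# Fake batch: 5 sampled frames, each stacked with its context step(s),
# so consecutive entries along dim 0 belong to the same frame.
x = torch.arange(5 * num_context_steps * c * h * w, dtype=torch.float32)
x = x.view(5 * num_context_steps, c, h, w)

# Current code: reinterprets the memory directly, so channel values end up
# in the context-step axis and vice versa.
a = x.contiguous().view(-1, c, num_context_steps, h, w)

# Suggested fix: first separate (frame, context_step), then move the
# context-step axis next to the spatial dims for the 3D convolution.
b = x.view(-1, num_context_steps, c, h, w).permute(0, 2, 1, 3, 4)

print(a.shape, b.shape)   # both torch.Size([5, 3, 2, 4, 4])
print(torch.equal(a, b))  # False: the two layouts hold different values
```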

Sid2697 commented 1 year ago

Hello @ziyuwwang, thanks for your response! Here are my answers:

  1. Yes, as ours is a completely unsupervised approach, we train and evaluate on the same dataset following https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123620545.pdf (single task setting from their paper).
  2. Following the README to train and evaluate the models will generate the numbers reported in the paper. Can you please share the numbers you got?
  3. Can you please create an issue about this in the said repository? We can have a discussion with the original implementors there.
  4. I see! We can come back to this point once we resolve point 3.
  5. Yes, you're correct. As we already have access to the entire video, we can select context frames from the future too (rough sketch below).
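
To make the difference in point 5 concrete, here is a rough, hypothetical sketch (not the actual sampling code in the repo) of picking context frames from frames after the current one versus before it:

```python
import numpy as np

def sample_context_indices(frame_idx, num_frames, num_context_steps, future=True):
    # Hypothetical illustration only: pair a sampled frame with context frames
    # taken from the future (what we do during training) or from the past
    # (what the original TCC implementation does).
    offsets = np.arange(1, num_context_steps)
    if future:
        context = np.minimum(frame_idx + offsets, num_frames - 1)
    else:
        context = np.maximum(frame_idx - offsets, 0)
    return np.concatenate(([frame_idx], context))

print(sample_context_indices(10, 100, num_context_steps=2, future=True))   # [10 11]
print(sample_context_indices(10, 100, num_context_steps=2, future=False))  # [10  9]
```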

I would like to look deeper into the issue in point 3; however, I'm currently occupied with my PhD projects, so it will take me a while to go through the code and dig deeper. That is why I recommend reaching out to the original repository for a faster resolution.

Sid2697 commented 1 year ago

@ziyuwwang this paper might be of interest to you: https://arxiv.org/pdf/2301.00794.pdf