alexa / teach

TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

Evaluating Episodic Transformer baselines for EDH instances gives zero successes #8

Open dv-fenix opened 2 years ago

dv-fenix commented 2 years ago

Hi!

I have been trying to replicate the results of the Episodic Transformer (ET) baselines for the EDH benchmark. The inference script runs without any errors, but the ET baselines provided with this repository give zero successes on both validation EDH splits (valid_seen and valid_unseen).

This behavior can be replicated using the instructions in the ET root directory (found here), specifically the following script:

CUDA_VISIBLE_DEVICES=4,5,6,7 python3 src/teach/cli/inference.py \
    --model_module teach.inference.et_model \
    --model_class ETModel \
    --data_dir $ET_DATA \
    --images_dir $IMAGES_DIR \
    --output_dir $INFERENCE_OUTPUT_PATH/inference__teach_et_trial \
    --split valid_seen \
    --metrics_file $INFERENCE_OUTPUT_PATH/metrics__teach_et_trial.json \
    --seed 4 \
    --model_dir $ET_DATA/baseline_models/et \
    --num_processes 50 \
    --object_predictor $ET_LOGS/pretrained/maskrcnn_model.pth \
    --visual_checkpoint $ET_LOGS/pretrained/fasterrcnn_model.pth \
    --device "cuda"

I also tried training the basic ET baseline from scratch. Running the evaluation script on this model also leads to zero successes.

aishwaryap commented 2 years ago

Hi,

While we do see some variance in performance, I don't think I have ever gotten a success rate of 0. Could you upload the saved metrics file somewhere and share the URL?

Thanks! Aishwarya

dv-fenix commented 2 years ago

Hi @aishwaryap,

The metrics file can be found here. Please let me know if you need any other information from my side!

Thanks, Divyam

aishwaryap commented 2 years ago

Hi @dv-fenix ,

I've noticed a few things from your command and metrics file. First, 50 processes is likely more than most machines can handle. In our testing, we have always set the number of processes equal to the number of GPUs.

I also noticed from your metrics file that only 64 EDH instances have been evaluated (the metrics file is JSON; if you load it into an object o, then o['traj_stats'] holds the metrics for each EDH instance). This suggests that many of the processes you created were killed due to memory or other resource limits. In addition, the predicted action sequences for all 64 of those EDH instances appear to be empty, which suggests you are also running into some other error.
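For reference, a minimal sketch of that kind of check (only the top-level traj_stats key comes from the comment above; the assumption that it maps EDH instance IDs to per-instance metrics may need adjusting):

# Minimal sketch for inspecting the saved metrics file.
# Only the 'traj_stats' key is taken from the discussion above; whether it maps
# EDH instance IDs to per-instance metrics is an assumption.
import json

with open("metrics__teach_et_trial.json") as f:   # path from --metrics_file above
    o = json.load(f)

traj_stats = o["traj_stats"]
print(f"EDH instances evaluated: {len(traj_stats)}")

# Peek at one entry to see which fields (e.g. predicted actions) are populated.
instance_id, instance_metrics = next(iter(traj_stats.items()))
print(instance_id, instance_metrics)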

Since we have try/except blocks around the _run_edh_instance function in inference_runner.py, you are likely not seeing the traces of any errors you are running into. This is intentional: when we run evaluation for the SimBot Offline Challenge, an error in one EDH instance should not prevent evaluation of the rest. For development, however, it can be helpful to comment out those try/except blocks to see which errors are blocking you.

It would be great if you could do another run with --num_processes 1 after removing or commenting out the try/except blocks in _run_edh_instance. If you cannot resolve the resulting error, please follow up on this thread. If you get no errors, it would help to save as much of the terminal output as possible to a text file so we can identify the issue. I recommend doing this debugging on a smaller version of the dataset containing, say, 5-10 EDH instances (you do need to mimic the original folder structure and have the corresponding game files and image folders; a rough sketch of building such a subset follows). Given the low success rate of the ET baseline model, you will likely still get a success rate of 0 on that sample, but it is worth confirming that you get non-empty predicted action sequences before running a full round of inference.
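A rough sketch of carving out such a subset is below. The folder layout (edh_instances/<split> and games/<split> under the data dir, per-game image folders under the images dir), the game-file naming, and the game_id field in each EDH instance are assumptions about the local data layout, so adjust them to match your copy:

# Hypothetical sketch for building a small debugging subset of EDH instances.
# The directory layout, game-file naming, and the "game_id" field are
# assumptions; adjust to match your local TEACh data.
import json
import shutil
from pathlib import Path

SRC_DATA = Path("/path/to/ET_DATA")        # original --data_dir
SRC_IMAGES = Path("/path/to/IMAGES_DIR")   # original --images_dir
DST_DATA = Path("/path/to/ET_DATA_small")
DST_IMAGES = Path("/path/to/IMAGES_DIR_small")
SPLIT, N = "valid_seen", 5

for edh_file in sorted((SRC_DATA / "edh_instances" / SPLIT).glob("*.json"))[:N]:
    game_id = json.loads(edh_file.read_text())["game_id"]

    # EDH instance definition
    edh_dst = DST_DATA / "edh_instances" / SPLIT
    edh_dst.mkdir(parents=True, exist_ok=True)
    shutil.copy(edh_file, edh_dst)

    # corresponding game file (naming assumed)
    game_dst = DST_DATA / "games" / SPLIT
    game_dst.mkdir(parents=True, exist_ok=True)
    shutil.copy(SRC_DATA / "games" / SPLIT / f"{game_id}.game.json", game_dst)

    # corresponding image folder
    img_dst = DST_IMAGES / SPLIT / game_id
    if not img_dst.exists():
        shutil.copytree(SRC_IMAGES / SPLIT / game_id, img_dst)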

Hope this helps, Aishwarya

P.S.: I am on leave for most of this week, so responses to issues may be slower than usual.

dv-fenix commented 2 years ago

Cool! I'll try it out and let you know how it goes.

aishwaryap commented 2 years ago

Hi @dv-fenix

In addition to using fewer processes, I recommend pulling the latest version of the code. There was a small bug that caused non-Docker inference to error out, and it has now been fixed.

Best, Aishwarya

dv-fenix commented 2 years ago

Hi @aishwaryap

I ran inference with a pretrained ET baseline model using --num_processes 1 on a small subset of EDH instances. I also removed the try/except blocks around the _run_edh_instance function in inference_runner.py. The final results for the ET baseline are as follows:

-------------
SR: 0/8 = 0.000
GC: 2/38 = 0.053
PLW SR: 0.000
PLW GC: 0.009
-------------
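
For anyone reading along, a rough note on how these numbers relate; the path-length weighting below follows the ALFRED-style formula and is an assumption about this codebase:

# Toy recomputation of the headline numbers above. The path-length weighting
# is assumed to be ALFRED-style: each 0..1 score is scaled by
# reference_path_length / max(reference_path_length, predicted_path_length).
def path_length_weighted(score, ref_len, pred_len):
    return score * ref_len / max(ref_len, pred_len)

num_success, num_instances = 0, 8      # SR  = 0/8  = 0.000
gc_satisfied, gc_total = 2, 38         # GC  = 2/38 ~ 0.053

sr = num_success / num_instances
gc = gc_satisfied / gc_total
print(f"SR: {sr:.3f}  GC: {gc:.3f}")
# PLW SR / PLW GC average the path-length-weighted per-instance scores,
# so they are always <= the unweighted metrics (0.000 and 0.009 above).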

There were no errors in the execution of the process. You can find the metrics file and the terminal output file in this folder.

Your initial thought about there being too many processes may be correct. I am thinking of running inference on the entire dataset with --num_processes 2 on 4 GPUs to confirm this. Please let me know if you find anything out of place in the metrics file or the terminal output; any insight would be helpful!

Thanks, Divyam

dv-fenix commented 2 years ago

Hi @aishwaryap

I tried running inference on 4 GPUs using --num_processes 2 on the full data. While I did get some successful instances this time around, the process still failed to get through all 605 instances. One of the processes errored out with the following trace:

Process Process-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/guest_107/vl_nav/teach/src/teach/inference/inference_runner.py", line 121, in _run
    instance_id, instance_metrics = InferenceRunner._run_edh_instance(instance_file, config, model, er)
  File "/home/guest_107/vl_nav/teach/src/teach/inference/inference_runner.py", line 155, in _run_edh_instance
    check_first_return_value=True,
  File "/home/guest_107/vl_nav/teach/src/teach/utils.py", line 370, in with_retry
    raise last_exception
  File "/home/guest_107/vl_nav/teach/src/teach/utils.py", line 359, in with_retry
    output = fn()
  File "/home/guest_107/vl_nav/teach/src/teach/inference/inference_runner.py", line 152, in <lambda>
    edh_instance, game_file, edh_check_task, config.replay_timeout, er
  File "/home/guest_107/vl_nav/teach/src/teach/inference/inference_runner.py", line 263, in _initialize_episode_replay
    init_success, _ = future.result(timeout=replay_timeout)
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 437, in result
    raise TimeoutError()
concurrent.futures._base.TimeoutError

I am using the updated version of the repository. This error is only thrown when --num_processes is set to a value greater than 1. Please let me know if there is a way to make inference work with multiple processes, as it would save a lot of time in evaluating the model.

Thanks, Divyam

hangjieshi commented 2 years ago

@dv-fenix How many instances were successfully processed? Did you check the GPU memory usage (command: nvidia-smi)? It's possible that one of the GPUs ran out of memory. If that is the case, you could try using all of the available GPUs.

dv-fenix commented 2 years ago

Hi @hangjieshi

A total of 68 instances were successfully processed. I checked the GPU memory usage multiple times during the run, and none of the GPUs ran out of memory at any stage. Apart from the manual checks, I also analysed the terminal output file: had a GPU run out of memory, a RuntimeError: CUDA error: out of memory would have been thrown, but it was not.

hangjieshi commented 2 years ago

@dv-fenix It might be a hanging problem in AI2-THOR. Can you please try increasing the replay_timeout? If this doesn't help, it's likely an AI2-THOR issue (which we don't have control over), and you would need to rely on restarts.

aishwaryap commented 2 years ago

Hi @dv-fenix

Just wanted to elaborate on the above response.

It's hard to be entirely sure, but I think you have run into a thorny AI2-THOR issue that we have struggled with throughout this project. The problem is related to issues 903, 745, and 711 on AI2-THOR (there are probably more), though the behavior we see is not exactly the same as what is described in those issues. Essentially, during data collection, episode replay, inference, or anything else that requires interacting with the AI2-THOR simulator, the process sometimes just hangs. Specifically, we can trace it to the call to ai2thor.controller.start() getting stuck.

We do not currently have an open issue with AI2-THOR on this because, at the time we were struggling with it, our code was not public and we were unable to create an MWE that reproduced the problem. By the time we released the TEACh code, we had stopped facing this issue in our tests, so we aren't entirely sure why you're facing it. However, if we can clearly show that this is an AI2-THOR problem, it should be raised as an issue on the ai2thor repo.

One step towards verifying that the issue is indeed from AI2-THOR is to increase InferenceRunnerConfig.replay_timeout here. If it is an AI2-THOR issue, it will persist even if you increase the timeout. If you stop facing the issue when you increase the timeout, this suggests that the machine you are using cannot handle the number of AI2-THOR + model threads you are trying to run.
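
For context, the TimeoutError in the trace above comes from a pattern roughly like the following (a simplified sketch, not the actual inference_runner.py code): the episode replay is started in a worker and abandoned if it does not finish within replay_timeout, so a hung ai2thor.controller.start() surfaces as a TimeoutError rather than a silent hang.

# Simplified sketch of the timeout pattern behind the TimeoutError above;
# not the actual inference_runner.py code.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

REPLAY_TIMEOUT = 1000  # seconds; the value being tuned here

def initialize_episode_replay():
    # Stand-in for starting the AI2-THOR controller and replaying the episode.
    # If the simulator hangs here, the future below never completes.
    time.sleep(1)
    return True

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(initialize_episode_replay)
    try:
        init_success = future.result(timeout=REPLAY_TIMEOUT)
    except TimeoutError:
        init_success = False  # replay never finished: retry or skip the instance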

In either case, it is not strictly necessary to evaluate all EDH instances in a single run. If you rerun the inference command keeping --output_dir fixed, it will not re-evaluate EDH instances that have already been completed, so it should eventually finish over a few runs. You can then use eval.py here to compute metrics from the saved inference files of all runs.
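
If it helps to track progress across restarted runs, here is a quick sketch of counting how many instances have been covered so far (the filenames are hypothetical, and this assumes each run's metrics JSON has the traj_stats mapping mentioned above; it is not a substitute for eval.py):

# Rough progress check across resumed runs: union the per-instance entries
# from each run's metrics file. Filenames are hypothetical, and 'traj_stats'
# is assumed to map EDH instance IDs to per-instance metrics.
import json

metrics_files = [
    "metrics__teach_et_trial_run1.json",
    "metrics__teach_et_trial_run2.json",
]

seen = {}
for path in metrics_files:
    with open(path) as f:
        seen.update(json.load(f)["traj_stats"])

print(f"EDH instances evaluated so far: {len(seen)} / 605")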

dv-fenix commented 2 years ago

Hi @aishwaryap

Thank you for your insights. I tried using replay_timeout = 1000 and ran into the same issue, with multiple processes dying on a TimeoutError. This indicates that the issue lies with AI2-THOR rather than TEACh. From now on, I will run inference with a single process to get the results.