Testing issues on TACoS

Soldelli / VLG-Net

VLG-Net: Video-Language Graph Matching Networks for Video Grounding

MIT License


Testing issues on TACoS #8

Closed by TensorsSun 1 year ago

TensorsSun commented 1 year ago

Hello, when I test on the TACoS dataset (bash scripts/tacos.sh), I encounter the following problem:

Traceback (most recent call last):
  File "/data1/xiaolong/code/VLG-Net/test_net.py", line 101, in <module>
    main()
  File "/data1/xiaolong/code/VLG-Net/test_net.py", line 89, in main
    inference(
  File "/data1/xiaolong/code/VLG-Net/lib/engine/inference.py", line 116, in inference
    return evaluate(dataset=dataset, predictions=predictions, nms_thresh=nms_thresh, iou_metrics=iou_metrics) 
  File "/data1/xiaolong/code/VLG-Net/lib/data/datasets/evaluation.py", line 82, in evaluate
    results = Parallel(n_jobs=num_cpu)(delayed(_eval_parallel)(
  File "/data1/xiaolong/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/data1/xiaolong/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/data1/xiaolong/anaconda3/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/data1/xiaolong/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/data1/xiaolong/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
RuntimeError: stack expects a non-empty TensorList

I used the VS Code debugger with breakpoints, but I still don't know what's wrong. Can you give me some advice? Thank you very much @Soldelli

Soldelli commented 1 year ago

Hi @TensorsSun, debugging parallel code is not trivial. I suggest modifying the code to remove the parallel computation. To do so, you can change this code to a simple for loop, removing the Parallel(n_jobs=num_cpu)(delayed(_eval_parallel)(...)) wrapper.

Then you can easily debug using ipdb or the VS Code built-in debugger. Try that and get back to me if you still can't find the issue.
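
A minimal sketch of that change, for illustration only. The worker `_eval_chunk` and the toy chunks are hypothetical stand-ins, not the repository's `_eval_parallel` or its data; the point is only that a plain loop raises the exception in the main process, where a debugger can stop on the failing idx.

```python
def _eval_chunk(idx, chunk):
    """Hypothetical stand-in for the repository's worker function."""
    if not chunk:
        # Mirrors the failure mode in the traceback: nothing to aggregate.
        raise RuntimeError(f"chunk {idx} is empty")
    return sum(chunk)


def evaluate_serial(chunks):
    # Parallel form (joblib), roughly as it appears in the traceback:
    #   results = Parallel(n_jobs=num_cpu)(delayed(worker)(...) for ...)
    # Serial form: the same calls in the same order, but an exception
    # surfaces directly in this loop instead of inside a worker process.
    results = []
    for idx, chunk in enumerate(chunks):
        results.append(_eval_chunk(idx, chunk))
    return results


if __name__ == "__main__":
    print(evaluate_serial([[1, 2], [3], [4, 5, 6]]))
```

With the loop in place, a breakpoint inside the worker (or a conditional breakpoint on a specific idx) is hit like in any single-process program.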

Best, Mattia

TensorsSun commented 1 year ago

Thank you for your suggestion @Soldelli. After changing this code to a simple for loop, I found that the program throws an error when idx == 126 (num_cpu == 128). The problem lies in this line of code:

len(predictions) == 4001, but the indices used to slice the variable predictions in this line of code exceed 4001, which results in prediction_[126] and prediction_[127] being two empty lists. So in the _eval_parallel() function, the variable out is also an empty list, which causes return torch.stack(out,dim=0).sum(dim=0).numpy() to raise the error.

I worked around this problem by discarding the variable out when idx is 126 or 127.

Thank you very much!
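
The arithmetic can be reproduced with a small sketch. The thread does not show the repository's grouping code, so this assumes fixed-size slices of ceil(len / num_cpu) items; `split_fixed_size` is a hypothetical helper. Under that assumption, 4001 items and 128 buckets give a slice size of 32, so 125 full buckets cover 4000 items, bucket 125 holds the last item, and buckets 126 and 127 are empty.

```python
import math


def split_fixed_size(items, num_buckets):
    """Hypothetical grouping: every bucket is a slice of the same size."""
    size = math.ceil(len(items) / num_buckets)  # ceil(4001 / 128) == 32
    # Slicing past the end of a list returns [] rather than raising.
    return [items[i * size:(i + 1) * size] for i in range(num_buckets)]


predictions = list(range(4001))
buckets = split_fixed_size(predictions, 128)
empty = [i for i, bucket in enumerate(buckets) if not bucket]

print(len(buckets[125]))  # 1
print(empty)              # [126, 127]
```

Passing one of those empty buckets to a worker that ends with torch.stack on its collected results would produce the "stack expects a non-empty TensorList" error shown above.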

Soldelli commented 1 year ago

I would say that as a temporary fix this is acceptable; however, it seems the issue arises before that. Why are those lists empty if we are simply grouping the predictions into N buckets (N=num_cpu)?

Make sure you are computing the performance over all test samples, or you might get inconsistent numbers.
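
One possible way to address both points, sketched under assumptions: `split_balanced` is a hypothetical helper, not the repository's code. It spreads the remainder over the first buckets so that no bucket is empty, caps the bucket count at the number of items, and the final checks confirm that every test sample is still evaluated exactly once.

```python
def split_balanced(items, num_buckets):
    """Hypothetical grouping: contiguous buckets whose sizes differ by at most 1."""
    if not items:
        return []
    # Never create more buckets than there are items.
    num_buckets = max(1, min(num_buckets, len(items)))
    base, extra = divmod(len(items), num_buckets)
    buckets, start = [], 0
    for i in range(num_buckets):
        # The first `extra` buckets take one additional item.
        end = start + base + (1 if i < extra else 0)
        buckets.append(items[start:end])
        start = end
    return buckets


predictions = list(range(4001))
buckets = split_balanced(predictions, 128)

# No bucket is empty, and all samples are covered in their original order.
assert all(buckets)
assert [p for bucket in buckets for p in bucket] == predictions
print(len(buckets), len(buckets[0]), len(buckets[-1]))
```

Unlike dropping the results for idx 126 and 127, this keeps the number of evaluated samples equal to len(predictions), which is what the reported metrics should be computed over.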