Closed: TensorsSun closed this issue 1 year ago.
Hi @TensorsSun debugging parallel code is not trivial. I suggest modifying the code to remove the parallel computation.
To do so, change this code to a simple for loop, removing the `Parallel(n_jobs=num_cpu)(delayed(_eval_parallel)(...))` wrapper.
Then you can easily debug using ipdb or VS Code's built-in debugger. Try that and get back to me if you still can't find the issue.
Best, Mattia
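For instance, the joblib call can be replaced with a plain list comprehension that produces identical results but is debugger-friendly. A minimal sketch (the `_eval_parallel` here is a stand-in, not the repository's actual function, and `chunks` is illustrative):

```python
def _eval_parallel(chunk):
    # Stand-in for the repository's _eval_parallel (illustrative only).
    return sum(chunk)

chunks = [[1, 2], [3, 4], [5]]

# Original parallel version (requires joblib):
#   from joblib import Parallel, delayed
#   results = Parallel(n_jobs=num_cpu)(delayed(_eval_parallel)(c) for c in chunks)

# Debug-friendly serial version: same results, but breakpoints,
# ipdb.set_trace(), and plain stack traces now work as usual.
results = [_eval_parallel(c) for c in chunks]
print(results)  # [3, 7, 5]
```

Once the bug is found and fixed, the `Parallel(...)` line can be restored unchanged.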
Thank you for your suggestion @Soldelli. After changing the code to a simple for loop, I found that the program throws an error when `idx == 126` (with `num_cpu == 128`). The problem lies in this line of code:

Because `len(predictions) == 4001`, the indices used to slice `predictions` in this line go past 4001, which leaves `prediction_[126]` and `prediction_[127]` as two empty lists. So in the `_eval_parallel()` function, the variable `out` is also an empty list, which causes `return torch.stack(out, dim=0).sum(dim=0).numpy()` to throw an error.

I solved this problem by discarding the variable `out` when the `idx` value is 126 or 127.

Thank you very much!
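The empty buckets are consistent with ceil-sized chunking: with 4001 predictions split across 128 workers, each chunk holds ceil(4001/128) = 32 items, so chunks 126 and 127 start past the end of the list. A sketch of the arithmetic, assuming that chunking scheme (the actual splitting code may differ):

```python
import math

num_preds = 4001  # len(predictions)
num_cpu = 128

chunk_size = math.ceil(num_preds / num_cpu)  # 32
buckets = [list(range(num_preds))[i * chunk_size:(i + 1) * chunk_size]
           for i in range(num_cpu)]

# Bucket 126 starts at 126 * 32 = 4032 > 4001, so the last two are empty.
empty = [i for i, b in enumerate(buckets) if not b]
print(chunk_size, empty)  # 32 [126, 127]
```

This is why the error appears only at `idx == 126`: every earlier bucket gets at least one prediction.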
I would say that as a temporary fix this is acceptable; however, it seems the issue arises earlier. Why are those lists empty if we are simply grouping the predictions into N buckets (N = num_cpus)?
Make sure you are computing the performance over all test samples, or you might get inconsistent numbers.
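One way to check that point is to assert that the non-empty buckets together still cover every test sample. Illustrative only, assuming the same ceil-based chunking as above (`split` is a hypothetical helper, not from the repository):

```python
import math

def split(items, n_jobs):
    # Hypothetical helper mirroring ceil-based chunking.
    size = math.ceil(len(items) / n_jobs)
    return [items[i * size:(i + 1) * size] for i in range(n_jobs)]

predictions = list(range(4001))
buckets = split(predictions, 128)

# Skipping the empty buckets (idx 126, 127) loses nothing, but the total
# count must still equal len(predictions) or the metrics are inconsistent.
kept = [b for b in buckets if b]
assert sum(len(b) for b in kept) == len(predictions)
print(len(kept))  # 126
```

If that assertion fails, samples are being dropped and the reported numbers cannot be trusted.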
Hello, when I tested on the TACoS dataset (`bash scripts/tacos.sh`), I encountered the following problem:
I used VS Code's debug function to set breakpoints, but I still don't know what's wrong. Can you give me some advice? Thank you very much @Soldelli