facebookresearch / habitat-challenge

Code for the Habitat Challenge
https://aihabitat.org
MIT License

test-std Phase Submission Error (code runs successfully on minival) #82

Closed bucherb closed 2 years ago

bucherb commented 3 years ago

I submitted my code successfully to the minival challenge phase:

evalai push [my docker image] --phase habitat21-objectnav-minival --private

My submission completed successfully, and I could view my results.

However, when I submit to the test-std challenge phase with the command below, I get the error copied at the bottom of this post, and the status of my submission is "Failed".

evalai push [my docker image] --phase habitat21-objectnav-test-std --private

I saw this same error on the minival phase the first time I submitted. I thought that the error meant that my code ran for too long, and the job was killed because it timed out. I sped up my code, resubmitted, and the error went away. Then, my submission successfully completed on the minival phase.

This error happens after my code runs for 30 minutes on the test-std phase. I previously saw this error after 30 minutes running on the minival phase. My understanding is that our submissions on the test-std phase have 48 hours to complete. There are no submissions visible yet on the test-std public leaderboard, so I do not know if anyone else ran their code successfully on this phase.

In summary, my questions are:

  1. Does this error indicate a timeout for my job?
  2. If so, how long does our code have to run on the test-std phase?
  3. If not, do you have any insight on what this error indicates?

Thank you for any help you can provide!

Error message:

Traceback (most recent call last):
  File "challenge_agent.py", line 308, in <module>
    main()
  File "challenge_agent.py", line 304, in main
    challenge.submit(agent)
  File "/habitat-lab/habitat/core/challenge.py", line 19, in submit
    metrics = super().evaluate(agent)
  File "/habitat-lab/habitat/core/benchmark.py", line 163, in evaluate
    return self.remote_evaluate(agent, num_episodes)
  File "/habitat-lab/habitat/core/benchmark.py", line 93, in remote_evaluate
    SerializedEntity=pack_for_grpc(action)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/grpcio-1.36.0rc1-py3.6-linux-x86_64.egg/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/grpcio-1.36.0rc1-py3.6-linux-x86_64.egg/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1620152076.442809572","description":"Error received from peer ipv4:127.0.0.1:8085","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Socket closed","grpc_status":14}"
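
For what it's worth, the error is raised on the client side of the gRPC call to the local evaluation server at 127.0.0.1:8085. The snippet below is a minimal, illustrative sketch of the same client-side failure mode (the RPC method path is a placeholder, not the real habitat-lab service name):

import grpc

# Illustrative only: reproduce the client-side view of a dropped
# evaluation-server connection. Nothing habitat-specific here; the
# method path is a placeholder.
channel = grpc.insecure_channel("127.0.0.1:8085")
rpc = channel.unary_unary("/placeholder.Environment/act_on_environment")

try:
    rpc(b"", timeout=5)
except grpc.RpcError as err:
    # The call terminates with StatusCode.UNAVAILABLE; the details vary
    # ("Socket closed" when an established connection is dropped,
    # "failed to connect to all addresses" when nothing is listening).
    print(err.code(), err.details())

So the message only tells us that the server end of the socket went away; it does not say why (the server process may have been killed, crashed, or hit a timeout on its side).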

dhruvbatra commented 3 years ago

CC: @RishabhJain2018 and @mathfac

mathfac commented 3 years ago

@bucherb, thank you for letting us know. Are you sure you haven't reached the 5-submission limit for the challenge phase? Thank you!

bucherb commented 3 years ago

Thanks for the quick response! I am submitting to Test-Standard, not Test-Challenge. On Test-Standard, this is my first of a maximum of 5 daily submissions, and my first of 9999 total submissions. I have not yet submitted to the Test-Challenge phase.

karkuspeter commented 3 years ago

I have experienced a similar issue with all my submissions to PointNav test-std. After an arbitrary number of episodes, I get the following error in the stderr output:

Traceback (most recent call last):
  File "agent.py", line 269, in <module>
    main()
  File "agent.py", line 245, in main
    challenge.submit(agent)
  File "/habitat-lab/habitat/core/challenge.py", line 19, in submit
    metrics = super().evaluate(agent)
  File "/habitat-lab/habitat/core/benchmark.py", line 163, in evaluate
    return self.remote_evaluate(agent, num_episodes)
  File "/habitat-lab/habitat/core/benchmark.py", line 93, in remote_evaluate
    SerializedEntity=pack_for_grpc(action)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/grpcio-1.36.0-py3.6-linux-x86_64.egg/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/grpcio-1.36.0-py3.6-linux-x86_64.egg/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1619253167.394935390","description":"Error received from peer ipv4:127.0.0.1:8085","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Socket closed","grpc_status":14}"

mathfac commented 3 years ago

Thank you @karkuspeter, we are looking into this.

ruizvitor commented 3 years ago

I have experienced a similar issue when submitting to the ObjectNav Test-Standard phase. According to EvalAI, the execution time was 1839.163746 seconds and the submission resulted in Failed. It was my first submission to this phase. I previously submitted successfully to the minival phase, where the execution time was reported as 497.966314 seconds.

Traceback (most recent call last):
  File "beyond_agent/eval.py", line 154, in <module>
    main()
  File "beyond_agent/eval.py", line 150, in main
    challenge.submit(agent)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/habitat/core/challenge.py", line 19, in submit
    metrics = super().evaluate(agent)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/habitat/core/benchmark.py", line 163, in evaluate
    return self.remote_evaluate(agent, num_episodes)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/habitat/core/benchmark.py", line 93, in remote_evaluate
    SerializedEntity=pack_for_grpc(action)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/grpcio-1.36.0-py3.6-linux-x86_64.egg/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/grpcio-1.36.0-py3.6-linux-x86_64.egg/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1620835640.883601407","description":"Error received from peer ipv4:127.0.0.1:8085","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Socket closed","grpc_status":14}"

karkuspeter commented 3 years ago

Hi @mathfac, @RishabhJain2018, do you have any update on this issue? The challenge deadline is in 2 weeks and this is now blocking my submission. Thank you!

RishabhJain2018 commented 3 years ago

Hi @ruizvitor, sorry for the inconvenience caused. The issue is fixed. Can you please try submitting again?

RishabhJain2018 commented 3 years ago

Hi @bucherb, sorry for the inconvenience caused. The issue is fixed. Can you please try submitting again?

RishabhJain2018 commented 3 years ago

Hi @karkuspeter, sorry for the inconvenience caused. The issue is fixed. Can you please try submitting again?

bucherb commented 3 years ago

Thanks @RishabhJain2018! I just submitted again. I will update you when it successfully completes.

karkuspeter commented 3 years ago

Hi @RishabhJain2018, thanks for checking. I have made a new submission, but it has failed with the same error after 2.5 hours. The submission ID is 144526.

Traceback (most recent call last):
  File "agent.py", line 271, in <module>
    main()
  File "agent.py", line 247, in main
    challenge.submit(agent)
  File "/habitat-lab/habitat/core/challenge.py", line 19, in submit
    metrics = super().evaluate(agent)
  File "/habitat-lab/habitat/core/benchmark.py", line 163, in evaluate
    return self.remote_evaluate(agent, num_episodes)
  File "/habitat-lab/habitat/core/benchmark.py", line 93, in remote_evaluate
    SerializedEntity=pack_for_grpc(action)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/grpcio-1.36.0-py3.6-linux-x86_64.egg/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/conda/envs/habitat/lib/python3.6/site-packages/grpcio-1.36.0-py3.6-linux-x86_64.egg/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1621361245.558768157","description":"Error received from peer ipv4:127.0.0.1:8085","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Socket closed","grpc_status":14}"

bucherb commented 3 years ago

My submission finished successfully! Thank you so much!

karkuspeter commented 3 years ago

I'm still repeatedly getting the same error. I'm submitting to the PointNav track, not ObjectNav, if that's relevant.

RishabhJain2018 commented 3 years ago

Hi @karkuspeter, we are trying to debug your issue. In the meantime, if possible, can you please submit to Habitat Challenge 2020 -- https://eval.ai/web/challenges/challenge-page/580/overview -- and see if it runs successfully in that case? cc: @dhruvbatra @mathfac

karkuspeter commented 3 years ago

Thanks @RishabhJain2018. I have tried submitting to the 2020 challenge, but the status has stayed "Submitted" for the last 12 hours. I will report back if it gets scheduled.

ruizvitor commented 3 years ago

Hi @ruizvitor, sorry for the inconvenience caused. The issue is fixed. Can you please try submitting again?

Thank you @RishabhJain2018, I also submitted successfully to the ObjectNav Test-Standard phase.

karkuspeter commented 3 years ago

Hi @RishabhJain2018, my submission to the 2020 challenge completed successfully. The same code (with different neural net weights) keeps failing for the 2021 challenge. The same error appears after an arbitrary amount of time.

mathfac commented 3 years ago

Hi @karkuspeter, we made several fixes. Did you try resubmitting, and if so, was it successful?

karkuspeter commented 3 years ago

Hi @mathfac, @RishabhJain2018, thanks for trying to get to the bottom of this.

I have had mixed success. One version of my agent succeeded, but a better version that uses a planner in a separate thread still failed (145633). The failed submission was made ~4 days ago and the container was running for 76 hours, but according to stderr the usual InactiveRpcError had already occurred after a couple of hours. The same code consistently succeeded for the 2020 challenge in the past.
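
(For reference, the background-planner structure is roughly like the sketch below. It is simplified and illustrative rather than the actual submission code; the point of the pattern is just that act() always returns something within a bounded time, even when the planner is slow.)

import queue
import threading

class ThreadedPlannerAgent:
    """Simplified sketch: planning runs in a background thread and
    act() falls back to a default action when no fresh plan arrives
    within the per-step budget."""

    def __init__(self, plan_fn, default_action="MOVE_FORWARD", step_budget_s=0.5):
        self._plan_fn = plan_fn            # the expensive planning routine
        self._default_action = default_action
        self._step_budget_s = step_budget_s
        self._requests = queue.Queue(maxsize=1)
        self._results = queue.Queue(maxsize=1)
        threading.Thread(target=self._plan_loop, daemon=True).start()

    def _plan_loop(self):
        while True:
            obs = self._requests.get()
            try:
                action = self._plan_fn(obs)
            except Exception:
                action = None              # never let the worker die silently
            self._results.put(action)

    def reset(self):
        # Drop any stale plan left over from the previous episode.
        while not self._results.empty():
            self._results.get_nowait()

    def act(self, observations):
        # Hand the latest observation to the planner if it is idle.
        if self._requests.empty():
            try:
                self._requests.put_nowait(observations)
            except queue.Full:
                pass
        # Wait a bounded amount of time for a plan, then fall back.
        try:
            action = self._results.get(timeout=self._step_budget_s)
        except queue.Empty:
            action = None
        return {"action": action or self._default_action}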

I have now resubmitted the same code in case you made fixes in the last 3 days, and will report back on the result.

bucherb commented 3 years ago

It's interesting that your submissions sit in the Running state and then end with the InactiveRpcError. I posted this issue on the EvalAI forum: I am getting enormously varying run times, even though extensive local testing does not show the timing issue. I am not sure it is related to your issue, since I have yet to see one of my hanging submissions end with the InactiveRpcError message, but I figured I would link it here for additional information.

karkuspeter commented 3 years ago

Yes, my runtimes also vary dramatically, for both successful and failed submissions.

mathfac commented 3 years ago

I have had mixed success. One version of my agent succeeded, but a better version that uses a planner in a separate thread still failed (145633). The failed submission was made ~4 days ago and the container was running for 76 hours, but according to stderr the usual InactiveRpcError had already occurred after a couple of hours.

@karkuspeter, that should be fixed for submissions made after the 26th of May. Let us know if you see similar behavior.

@bucherb, @karkuspeter, the evaluation time reported in EvalAI also includes the time a submission spends waiting in the queue, which can explain the variability. If a submission is in the Running state for several days, then it really is running on the server for that long.

The ObjectNav evaluation can take 2+ days for the test-standard and test-challenge phases. We significantly scaled the evaluation backend recently, and the queue is pretty short.

bucherb commented 3 years ago

Thanks for the explanation! Will submissions to the CVPR challenges that finish execution after the June 8th challenge deadline (but were submitted before it) still be considered in the competition?

karkuspeter commented 3 years ago

Thanks @mathfac. Unfortunately, my last submission still ended with the same error, InactiveRpcError. Submission ID: 146436.

karkuspeter commented 3 years ago

Just to report back again: I tried submitting to the pointnav-test-challenge phase; the container was in the Running state for 58 hours and then failed at episode 104 with the usual InactiveRpcError.

mathfac commented 3 years ago

Will submissions to the CVPR challenges that finish execution after the June 8th challenge deadline (but were submitted before it) still be considered in the competition?

Yes, the deadline applies to submissions only; the evaluation can extend beyond it.

mathfac commented 3 years ago

@karkuspeter, thank you for reporting this. We are looking into it. Do you know if that container succeeds on the minival phase? For your information, we are extending the challenge deadline to the 7th of June.

karkuspeter commented 3 years ago

Thanks @mathfac. I have just double-checked: the same code succeeds on minival (see 146996).

mathfac commented 3 years ago

@RishabhJain2018 worked on rerunning your submission and had a detailed look.