No trial found by trial datastore

Traceback

-4a2d-9c8f-1eeae30d90fb\" found" grpc.code=Unknown grpc.method=RetrieveSamples grpc.service=cogment.TrialDatastoreSP grpc.start_time="2021-1
1-22T16:13:10Z" grpc.time_ms=0.085 peer.address="172.19.0.11:48948" span.kind=server system=grpc                                            
torch_agents_1     | ERROR:asyncio:Task exception was never retrieved                                                                       
torch_agents_1     | future: <Task finished coro=<RunSession._start_trials.<locals>.trials_samples_listener() done, defined at /base_python/
cogment_verse/run/run_session.py:94> exception=<AioRpcError of RPC that terminated with:                                                    
torch_agents_1     |    status = StatusCode.UNKNOWN                                                                                         
torch_agents_1     |    details = "no trial "075acedd-0687-4a2d-9c8f-1eeae30d90fb" found"                                                   
torch_agents_1     |    debug_error_string = "{"created":"@1637597590.904118764","description":"Error received from peer ipv4:172.19.0.3:900
1","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"no trial "075acedd-0687-4a2d-9c8f-1eeae30d90fb" found","grpc_statu
s":2}"                                                                                                                                      
torch_agents_1     | >>                                                                                                                     
torch_agents_1     | Traceback (most recent call last):                                                                                     
torch_agents_1     |   File "/base_python/cogment_verse/run/run_session.py", line 115, in trials_samples_listener                           
torch_agents_1     |     async for sample in sample_generator():                                                                            
torch_agents_1     |   File "/base_python/cogment_verse/trial_datastore_client.py", line 38, in sample_generator                            
torch_agents_1     |     async for rep_msg in rep_stream:                                                                                   
torch_agents_1     |   File "/usr/local/lib/python3.7/site-packages/grpc/aio/_call.py", line 321, in _fetch_stream_responses                
torch_agents_1     |     await self._raise_for_status()                                                                                     
torch_agents_1     |   File "/usr/local/lib/python3.7/site-packages/grpc/aio/_call.py", line 232, in _raise_for_status                      
torch_agents_1     |     self._cython_call.status())                                                                                        
torch_agents_1     | grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:                                                  
torch_agents_1     |    status = StatusCode.UNKNOWN                                                                                         
torch_agents_1     |    details = "no trial "075acedd-0687-4a2d-9c8f-1eeae30d90fb" found"                                                   
torch_agents_1     |    debug_error_string = "{"created":"@1637597590.904118764","description":"Error received from peer ipv4:172.19.0.3:900
1","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"no trial "075acedd-0687-4a2d-9c8f-1eeae30d90fb" found","grpc_statu
s":2}"                                                                                                                                      
torch_agents_1     | >

This seems to be a race condition or an eviction issue in the trial datastore, and it gets worse as the number of parallel trials increases.

After some investigation, the issue comes from the way RunSession._start_trials deals with trials that are started but take a long time to become known to the trial datastore. The current code gives each trial 5000 ms to generate its first sample (the default timeout for TrialDatastoreClient.retrieve_trials) and then starts listening to it no matter what. If a trial takes longer, the above exception is raised.
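
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the cogment-verse code: `FakeDatastore`, `TrialNotFound`, `wait_until_known` and `run_once` are hypothetical names, and the delays are scaled down from 5000 ms so the example runs quickly. It assumes the listener waits for the trial up to a timeout and then requests samples regardless.

```python
import asyncio


class TrialNotFound(Exception):
    """Hypothetical stand-in for the gRPC error 'no trial "<id>" found'."""


class FakeDatastore:
    """Hypothetical in-memory datastore: a trial is unknown until registered."""

    def __init__(self):
        self.known_trials = set()

    async def register_later(self, trial_id, delay):
        # Simulates a trial that takes `delay` seconds to reach the datastore.
        await asyncio.sleep(delay)
        self.known_trials.add(trial_id)

    async def retrieve_samples(self, trial_id):
        if trial_id not in self.known_trials:
            raise TrialNotFound(f'no trial "{trial_id}" found')
        return ["sample-0"]


async def wait_until_known(datastore, trial_id, timeout, poll=0.005):
    # Polls until the trial is known or the timeout expires, whichever is first.
    loop = asyncio.get_event_loop()
    deadline = loop.time() + timeout
    while trial_id not in datastore.known_trials and loop.time() < deadline:
        await asyncio.sleep(poll)


async def run_once(registration_delay, timeout):
    datastore = FakeDatastore()
    registration = asyncio.ensure_future(
        datastore.register_later("trial-a", registration_delay)
    )
    try:
        await wait_until_known(datastore, "trial-a", timeout)
        # Listening starts here no matter what, which is the problematic step.
        return await datastore.retrieve_samples("trial-a")
    except TrialNotFound as error:
        return str(error)
    finally:
        await registration
```

Calling `asyncio.run(run_once(0.01, 0.1))` returns the sample list, while `asyncio.run(run_once(0.1, 0.01))` returns the "no trial found" message, because the trial registers after the timeout has already expired.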

To sidestep the issue, simply increase the default timeout in TrialDatastoreClient.retrieve_trials.

The proper fix would solve two issues:

Make sure this kind of error bubbles up properly (here it was logged but otherwise ignored)
Deal properly with slow trials by retrying later
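
As a rough illustration of both points, here is a minimal sketch under stated assumptions. `TrialNotFound`, `listen_with_retry`, `run_listeners` and `make_flaky_stream` are hypothetical names, not part of cogment-verse, and `open_stream` stands in for whatever call opens the sample stream. The real listener consumes an async stream of samples; this sketch reduces it to a single awaitable to keep the retry logic visible.

```python
import asyncio


class TrialNotFound(Exception):
    """Hypothetical stand-in for the 'no trial found' gRPC error."""


async def listen_with_retry(open_stream, trial_id, retries=5, initial_delay=0.01):
    """Retry opening the sample stream while the trial is still unknown."""
    delay = initial_delay
    for attempt in range(retries):
        try:
            return await open_stream(trial_id)
        except TrialNotFound:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the failure
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff between attempts


async def run_listeners(open_stream, trial_ids):
    # Awaiting gather() re-raises a listener's exception in the caller,
    # instead of leaving it inside a task that nobody awaits.
    return await asyncio.gather(
        *(listen_with_retry(open_stream, trial_id) for trial_id in trial_ids)
    )


def make_flaky_stream(failures_before_success):
    """Hypothetical stream opener that fails a fixed number of times per trial."""
    calls = {}

    async def open_stream(trial_id):
        calls[trial_id] = calls.get(trial_id, 0) + 1
        if calls[trial_id] <= failures_before_success:
            raise TrialNotFound(f'no trial "{trial_id}" found')
        return f"samples for {trial_id}"

    return open_stream
```

With `make_flaky_stream(2)`, both listeners succeed on their third attempt. With `make_flaky_stream(10)`, the retries are exhausted and `run_listeners` raises `TrialNotFound` to its caller rather than only logging "Task exception was never retrieved".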
