timeouts with memcache make responses very slow

jeffbl commented 2 months ago

@VenissaCarolQuadros was using pegasus, and queries were either failing or coming back very slowly. Looks like problem with memcached:

image-pegasus-cim-mcgill-ca-1  | 2024-08-15T17:38:21.591534258Z 50.17.185.102 - - [15/Aug/2024:17:38:21 +0000] "GET / HTTP/1.1" 200 10042 "-" "FreshpingBot/1.0 (+https://freshping.io/)"
image-pegasus-cim-mcgill-ca-1  | 2024-08-15T17:38:26.450476117Z 50.17.185.102 - - [15/Aug/2024:17:38:26 +0000] "GET / HTTP/1.1" 200 10042 "-" "FreshpingBot/1.0 (+https://freshping.io/)"
orchestrator-1                 | 2024-08-15T17:38:35.286873052Z Received request                                                                                                              
orchestrator-1                 | 2024-08-15T17:38:35.293736219Z Running preprocessors in parallel...                                                                                          
orchestrator-1                 | 2024-08-15T17:38:35.293747550Z Now on priority group 1                                                                                                       
orchestrator-1                 | 2024-08-15T17:38:37.497835237Z MemJS: Server <memcached:11211> failed after (2) retries with error - socket timed out connecting to server.                  
orchestrator-1                 | 2024-08-15T17:38:37.497861847Z MemJS: Server <memcached:11211> failed after (2) retries with error - socket timed out connecting to server.
orchestrator-1                 | 2024-08-15T17:38:37.497866175Z Error getting response from the cache                                                                                         
orchestrator-1                 | 2024-08-15T17:38:37.497869351Z Error getting response from the cache                                          
orchestrator-1                 | 2024-08-15T17:38:37.497872237Z Sending to preprocessor "content-categoriser"                                                                                 
orchestrator-1                 | 2024-08-15T17:38:37.498872704Z Sending to preprocessor "autour-preprocessor"                
content-categoriser-1          | 2024-08-15T17:38:40.297891924Z [2024-08-15 17:38:40 +0000] [7] [DEBUG] POST /preprocessor
content-categoriser-1          | 2024-08-15T17:38:40.298053719Z DEBUG:root:Received request                                                                                                   
autour-preprocessor-1          | 2024-08-15T17:38:40.298104585Z [2024-08-15 17:38:40 +0000] [7] [DEBUG] POST /preprocessor                                                                    
autour-preprocessor-1          | 2024-08-15T17:38:40.298246122Z DEBUG:root:Received request                                                                                                   
autour-preprocessor-1          | 2024-08-15T17:38:40.301946883Z INFO:root:Not map content. Skipping...                                                                                        
content-categoriser-1          | 2024-08-15T17:38:40.304762354Z DEBUG:fsspec.local:open file: /app/latest-0.ckpt                                                                              content-categoriser-1          | 2024-08-15T17:38:40.324916653Z Lightning automatically upgraded your loaded checkpoint from v1.4.4 to v1.9.0. To apply the upgrade to your files permanently,
 run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file latest-0.ckpt`                                                                                                          
content-categoriser-1          | 2024-08-15T17:38:40.434795840Z DEBUG:root:{'category': 'photograph'}                                                                                         
orchestrator-1                 | 2024-08-15T17:38:40.436052550Z Saving Response for ca.mcgill.a11y.image.preprocessor.contentCategoriser in cache with key 0eb2674f6a0d2c12811f7566b297c1f67f4
87f5e                                                                                                                                                                                         
orchestrator-1                 | 2024-08-15T17:38:40.436065535Z storing data in memcache with key 0eb2674f6a0d2c12811f7566b297c1f67f487f5e
orchestrator-1                 | 2024-08-15T17:38:40.436139194Z Now on priority group 2
orchestrator-1                 | 2024-08-15T17:38:42.638187021Z MemJS: Server <memcached:11211> failed after (2) retries with error - socket timed out connecting to server.
orchestrator-1                 | 2024-08-15T17:38:42.638214302Z MemJS: Server <memcached:11211> failed after (2) retries with error - socket timed out connecting to server.
orchestrator-1                 | 2024-08-15T17:38:42.638218410Z MemJS: Server <memcached:11211> failed after (2) retries with error - socket timed out connecting to server.
orchestrator-1                 | 2024-08-15T17:38:42.638237226Z Error setting response in the cache
orchestrator-1                 | 2024-08-15T17:38:42.638240331Z Error getting response from the cache
orchestrator-1                 | 2024-08-15T17:38:42.638243317Z Error getting response from the cache
orchestrator-1                 | 2024-08-15T17:38:42.638248296Z Saved Response for ca.mcgill.a11y.image.preprocessor.contentCategoriser in cache with key 0eb2674f6a0d2c12811f7566b297c1f67f48
7f5e
orchestrator-1                 | 2024-08-15T17:38:42.638251402Z Sending to preprocessor "graphic-tagger"
orchestrator-1                 | 2024-08-15T17:38:42.639293909Z Sending to preprocessor "nominatim-preprocessor"
graphic-tagger-1               | 2024-08-15T17:38:45.441980475Z [2024-08-15 17:38:45 +0000] [7] [DEBUG] POST /preprocessor
graphic-tagger-1               | 2024-08-15T17:38:45.442154032Z DEBUG:root:Received request
nominatim-preprocessor-1       | 2024-08-15T17:38:45.442733536Z Received request
nominatim-preprocessor-1       | 2024-08-15T17:38:45.442835167Z Coordinates not available, cannot make a request for reverse geocode.
graphic-tagger-1               | 2024-08-15T17:38:45.446060972Z DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): canadacentral.api.cognitive.microsoft.com:443
graphic-tagger-1               | 2024-08-15T17:38:45.681472702Z DEBUG:urllib3.connectionpool:https://canadacentral.api.cognitive.microsoft.com:443 "POST /vision/v1.0/analyze?visualFeatures=C
ategories HTTP/11" 200 163
graphic-tagger-1               | 2024-08-15T17:38:45.682043389Z DEBUG:root:{'category': 'other'}
orchestrator-1                 | 2024-08-15T17:38:45.682722429Z Saving Response for ca.mcgill.a11y.image.preprocessor.graphicTagger in cache with key 97181cadeaf306f075c18cd97827d2205dd3f1df
orchestrator-1                 | 2024-08-15T17:38:45.682736195Z storing data in memcache with key 97181cadeaf306f075c18cd97827d2205dd3f1df
orchestrator-1                 | 2024-08-15T17:38:45.682798563Z Now on priority group 3
orchestrator-1                 | 2024-08-15T17:38:47.844376065Z MemJS: Server <memcached:11211> failed after (2) retries with error - socket timed out connecting to server.
orchestrator-1                 | 2024-08-15T17:38:47.844408026Z MemJS: Server <memcached:11211> failed after (2) retries with error - socket timed out connecting to server.
orchestrator-1                 | 2024-08-15T17:38:47.844412123Z MemJS: Server <memcached:11211> failed after (2) retries with error - socket timed out connecting to server.
orchestrator-1                 | 2024-08-15T17:38:47.844415440Z MemJS: Server <memcached:11211> failed after (2) retries with error - socket timed out connecting to server.
orchestrator-1                 | 2024-08-15T17:38:47.844422814Z MemJS: Server <memcached:11211> failed after (2) retries with error - socket timed out connecting to server.
orchestrator-1                 | 2024-08-15T17:38:47.844454684Z Error setting response in the cache
orchestrator-1                 | 2024-08-15T17:38:47.844459312Z Error getting response from the cache
orchestrator-1                 | 2024-08-15T17:38:47.844463350Z Error getting response from the cache
orchestrator-1                 | 2024-08-15T17:38:47.844466716Z Error getting response from the cache
orchestrator-1                 | 2024-08-15T17:38:47.844469852Z Error getting response from the cache
orchestrator-1                 | 2024-08-15T17:38:47.844472928Z Saved Response for ca.mcgill.a11y.image.preprocessor.graphicTagger in cache with key 97181cadeaf306f075c18cd97827d2205dd3f1df
orchestrator-1                 | 2024-08-15T17:38:47.844476535Z Sending to preprocessor "depth-map-generator"

jeffbl commented 2 months ago

Looks like problem is that container memcached simply was not running, so when the orchestrator was asking for cached data, there was no response, and it timed out each request. With queries from different run groups, this meant a number of non-overlapping timeouts, and thus a very slow response time. Starting memcached container solves problem. Questions:

@jaydeepsingh25 memcached doesn't seem to log any output, so not sure if it is a problem in the container that caused a crash, or if it simply failed to start when we rebooted after the power outage last week.
@JRegimbal rather than timing out repeatedly, is there anything smarter the orchestrator could do in case memcached goes bad? The irony here is that the sole purpose of caching is to speed things up, but in this case, it slowed things waaaaay down. :)
We really need a health check implemented for our containers since we don't have enough manual test coverage to notice when things like this happen.

JRegimbal commented 2 months ago

The only thing that comes to mind is that we check if the service name used is a running docker service on the same network before attempting to connect, either on every attempt or the first. Depending on how the request to the cache is implemented, we could also reduce the timeout, but if this is buried in a library it may not be possible.

JRegimbal commented 2 months ago

Discussion from this week's meeting:

a health check (#802) is important in catching if the memcache service is in a good state and restarting it otherwise.
a check that the docker service is running, or just pinging it quickly/adjusting the timeout is a good way to ensure that we don't waste time trying to reach a service that isn't available.
With the assumption that the healthcheck keeps the service in a good state when up and the docker daemon query/quick HTTP request tells if it is up, we should be able to avoid hanging on memcache as can happen now.
These are not quick fixes. Issue should be moved back from backlog when #802 is closed.

Shared-Reality-Lab / IMAGE-server

timeouts with memcache make responses very slow #869