Anthony-Nolan / Atlas

A free & open-source Donor Search Algorithm Service
GNU General Public License v3.0

Investigate long-running searches on WMDA #960

Open zabeen opened 1 year ago

zabeen commented 1 year ago

@mmelchers reported that several searches for WMDA-DEV-ATLAS initiated on the morning of 27/04/2023 had not completed.

Checking the search orchestrator that same afternoon showed that several jobs were still running. By the morning of the next day, all searches had completed. The notifications showed the following info for each search (match prediction time, overall search time, number of donors):

Searches with "normal" match prediction times but long overall search times:
- 6174b9b2-2e6d-400f-b3fc-59ab67844903 (1h, 15h, 12K)
- 269f3676-52b8-4733-b772-effd11a5a175 (1h, 7h, 311K)
- a296d4ee-f0ad-4ccb-94bf-4ebf8378ae66 (5h, 15h, 592K)
- 0294a442-e2f4-4a54-805c-fb17ed77d97f (1h, 15h, 219K)
- 4cbeef6e-653c-48aa-9425-1616d707a35a (1h, 15h, 185K)
- 3487cb37-ee5d-4ec3-a9a7-222068013ccf (1h, 15h, 14K)

Searches with "long" match prediction times:
- 38730179-432d-40b9-bb5d-3ec040448a7e (14h, 15h, 907K)
- 23416a12-b0d0-4020-aa6e-79e20e84a0e5 (14h, 16h, 592K)
- 97917fd5-e468-4599-8448-3e44c899dd86 (15h, 15h, 202K)
- 2b4bf10a-20cb-4c85-9cf5-08013595dd9f (14h, 15h, 907K)
- e147ce44-3692-477d-a092-3a648708e1d5 (14h, 15h, 460K)

All the match prediction times seem consistent with the numbers of donors being processed; however, several searches have a very high search orchestration time (overall time minus match prediction time). As these searches were all running at the same time, this is possibly due to a lack of free activity functions available to upload the search results. This needs to be confirmed by interrogating the logs further.
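As a quick sanity check on the split between the two groups, the orchestration time defined above can be recomputed from the reported figures. This is a minimal Python sketch, assuming the rounded hours and approximate donor counts from the notifications are accurate enough for a rough comparison; the function names and the 2-hour cut-off are hypothetical and illustrative only, not part of Atlas.

```python
# Figures transcribed from the notifications above:
# (search id, match prediction hours, overall search hours, approx. donor count).
# Hours are the rounded values reported, so results are only indicative.
searches = [
    ("6174b9b2-2e6d-400f-b3fc-59ab67844903", 1, 15, 12_000),
    ("269f3676-52b8-4733-b772-effd11a5a175", 1, 7, 311_000),
    ("a296d4ee-f0ad-4ccb-94bf-4ebf8378ae66", 5, 15, 592_000),
    ("0294a442-e2f4-4a54-805c-fb17ed77d97f", 1, 15, 219_000),
    ("4cbeef6e-653c-48aa-9425-1616d707a35a", 1, 15, 185_000),
    ("3487cb37-ee5d-4ec3-a9a7-222068013ccf", 1, 15, 14_000),
    ("38730179-432d-40b9-bb5d-3ec040448a7e", 14, 15, 907_000),
    ("23416a12-b0d0-4020-aa6e-79e20e84a0e5", 14, 16, 592_000),
    ("97917fd5-e468-4599-8448-3e44c899dd86", 15, 15, 202_000),
    ("2b4bf10a-20cb-4c85-9cf5-08013595dd9f", 14, 15, 907_000),
    ("e147ce44-3692-477d-a092-3a648708e1d5", 14, 15, 460_000),
]


def orchestration_hours(match_prediction_h, overall_h):
    """Orchestration time as defined in this issue: overall time minus match prediction time."""
    return overall_h - match_prediction_h


def flag_slow_orchestration(rows, threshold_h=2):
    """Return (search id, orchestration hours) for searches whose orchestration
    time exceeds threshold_h. The default threshold is a hypothetical cut-off."""
    flagged = []
    for search_id, mp_h, overall_h, _donors in rows:
        orch_h = orchestration_hours(mp_h, overall_h)
        if orch_h > threshold_h:
            flagged.append((search_id, orch_h))
    return flagged


if __name__ == "__main__":
    for search_id, orch_h in flag_slow_orchestration(searches):
        print(f"{search_id}: ~{orch_h}h outside match prediction")
```

With these numbers, only the six searches in the first group exceed the cut-off (roughly 6 to 14 hours outside match prediction), while the long-match-prediction searches spend at most about 2 hours in orchestration. That pattern fits, but does not by itself prove, the activity-function contention hypothesis.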

zabeen commented 1 year ago

@seanmobrien I have assigned this investigation to you, but please reassign it to whoever is available on your team. Unfortunately, #956 means that not all log entries may be tagged with the search request ID, but the traces should all be in Application Insights (AI) "somewhere". Let me know if you need more pointers on querying AI.

zabeen commented 1 year ago

I have moved this to the backlog for when we start load testing.