Sporadic Cohort Save Issues

OHDSI / WebAPI

OHDSI WebAPI contains all OHDSI services that can be called from OHDSI applications

Apache License 2.0

Sporadic Cohort Save Issues #2335

Open alondhe opened 5 months ago

alondhe commented 5 months ago

Expected behavior

(Using WebAPI 2.14.0 / Atlas 2.14.1)

Cohort Definitions save cleanly every time.

Actual behavior

When updating a cohort definition, even with small changes such as the name, the save sporadically gets stuck.

This is hard to pin down, as it's not consistent. We can't find anything in the WebAPI logs or in the Postgres logs. The Chrome console shows that the in-flight PUT request for saving the updated cohort stalls out.

Steps to reproduce behavior

1. Create a cohort definition
2. Make some edits
3. Try to save it
4. See nothing happens

Tagging @konstjar

chrisknoll commented 5 months ago

Yes, one personal OKR for 2024 for me is addressing technical issues with Atlas/WebAPI, and transaction coordination (which may be the cause of these hangs) is something I'd like to look into.

When things hang on you, do you see any messages in the console (like 500 error responses)? Or, if you view the Postgres dashboard in pgAdmin, do you see any idle transactions or table locks?
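If pgAdmin is not at hand, the same check can be made against the standard `pg_stat_activity` view while a save is hanging. A minimal sketch in Python, assuming the rows have already been fetched as dictionaries by whatever driver you use; the helper name `find_stuck_sessions` and the 30-second threshold are illustrative assumptions, not anything from WebAPI:

```python
from datetime import datetime, timedelta, timezone

# Query to run against the WebAPI Postgres database while a save is hanging.
# pg_stat_activity and these columns are standard Postgres (9.6+).
STUCK_SESSIONS_SQL = """
SELECT pid, state, xact_start, wait_event_type, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
"""


def find_stuck_sessions(rows, now, threshold=timedelta(seconds=30)):
    """Return rows whose transaction has been open longer than `threshold`
    and that are either idle in transaction or waiting on a lock.

    `rows` is a list of dicts with the columns selected above (hypothetical
    shape; adapt to your driver's row type).
    """
    stuck = []
    for row in rows:
        age = now - row["xact_start"]
        idle_in_tx = row["state"] == "idle in transaction"
        waiting_on_lock = row["wait_event_type"] == "Lock"
        if age > threshold and (idle_in_tx or waiting_on_lock):
            stuck.append(row)
    return stuck
```

A session that stays `idle in transaction` for the duration of the hang, or one whose `wait_event_type` is `Lock`, would point toward transaction coordination rather than the browser.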

alondhe commented 5 months ago

That's the frustrating thing! We've found nothing in the Chrome console, WebAPI, or PG.

konstjar commented 5 months ago

I would add a few thoughts on this after a review with @alondhe.

The only related stack trace we saw in the WebAPI logs shows that WebAPI was not able to clear the cohort cache because the connection to the OMOP datasource timed out. It was a temporary disconnection.

I assume these steps happen with each "Save" action in WebAPI:

1. creates a new version record
2. clears the cache in the datasource results schema
3. saves the design of the cohort

I do not know the exact order; it's just an assumption.

At step 2, WebAPI fails because of the timeout, and that caused the whole "Save" process to fail. I think we can reproduce it easily.
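To make the hypothesis concrete, here is a toy model of the three assumed steps in Python. It is a sketch under the assumptions above, not WebAPI code: `save_cohort`, `DatasourceTimeout`, and the `best_effort_cache` flag are hypothetical names, the step order is the guessed one, and it only models how a cache-clear failure could propagate to the whole save (it does not model the request hanging).

```python
class DatasourceTimeout(Exception):
    """Stand-in for a timeout while connecting to the OMOP datasource."""


def save_cohort(store, cohort_id, design, clear_cache, best_effort_cache=False):
    """Run the three assumed save steps and return a list of warnings.

    `store` is a dict standing in for the WebAPI database; `clear_cache` is a
    callable standing in for the cache clear in the results schema.
    """
    warnings = []
    staged = dict(store)  # stage changes so a failure leaves `store` untouched

    # Step 1 (assumed): create a new version record from the current design.
    staged["versions"] = list(store.get("versions", [])) + [store.get("design")]

    # Step 2 (assumed): clear the cache in the datasource results schema.
    try:
        clear_cache(cohort_id)
    except DatasourceTimeout as exc:
        if not best_effort_cache:
            raise  # the whole save fails, as hypothesized
        warnings.append(f"cache not cleared: {exc}")

    # Step 3 (assumed): save the design of the cohort, then "commit".
    staged["design"] = design
    store.clear()
    store.update(staged)
    return warnings
```

With `best_effort_cache=False`, a timeout in step 2 aborts the save and nothing is persisted. With `best_effort_cache=True`, the design is saved and the stale cache is reported as a warning, which is one possible way to decouple the save from datasource availability.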

So, it might be related to #2334