IFRCGo / go-api


[PROD] Performance issue - extremely slow loading of riskwatch data on GO - please investigate #2184

Open nanometrenat opened 1 week ago

nanometrenat commented 1 week ago

Issue

Risk Watch API calls are taking much too long, around two whole minutes to load the "Countries by Risk" data. For example, for Africa I went to https://go.ifrc.org/regions/0/risk-watch/seasonal and the page itself loaded quickly, but "Countries by Risk" was stuck showing as loading. In DevTools I can see that https://go-risk.northeurope.cloudapp.azure.com/api/v1/seasonal/?region=0 and https://go-risk.northeurope.cloudapp.azure.com/api/v1/risk-score/?region=0&limit=9999 each took two minutes. See the DevTools screenshots below.

Similarly, if I am on the Imminent events page (https://go.ifrc.org/regions/0/risk-watch/imminent) and select one of the countries' events, it takes more than 8 seconds to load that one event (though in this case I can see the request is queuing for a while before it is sent, not sure what that means). The page calls https://go-risk.northeurope.cloudapp.azure.com/api/v1/pdc/99638/exposure/ - took

I have been doing Teams calls etc. on this same internet connection, and other parts of GO work fine, so I'm not sure why this bit of GO is so slow.

Thanks for your help investigating! cc @justinginnetti

Screenshots etc.

[DevTools screenshots showing the slow API requests.]

I have attached my .har file in Teams if useful for investigating.

Expected behaviour

Not sure of our SLA for API responses these days, but I think this is too long in any case!

Thanks loads

tovari commented 1 week ago

@thenav56, @szabozoltan69, I'm not sure if the recent disk storage issue could be the reason for this?

@nanometrenat, do you still experience such long response times?

szabozoltan69 commented 1 week ago

> @thenav56, @szabozoltan69, I'm not sure if the recent disk storage issue could be the reason for this?

It could be the reason. After freeing up more space, the two queries mentioned above run fast.

nanometrenat commented 1 week ago

> @nanometrenat, do you still experience such long response times?

Hi there, it seems fine today - loading quickly as I would expect. Does the timing of the ticket correlate with when there were storage space issues? If so, that presumably explains it! Thanks

szabozoltan69 commented 1 week ago

> Does the timing of the ticket correlate with when there were storage space issues?

Yes, I think so. Though no ticket was created for that; it was only discussed with @thenav56.

nanometrenat commented 1 week ago

Great that the incident earlier this week was resolved swiftly!

@szabozoltan69 @thenav56 is the root cause also resolved, i.e. is monitoring in place so we get alerted and can fix it in advance next time? If so, I will happily close this ticket - thanks again.

thenav56 commented 5 days ago

Hey @nanometrenat @szabozoltan69 @tovari,

We had some issues with the background tasks running on the same server as the API server. A memory leak in the background tasks affected the API server. We've added memory usage limits to the workers, which should fix the issue.
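For reference, a minimal sketch of what a per-worker memory cap can look like, assuming the background tasks are Celery workers; the app name, broker URL, and limit values below are placeholders rather than go-api's actual configuration:

```python
# celery_memory_limits.py - illustrative sketch only; names and values
# are assumptions, not the real go-api worker configuration.
from celery import Celery

app = Celery("go_api_tasks", broker="redis://localhost:6379/0")  # placeholder broker

app.conf.update(
    # Recycle a worker child once it exceeds ~512 MiB of resident memory
    # (the setting is expressed in KiB), so a leaking task cannot keep
    # growing and starve the API server on the same machine.
    worker_max_memory_per_child=512_000,
    # Also recycle children after a fixed number of tasks as a safety net
    # against slow leaks.
    worker_max_tasks_per_child=100,
)
```

The same cap can also be passed on the command line with `celery -A go_api_tasks worker --max-memory-per-child=512000`.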

We also added swap, which impacted disk storage. This was fixed by using the temporary disk provided by Azure, as suggested by @szabozoltan69.

We've also been working on fixing the memory leak and are currently testing this in nightly. We've integrated Sentry profiling and cron monitoring, and we'll be pushing these changes to staging and production soon.
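As an illustration of the kind of instrumentation described above, here is a hedged sketch using the Sentry Python SDK; the DSN, sample rates, task name, and monitor slug are placeholders, not the project's real settings:

```python
# sentry_setup.py - illustrative sketch; the DSN, sample rates, and the
# monitor slug below are placeholders, not go-api's actual configuration.
import sentry_sdk
from sentry_sdk.crons import monitor

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    traces_sample_rate=0.1,    # sample a fraction of transactions for tracing
    profiles_sample_rate=0.1,  # profile the sampled transactions
)

@monitor(monitor_slug="risk-score-ingest")  # hypothetical cron job name
def ingest_risk_scores():
    """Periodic task; each run reports check-ins to Sentry cron monitoring."""
    ...
```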

Let's keep this ticket open for now. Once we've pushed the changes to production, we can revisit and close it đŸ˜„