akvo / akvo-flow-api

Akvo Flow API
GNU Affero General Public License v3.0
6 stars 3 forks source link

Retrieving form times out when contacting flow backend via remote API #231

Closed muloem closed 4 years ago

muloem commented 4 years ago

Context

When attempting to retrieve forms for a specific survey via the api, we get a timeout error and the form retrieval fails.

Problem or idea

Looking further into the issue, we see that the request that the flow api makes to the flow backend takes too long to resolve thus resulting in the timeout. The part of the API that retrieves form data is duplicating work done by the flow DAO classes. They already return question information but then the API retrieves this information again.

Solution or next step

Remove the duplicated task.

janagombitova commented 4 years ago

@muloem update from the support thread that started this issue:

Hey Jana,

So I completed this fix and it was merged but the issue persists so I looked again and there is still more optimisation to be done. The root cause of the issue is that there is a timeout when attempting to retrieve surveys and associated forms. And that seems to be connected to the number of questions that a survey has. One of the issues I pointed out was that we retrieve questions twice. Today I also looked into the structure of querying and it it seems that the surveys on this particular instance are suffering from having loads of option questions. In the cases Molo is struggling with, we have a survey with 94 option questions (almost 50%) of the total questions.

So the API works like this

  • retrieve the general survey definition from the datastore (1 request)
  • retrieve the form definition from the datastore (1 request to get all forms)
  • for each form retrieve the question groups (1 request per form for the question groups)
  • for each question group retrieve the questions (1 request for each question group)
  • for each of the questions that are option questions retrieve the options (1 request per option question)

Which means in our case, the last request happens 94 times! that end up in a timeout and so the request is cancelled. We need to unfortunately dedicate sometime to improving this as well.

I hope this helps to explain but there is no solution at the moment. :(

tangrammer commented 4 years ago

IMO It seems that our GAE lifewater application is taking lot time to answer requests

Screenshot 2020-08-21 at 10 46 16

https://console.cloud.google.com/logs/viewer?project=akvoflow-73&authuser=1&resource=gae_app%2Fmodule_id%2Fdefault%2Fversion_id%2F77cb97915574009db2c9be27a708e321f443ca67&minLogLevel=0&expandAll=false&customFacets&limitCustomFacetWidth=true&dateRangeStart=2020-08-21T02%3A36%3A22.073Z&dateRangeEnd=2020-08-21T08%3A36%3A22.073Z&interval=PT6H&logName=projects%2Fakvoflow-73%2Flogs%2Fappengine.googleapis.com%252Frequest_log&scrollTimestamp=2020-08-21T08%3A17%3A01.150525000Z&advancedFilter=resource.type%3D%22gae_app%22%0Aresource.labels.module_id%3D%22default%22%0Aresource.labels.version_id%3D%2277cb97915574009db2c9be27a708e321f443ca67%22%0AlogName%3D%22projects%2Fakvoflow-73%2Flogs%2Fappengine.googleapis.com%252Frequest_log%22%0AprotoPayload.latency%3E%3D%227s%22

Looking into some of these slow responses we can see that app is not running at the time the request is done so it takes a lot of time to start again ... but a normal API does'n wait for so many time so the connection is closed by client generating the 499 Http status code

Screenshot 2020-08-21 at 10 27 25 Screenshot 2020-08-21 at 10 27 39

Perhaps @dlebrero, next week, could shed some light on that (re)starting app topic

BTW: after trying a few times

curl --silent \
     --header "Content-Type: application/json" \
     --header "Accept: application/vnd.akvo.flow.v2+json" \
     --header "Authorization: Bearer $token" \
     --url "https://api-auth0.akvotest.org/flow/orgs/lifewater/surveys/736820982" | jq -M '.'

suddenly it started to work fine

dlebrero commented 4 years ago

What is the survey id that is failing?

dlebrero commented 4 years ago

Just realised that is in the url.

janagombitova commented 4 years ago

@dlebrero the issue started with trying to import a dataset to Lumen: idh.akvoflow.org > Farmfit surveys > Cases 2020 > Batian Nuts (Kenya) > Labour cost update (ID:100150011)

And then the Lifewater issue came in.

dlebrero commented 4 years ago

The issue is a mix of both what @muloem and @tangrammer mention.

flow-api is very chatty when talking with the Google Datastore, which generates a ton of load. GAE sees all the load and decides to create more GAE instances, which sometimes takes several seconds to start.

Note that sometimes GAE calls the /_ah/warmup endpoint before the new instance receives traffic, but other times it does not

Being practical, I see from the logs that for this particular survey, most of the time flow-api is able to answer within 60 seconds, but k8s defaults to a 30 secs timeout:

Screenshot 2020-08-24 at 13 10 41

I would suggest to change the flow-api k8s config to 60 seconds for now. I think @tangrammer knows who to do this as we had this issue before, but let me know if you need help.