Closed by muloem 4 years ago
@muloem update from the support thread that started this issue:
Hey Jana,
So I completed this fix and it was merged, but the issue persists, so I looked again and there is still more optimisation to be done. The root cause is a timeout when attempting to retrieve surveys and their associated forms, and that seems to be connected to the number of questions a survey has. One of the issues I pointed out was that we retrieve questions twice. Today I also looked into the structure of the querying, and it seems that the surveys on this particular instance are suffering from having loads of option questions. In the cases Molo is struggling with, we have a survey with 94 option questions (almost 50% of the total questions).
So the API works like this:
- retrieve the general survey definition from the datastore (1 request)
- retrieve the form definition from the datastore (1 request to get all forms)
- for each form retrieve the question groups (1 request per form for the question groups)
- for each question group retrieve the questions (1 request for each question group)
- for each of the questions that are option questions retrieve the options (1 request per option question)
Which means that in our case the last request happens 94 times! That ends up in a timeout, and so the request is cancelled. Unfortunately, we need to dedicate some time to improving this as well.
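To make the cost of the steps above concrete, here is a rough sketch that counts datastore round trips for one survey. The function name and the example values for forms and question groups are assumptions for illustration only; the thread only states the 94 option questions.

```python
def datastore_requests(forms, groups_per_form, option_questions):
    """Count datastore round trips for one survey under the access pattern listed above.

    Hypothetical helper: it assumes every form has the same number of
    question groups, which is a simplification.
    """
    survey = 1                                   # general survey definition
    form_list = 1                                # one request returns all forms
    group_queries = forms                        # one request per form for its question groups
    question_queries = forms * groups_per_form   # one request per question group
    option_queries = option_questions            # one request per option question
    return survey + form_list + group_queries + question_queries + option_queries


# Assumed shape: 1 form with 10 question groups, plus the 94 option questions
# mentioned in the thread.
print(datastore_requests(forms=1, groups_per_form=10, option_questions=94))  # 107
```

Whatever the real form and group counts are, the per-option-question term dominates, so that is the term worth batching first.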
I hope this helps to explain but there is no solution at the moment. :(
IMO it seems that our GAE lifewater application is taking a long time to answer requests.
Looking into some of these slow responses, we can see that the app is not running at the time the request is made, so it takes a long time to start again ... but a normal API client doesn't wait that long, so the connection is closed by the client, generating the 499 HTTP status code.
Perhaps @dlebrero could shed some light next week on that app (re)starting topic.
BTW: after trying a few times
curl --silent \
--header "Content-Type: application/json" \
--header "Accept: application/vnd.akvo.flow.v2+json" \
--header "Authorization: Bearer $token" \
--url "https://api-auth0.akvotest.org/flow/orgs/lifewater/surveys/736820982" | jq -M '.'
suddenly it started to work fine
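The pattern seen with the curl call above (a few failures, then success once an instance is up) is what a client-side retry with backoff would absorb. This is only a sketch of that workaround, not something flow-api or its clients are known to do; `fetch_with_retry` and its parameters are hypothetical names.

```python
import time


def fetch_with_retry(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out, doubling the delay each time.

    `fetch` is any zero-argument callable that raises on failure, for example
    a function wrapping the HTTP request shown in the curl command.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # a real client would catch only timeout/connection errors
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... between tries
    raise last_error
```

Retrying hides cold starts from the caller but adds load to a backend that is already slow, so it complements rather than replaces the server-side fix.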
What is the survey id that is failing?
Just realised that is in the url.
@dlebrero the issue started with trying to import a dataset to Lumen: idh.akvoflow.org > Farmfit surveys > Cases 2020 > Batian Nuts (Kenya) > Labour cost update (ID:100150011)
And then the Lifewater issue came in.
The issue is a mix of both what @muloem and @tangrammer mention.
flow-api is very chatty when talking with the Google Datastore, which generates a ton of load. GAE sees all the load and decides to create more GAE instances, which sometimes takes several seconds to start.
Note that sometimes GAE calls the /_ah/warmup endpoint before the new instance receives traffic, but other times it does not.
Being practical, I see from the logs that for this particular survey, most of the time flow-api is able to answer within 60 seconds, but k8s defaults to a 30-second timeout.
I would suggest changing the flow-api k8s config to 60 seconds for now. I think @tangrammer knows how to do this, as we had this issue before, but let me know if you need help.
Context
When attempting to retrieve forms for a specific survey via the api, we get a timeout error and the form retrieval fails.
Problem or idea
Looking further into the issue, we see that the request the flow API makes to the flow backend takes too long to resolve, which results in the timeout. The part of the API that retrieves form data duplicates work already done by the flow DAO classes: they already return question information, but the API then retrieves this information again.
Solution or next step
Remove the duplicated task.
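As a rough illustration of what removing the duplicated task means, the sketch below contrasts fetching questions twice with reusing what the DAO already returned. Every name here is a hypothetical stand-in (the real flow DAO classes and API handlers are not shown in this thread), and the counter simply simulates datastore queries.

```python
calls = {"question_queries": 0}  # simulated count of question queries


def dao_load_form(form_id):
    """Hypothetical DAO call: returns the form with its questions already attached."""
    calls["question_queries"] += 1
    return {"id": form_id, "questions": [{"id": 1, "type": "OPTION"}]}


def dao_load_questions(form_id):
    """Hypothetical second lookup that repeats work dao_load_form already did."""
    calls["question_queries"] += 1
    return [{"id": 1, "type": "OPTION"}]


def form_response_duplicated(form_id):
    # Before: the API layer asks for the questions again.
    form = dao_load_form(form_id)
    return {"id": form["id"], "questions": dao_load_questions(form_id)}


def form_response(form_id):
    # After: reuse the questions that came back with the form.
    form = dao_load_form(form_id)
    return {"id": form["id"], "questions": form["questions"]}
```

Both functions return the same payload; the second simply issues one question query instead of two.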