tradefed: use PluginScratch to pass long strings to tasks

mwasilew commented 4 years ago

This patch splits the test case extraction task into smaller chunks and also uses PluginScratch from squad.core.models to pass the pieces of XML from test_results.xml. This allows to avoid passing long strings via message queue broker and prevents SQUAD from prematurely posting build notifications.

Signed-off-by: Milosz Wasilewski milosz.wasilewski@linaro.org

chaws commented 3 years ago

LGTM!

@mwasilew are you planning on doing any extra work on this PR? I can merge it and we can see it in action in staging

mwasilew commented 3 years ago

@chaws yes, I was planning to add a fail safe task that would fire in case of error. But I guess we can roll it out as it is now and test in staging. Just need to add proper dependencies in setup.py.

chaws commented 3 years ago

@mwasilew I'm seeing a number of PluginScratch objects for a given build after fetch is done, it's varying at 10k left over objects. I still can't figure out what's going on, just left this comment here to resume the search tomorrow.

I'm seeing this on rabbitmq node vm_memory_high_watermark set. Memory used:1204260584 allowed:823053516

chaws commented 3 years ago

More on that: https://stackoverflow.com/questions/35808354/celery-and-rabbitmq-eventually-stopping-due-to-memory-exhaustion

chaws commented 3 years ago

I think we have a bug. It's easier to explain with an example: Let's start with one fetch-worker pod, then autoscaler fires up a new fetch-worker. Now there's two pods.

If autoscaler judges there's no more load, it scales down killing one of the fetch-worker pods.

The thing is if the pod that got killed is in the middle of a fetch job using tradefed plugin, it might get killed in the middle of generating create_testcases tasks right in the time it's creating PluginScratch objects.

I spot that by querying a PluginScratch.objects.filter(build=build).count() constant after a scale down event, and no task was added to the queue because line https://github.com/Linaro/squadplugins/blob/master/tradefed/__init__.py#L250 was never reached. Lucky thing is that the job will be re-fetched in the next poll

I need to figure out a way to prevent k8s from killing a working pod

mwasilew commented 3 years ago

Good catch. I don't think there is an easy way to prevent this in the code.

chaws commented 3 years ago

I'll try this later today: https://itnext.io/k8s-prevent-queue-worker-pod-from-being-killed-during-deployment-4252ea7c13f6

mwasilew commented 3 years ago

@chaws I guess this means we need a signal handling code in our tasks, right?

chaws commented 3 years ago

@mwasilew i'm not sure yet, I think SIGTERM is passed to celery thru the container, we just don't tell kubernetes to wait for it and it abruptly unschedule the pod. I think celery should be smart enough to know it should stop receiving new tasks and finish the current ones whenever it receives SIGTERM.

chaws commented 3 years ago

OK, after some debugging, I feel I understand it more.

Celery does the right thing: it receives SIGTERM from k8s, then it shuts down all connection pools, stops receiving new tasks and continue working on ongoing tasks. The problem is that by default k8s will wait for it to terminate, and it waits 30 seconds, which I think is where we should fix it.

My idea for this is increasing terminationGracePeriodSeconds, which defaults to 30 seconds, to lets say 1 hour. If celery finished it earlier, then the pod will be terminated before that period. What do you think?

mwasilew commented 3 years ago

@chaws if there is no way of defining the grace period dynamically from within the pod, let's go with 1h. I don't think there would be tasks longer than 1h after this patch.

chaws commented 3 years ago

@mwasilew i don't think either, but I think it's better to be safe :) Here's the PR for that https://github.com/Linaro/qa-reports.linaro.org/pull/69

I applied a grace period of 1h to staging workers and it worked like a charm (https://staging-qa-reports.linaro.org/~charles.oliveira/test-android-cts-parallel-processing/build/raised-graceperiod/). It finished processing all 157k-ish tasks and all PluginScratch objects were removed. I just didn't see project status being updated, still investigating that.

PS: during my research, I was able to see this bug https://github.com/celery/celery/issues/1700. I think we won't see it again due to raised grace period, but it's convenient to know this might happen

chaws commented 3 years ago

Another thing to keep in mind is that Celery creates persistent queues by default.

From https://docs.celeryproject.org/en/stable/userguide/optimizing.html#using-transient-queues

Queues created by Celery are persistent by default. This means that the broker will write messages to disk to ensure that the tasks will be executed even if the broker is restarted.

After all 157k-ish messages are added to the queue, rabbitmq goes to a stable memory consumption state. I think the problem with memory spike we're having happens before that, when "adding" those messages to queue. Maybe we're doing something wrong there, maybe opening and closing too many connections?

chaws commented 3 years ago

I think I know why ProjectStatus.create_or_update failed: Because celery_chord sends it to the default queue, only our regular workers will be able to receive it, and they are configured to have 1G of RAM and it needs about 4G (!) for that. I think we need to redirect update_build_status to the ci_fetch queue.

mwasilew commented 3 years ago

@chaws do we have squad.core task that does the same? I don't want to route plugin tasks in squad.settings. This isn't right thing to do.

chaws commented 3 years ago

@mwasilew we wouldn't need to change settings, this will do: https://github.com/Linaro/squadplugins/pull/31

Linaro / squadplugins

tradefed: use PluginScratch to pass long strings to tasks #30