elastic / elastic-package

elastic-package - Command line tool for developing Elastic Integrations

System test CI is missing agent error and never quits #376

Open P1llus opened 3 years ago

P1llus commented 3 years ago

Thanks to @marc-gr for catching this one; it has been causing headaches for me as well, and his pointers helped me find the issues in my PRs.

When certain errors appear in agent.log during system tests, they are not caught, and one of two things will happen:

  1. System tests both locally and on CI will just hang/loop until jenkins times out.
  2. System test CI will fail on the step "Running elastic-package stack down".

Neither of these really shows what the error is, which makes it quite hard to troubleshoot. However, Marc managed to find this log line:

2021-06-03T14:29:40.250Z ERROR fleet/fleet_gateway.go:180 failed to dispatch actions, error: fail to generate program configuration: failed to add stream processor to configuration: InjectStreamProcessorRule: processors is not a list

While this error is specific to how we wanted to add processors to the hbs files (there was a chance that processors was null), I feel that system tests should catch any issue with generating the configuration and instantly fail/stop the CI, as the run will otherwise go on for a LONG time before failing.
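The failure mode can be illustrated with a minimal sketch. This is hypothetical Python, not the agent's actual Go code; it only mimics the behavior behind the error message: an empty `processors:` key in the rendered YAML parses to null rather than an empty list, which the injection rule rejects.

```python
# Minimal sketch (hypothetical) of the check behind
# "InjectStreamProcessorRule: processors is not a list": a bare
# `processors:` key in YAML parses to None/null, not an empty list.
def inject_stream_processor(stream: dict) -> None:
    processors = stream.get("processors")
    if not isinstance(processors, list):
        # None (from an empty `processors:` line) fails here,
        # just like the agent's injection rule
        raise TypeError("InjectStreamProcessorRule: processors is not a list")
    processors.append({"add_fields": {"fields": {"stream": "demo"}}})
```

With processors rendered as a list the injection succeeds; with a bare `processors:` key it fails immediately, which is the error Marc found in agent.log.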

mtojek commented 3 years ago

@jen-huang Do you have an idea on how we can improve user experience on the fleet side?

jen-huang commented 3 years ago

While this error is specific to how we wanted to add processors to the hbs files, and there was a chance that processors was null

Can one of you provide more context around this issue? What was wrong with the hbs file or agent yaml?

marc-gr commented 3 years ago

While this error is specific to how we wanted to add processors to the hbs files, and there was a chance that processors was null

Can one of you provide more context around this issue? What was wrong with the hbs file or agent yaml?

In this specific case, there was a possibility that processors: ended up being empty. Initially we assumed that would cause no error, but it was breaking the config, as @P1llus mentioned.

jen-huang commented 3 years ago

@marc-gr Sorry, I'm not really familiar with the "system tests" that are executed here, so I need even more context :)

Are these tests run using the test packages in this repo? https://github.com/elastic/elastic-package/tree/f171846c52f3301e35d8b771bd8aaaaeb889f57d/test/packages

Is the issue that one of these packages has incorrect contents in the *.hbs files, leading to an empty processors: field?

Fleet in Kibana does not really do any validation on the generated agent YAML. We compile the YAML based on the package *.hbs files plus user configuration in policies.

mtojek commented 3 years ago

Let me rephrase the problem:

Whenever faulty *.hbs files are pushed to the package, it's hard/impossible to find the problem using Fleet UI (logs view). In the issue description, Marius mentioned the following error:

2021-06-03T14:29:40.250Z ERROR fleet/fleet_gateway.go:180 failed to dispatch actions, error: fail to generate program configuration: failed to add stream processor to configuration: InjectStreamProcessorRule: processors is not a list

It would be great if there is a way to pass this one up to Kibana.

P1llus commented 3 years ago

Just to confirm, is there a reason why it has to be passed up to Kibana for the system tests, @mtojek? The issue was mostly that system tests do not quit/fail when an error is reported by Fleet; are we currently not able to read output from the Fleet container during tests?
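One possible fail-fast approach, sketched here under the assumption that the test runner can stream the container's log lines (the patterns are taken from the error quoted in the issue description; everything else is hypothetical), would be to scan for known-fatal configuration errors and abort instead of waiting for the CI timeout:

```python
import re

# Patterns for errors that should abort a system test run immediately.
# Both fragments come from the log line quoted in the issue description.
FATAL_PATTERNS = [
    re.compile(r"failed to dispatch actions"),
    re.compile(r"fail to generate program configuration"),
]

def first_fatal_line(log_lines):
    """Return the first log line matching a known-fatal pattern, else None."""
    for line in log_lines:
        if any(p.search(line) for p in FATAL_PATTERNS):
            return line
    return None
```

An explicit list of known-fatal patterns would limit false positives compared to failing on every ERROR-level line.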

mtojek commented 3 years ago

The workflow is as follows:

  1. Assign new policy to agent.
  2. Start watching data stream for new metrics until they're present or timeout.

There is no API provided by Kibana or Agent to check if something went bad. I'm not sure we can just observe agent logs without hitting false positives. I suppose that if there is an error in ingestion, it should also be visualized in Kibana. Then we could consider Kibana the "source of truth" for ingestion health and avoid scraping status data from multiple endpoints (agent, Fleet Server, ingest pipelines, etc.).
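The watch step above amounts to a poll-until-deadline loop. A rough sketch (hypothetical helper names, with an injectable clock) shows why a broken policy hangs: nothing inside the loop ever inspects agent health, so it spins until the deadline.

```python
import time

# Sketch of the "watch the data stream" step (step 2 above).
# fetch_hit_count is a hypothetical callable returning the number of
# documents currently in the data stream. Note there is no error check
# inside the loop: if the agent never ingests anything because its
# configuration is broken, this spins until the deadline.
def wait_for_docs(fetch_hit_count, timeout_s=600, interval_s=5,
                  clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout_s
    while clock() < deadline:
        if fetch_hit_count() > 0:
            return True
        sleep(interval_s)
    return False
```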

jen-huang commented 3 years ago

Thanks @mtojek for reframing the problem, I think I understand now. As far as reporting agent health in Kibana, we already surface that information based on a few fields that the agent reports. That health status does not look at individual logs; it is up to the agent to determine which kinds of errors put it in a degraded state and to check in with the Fleet Server with the appropriate status.

I'm not sure we can just observe agent logs without hitting false positives. I suppose that if there is an error in ingestion, it should also be visualized in Kibana. Then we could consider Kibana the "source of truth" for ingestion health and avoid scraping status data from multiple endpoints

That is fair, and you should be able to get overall agent health (and thus ingestion state) from Fleet's existing agent APIs: GET /api/fleet/agents and GET /api/fleet/agents/{agent id}
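As a sketch of using those endpoints from a test runner: the endpoint path is taken from the comment above, but the auth header style and the response shape (agents under a "list" key, each carrying a "status" field) are assumptions that may vary by Kibana version.

```python
import json
from urllib import request

def unhealthy_agents(agents):
    """Return the ids of agents whose reported status is not 'online'."""
    return [a["id"] for a in agents if a.get("status") != "online"]

def fetch_agents(kibana_url, api_key):
    # GET /api/fleet/agents, per the comment above; the auth header style
    # and the {"list": [...]} response shape are assumptions.
    req = request.Request(
        kibana_url + "/api/fleet/agents",
        headers={"Authorization": "ApiKey " + api_key, "kbn-xsrf": "true"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp).get("list", [])
```

A test runner could call this between polling intervals and fail the run as soon as any agent leaves the healthy state, instead of waiting for the data-stream timeout.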