Open P1llus opened 3 years ago
@jen-huang Do you have an idea on how we can improve user experience on the fleet side?
While this error is specific to how we wanted to add processors to the hbs files, and there was a chance that processors was null
Can one of you provide more context around this issue? What was wrong with the hbs file or agent yaml?
In this specific case, there was the possibility that processors: ended up being empty. Initially we assumed that would not cause an error, but it was breaking the config, as @P1llus mentioned.
@marc-gr Sorry, I'm not really familiar with the "system tests" that are executed here, so I need even more context :)
Are these tests run using the test packages in this repo? https://github.com/elastic/elastic-package/tree/f171846c52f3301e35d8b771bd8aaaaeb889f57d/test/packages
Is the issue that one of these packages has incorrect contents in the *.hbs files, leading to an empty processors: field?
Fleet in Kibana does not really do any validation on the generated agent YAML. We compile the YAML based on the package *.hbs files plus user configuration in policies.
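To make the failure mode concrete: an empty processors: key in YAML parses to null rather than to an empty list, which is consistent with the "processors is not a list" error reported in this issue. Below is a minimal sketch of a check that could run on an already-parsed policy. The inputs/streams layout and the function name find_non_list_processors are assumptions for illustration, not Fleet's actual schema or API.

```python
from typing import Any, Dict, List


def find_non_list_processors(policy: Dict[str, Any]) -> List[str]:
    """Return the path of every `processors` key whose value is not a list.

    The inputs -> streams layout is an assumed, simplified policy shape.
    """
    problems = []
    for i, inp in enumerate(policy.get("inputs") or []):
        for j, stream in enumerate(inp.get("streams") or []):
            # An empty `processors:` key in YAML is parsed as None, not [].
            if "processors" in stream and not isinstance(stream["processors"], list):
                problems.append(f"inputs[{i}].streams[{j}].processors")
    return problems


# Hypothetical parsed policy: the second stream mimics an empty `processors:` key.
policy = {
    "inputs": [
        {
            "streams": [
                {"processors": [{"add_fields": {"fields": {"a": 1}}}]},
                {"processors": None},
            ]
        }
    ]
}
print(find_non_list_processors(policy))
```

A check like this would only flag the one shape of error discussed here; it is not a substitute for full validation of the generated YAML.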
Let me rephrase the problem:
Whenever faulty *.hbs files are pushed to the package, it's hard/impossible to find the problem using Fleet UI (logs view). In the issue description, Marius mentioned the following error:
2021-06-03T14:29:40.250Z ERROR fleet/fleet_gateway.go:180 failed to dispatch actions, error: fail to generate program configuration: failed to add stream processor to configuration: InjectStreamProcessorRule: processors is not a list
It would be great if there were a way to pass this one up to Kibana.
Just to confirm, is there a reason why it has to be passed up to Kibana for the system tests, @mtojek? The issue was mostly that system tests do not quit/fail if an error is presented by Fleet. Are we currently not able to read output from the Fleet container during tests?
The workflow is as follows:
There is no API provided by Kibana or Agent to check if something went wrong. I'm not sure if we can just observe agent logs without hitting any false positives. I suppose that if there is an error in ingestion, it should also be visible in Kibana. Then we can consider Kibana the "source of truth" for ingestion health and avoid scraping status data from multiple endpoints (agent, fleet server, ingest pipelines, etc).
Thanks @mtojek for reframing the problem, I think I understand now. As far as reporting agent health in Kibana, we already surface that information based on a few fields that the agent reports. That health status does not look at individual logs; it is up to the agent to determine which kinds of errors put it in a degraded state and to check in with the Fleet Server with the appropriate status.
I'm not sure if we can just observe agent logs without hitting any false positives. I suppose that if there is an error in ingestion, it should also be visible in Kibana. Then we can consider Kibana the "source of truth" for ingestion health and avoid scraping status data from multiple endpoints
That is fair, and you should be able to get overall agent health (and thus ingestion state) from Fleet's existing agent APIs: GET /api/fleet/agents, GET /api/fleet/agents/{agent id}
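As a rough illustration of how a test runner might use the agents endpoint mentioned above, here is a minimal sketch. The response shape (an items or list key holding entries with id and status fields), the set of statuses treated as healthy, and the helper names are all assumptions and should be checked against the Fleet API of the Kibana version in use.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Assumption: any status outside this set is worth failing a test run on.
HEALTHY_STATUSES = {"online", "healthy"}


def fetch_agents(kibana_url: str, auth_header: str) -> Dict[str, Any]:
    """Call GET /api/fleet/agents and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{kibana_url.rstrip('/')}/api/fleet/agents",
        headers={"Authorization": auth_header},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def unhealthy_agents(payload: Dict[str, Any]) -> List[str]:
    """Return the ids of agents whose reported status is not healthy.

    Assumes agent entries live under an `items` (or `list`) key and
    carry `id` and `status` fields.
    """
    agents = payload.get("items") or payload.get("list") or []
    return [
        agent.get("id", "<unknown>")
        for agent in agents
        if agent.get("status") not in HEALTHY_STATUSES
    ]


# Hypothetical response body, for illustration only.
sample = {
    "items": [
        {"id": "agent-1", "status": "online"},
        {"id": "agent-2", "status": "degraded"},
    ]
}
print(unhealthy_agents(sample))
```

Whether this catches the failure in this issue depends on the agent actually reporting a degraded or error status when configuration generation fails, which the thread does not confirm.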
Thanks to @marc-gr for catching this one. It has been causing headaches for me as well, and his pointers helped me find the issues in my PRs.
When certain errors happen in agent.log during system tests, they are not caught, and one of two things will happen:
Neither of these really shows what the error is, which makes it quite hard to troubleshoot. However, Marc managed to find this log line:
2021-06-03T14:29:40.250Z ERROR fleet/fleet_gateway.go:180 failed to dispatch actions, error: fail to generate program configuration: failed to add stream processor to configuration: InjectStreamProcessorRule: processors is not a list
While this error is specific to how we wanted to add processors to the hbs files, and there was a chance that processors was null, I feel that system tests should catch any issues with generating configuration and instantly fail/stop the CI, as otherwise the run will go on for a LONG time before failing.
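One possible shape for such a fail-fast check is sketched below: scan agent log lines for known configuration-generation failures and stop the test as soon as one appears. The patterns are taken from the log line quoted in this issue; the plain-text log format, the pattern list, and the function name find_fatal_lines are assumptions, and this is not how elastic-package currently works.

```python
import re
from typing import Iterable, List

# Assumption: these substrings mark configuration-generation failures.
# They come from the log line quoted in this issue; other failures
# would need their own patterns.
FATAL_PATTERNS = [
    re.compile(r"fail to generate program configuration"),
    re.compile(r"failed to dispatch actions"),
]


def find_fatal_lines(lines: Iterable[str]) -> List[str]:
    """Return ERROR-level lines that match one of the known fatal patterns."""
    hits = []
    for line in lines:
        # Skip anything that is not logged at ERROR level.
        if " ERROR " not in line:
            continue
        if any(pattern.search(line) for pattern in FATAL_PATTERNS):
            hits.append(line.rstrip("\n"))
    return hits


log = [
    "2021-06-03T14:29:39.000Z INFO application started",  # hypothetical line
    "2021-06-03T14:29:40.250Z ERROR fleet/fleet_gateway.go:180 failed to dispatch actions, "
    "error: fail to generate program configuration: failed to add stream processor to "
    "configuration: InjectStreamProcessorRule: processors is not a list",
]
fatal = find_fatal_lines(log)
print(len(fatal))
```

Matching only an explicit list of patterns at ERROR level is meant to limit the false positives raised as a concern earlier in the thread, at the cost of missing failures nobody has listed yet.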