And while I still have the environment, let's also see the `w.waitIndex`:
```
$ grep watcher\.go /tmp/filebeat.log | tail -n5
2021-07-24T00:03:54.294Z DEBUG [nomad] nomad/watcher.go:127 Allocations index is unchanged remoteWaitIndex=25 == localWaitIndex=25
2021-07-24T00:04:08.864Z DEBUG [nomad] nomad/watcher.go:106 Syncing allocations and metadata
2021-07-24T00:04:08.864Z DEBUG [nomad] nomad/watcher.go:107 Starting with WaitIndex=25
2021-07-24T00:04:24.232Z DEBUG [nomad] nomad/watcher.go:120 Found 5 allocations
2021-07-24T00:04:24.232Z DEBUG [nomad] nomad/watcher.go:127 Allocations index is unchanged remoteWaitIndex=25 == localWaitIndex=25
```
And from the Nomad event stream we can see that the value of `25` has already passed the two `run` events that were missed, so these will not be picked up:
```
1a7c549f-9895-ea29-9ad0-630e95ddaf4f CreateIndex=11 ModifyIndex=20 AllocModifyIndex=11 DesiredStatus=run ClientStatus=running Type=AllocationUpdated
a02e35fa-c0bd-bf0e-107f-0f6a045664e2 CreateIndex=11 ModifyIndex=20 AllocModifyIndex=11 DesiredStatus=run ClientStatus=running Type=AllocationUpdated
```
Pinging @elastic/integrations (Team:Integrations)
cc: @jsoriano
At the risk of this already being way more verbose than it needs to be for a single-line change, here is the `filter-alloc-events-2.jq` used above to reduce the Nomad event API messages:
```jq
.Events | map(
  (.Payload.Allocation.ID | tostring)
  + " CreateIndex=" + (.Payload.Allocation.CreateIndex | tostring)
  + " ModifyIndex=" + (.Payload.Allocation.ModifyIndex | tostring)
  + " AllocModifyIndex=" + (.Payload.Allocation.AllocModifyIndex | tostring)
  + " DesiredStatus=" + .Payload.Allocation.DesiredStatus
  + " ClientStatus=" + .Payload.Allocation.ClientStatus
  + " Type=" + .Type
) | .[]
```
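It reads the raw event stream JSON and prints one summary line per allocation event; the allocation lines shown above are its output.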
@zemm thanks for reporting this! Do you think that only replacing `AllocModifyIndex` with `ModifyIndex` would fix this issue?

I am not planning to work with nomad autodiscover soon, but if you can open a pull request with the proposed change and you have the chance to try it on a real deployment, I will be happy to review the changes.
@jsoriano I'm sure the current version isn't usable; it seems I might be the first one to actually run it :sweat_smile:. From some brief testing earlier, I think this change is enough, but I can (and must) run this on a real, albeit small (3+3 nodes), deployment within a couple of weeks. I can monitor that for a while and report how it's going.
For now, I'd rather not sign the Contributor Agreement just for this one line. Maybe if there are more complex changes or improvements later.
And thank you for this feature!
I've now run Filebeat 7.14.1 with this one-line patch for a week (I was busy with other things) with 6 Nomad client hosts and 50+ allocations:

```diff
- if w.waitIndex > alloc.AllocModifyIndex {
+ if w.waitIndex > alloc.ModifyIndex {
```

There have been a few rotations and restarts, and logging seems to have kept up. I have not verified that every log line has been collected, but currently I have no reason to believe otherwise. I'll see if I have time at some point to go through some verification, but it looks to be working for now.
I checked all of the currently running 53 allocations, and all of them have been picked up by this patched version, meaning that I found plenty of logs for each of the 53 allocation IDs in the central log storage they should be sent to. 29 of these allocations were created after the Filebeat instances were started.
I won't be checking the log lines themselves, since that's just the regular `log`-type input watching regular files.
I'll be checking back if something comes up, but looks good to me.
Hey, sorry to piggyback on this issue, but I can't seem to get the logs from the nomad autodiscovery to be processed. I've tried this config:
```yaml
filebeat.autodiscover.providers:
  - type: nomad
    node: redacted
    address: "http://redacted:4646"
    secret_id: "redacted"
    scope: node
    hints.enabled: true
    templates:
      - config:
          - type: log
            paths:
              - "/var/nomad/alloc/${data.nomad.alloc_id}/alloc/logs/${data.nomad.task.name}.stderr.[0-9]*"
```
Filebeat is indeed seeing events (even with the bug @zemm mentioned), but no logs are output.
```
autodiscover/autodiscover.go:172 Got a start event. {"autodiscover.event": {"config":[],"host":"redacted","id":"cf62901c-16e4-2528-a4a7-1a35fa92bb35-app","meta":{"nomad":{"allocation":{"id":"cf62901c-16e4-2528-a4a7-1a35fa92bb35","name":"app[2]","status":"running"},"datacenter":["dc5"],"job":{"name":"app","type":"service"},"namespace":"default","region":"region","task":{"name":"app"}}},"nomad":{"allocation":{"id":"cf62901c-16e4-2528-a4a7-1a35fa92bb35","name":"app[2]","status":"running"},"datacenter":["dc5"],"job":{"name":"app","type":"service"},"namespace":"default","region":"region"},"provider":"f6f26bdf-985f-41f1-9620-368c5fdf6bcc","start":true}}
```
I'm running filebeat 7.15.0 directly on the VM host (Ubuntu 20.04), not inside Nomad. Does someone have a clue, or is anyone else using this?
@zemm thanks for testing the change! I am sorry I couldn't dedicate time to this, but on a quick look I wonder if this condition would also need to be updated? https://github.com/elastic/beats/blob/85621341020e21ba543f2305e9a1e35f66aacb63/x-pack/libbeat/common/nomad/watcher.go#L159
Would you have the chance of opening a PR for further discussion?
Longer-term it'd be nice to reimplement this watcher using the events API instead of periodically polling, but this is another story :slightly_smiling_face:
How did I miss that :scream: Thanks!
I'll try to find time to replicate missed events with in-place allocation updates, which on a quick glance could be the case with the current code, and then try with a fix.
For the PR, unfortunately I still don't want to bind myself and save my personal details into some legal agreement system over an obvious fix of 2 words. I also fully understand the need for the agreement, and therefore won't contribute a formal commit and PR at this time. Sorry for making things tedious :(
@zemm understood, thanks for your time so far! I have created a PR with the code you propose, and a test case that seems to reproduce the issue. Please take a look: https://github.com/elastic/beats/pull/28700
Let me know if you can confirm that the change in the other condition works for you.
@jsoriano thanks, and sorry, I did not have (and will not have this month) the time or energy to test the `OnUpdate` logic on this. I did some quick testing, but in the end I wasn't sure what the proper scenario and implications would have been.
And sorry to comment too late on a resolved ticket, but one thing I did notice previously was that in some cases `w.handler.OnDelete()` will be called multiple times. I've not had any problems with this in practice. Is that method idempotent, and therefore, while not optimal, not an immediate problem?
Hey @zemm, no problem, I think that the changes you proposed are good. If you have the chance of doing more tests, please keep us updated! I think we will release these changes in 7.16.0 and 8.0.0.
> And sorry to comment too late on a resolved ticket, but one thing I did notice previously was that in some cases `w.handler.OnDelete()` will be called multiple times. I've not had any problems with this in practice. Is that method idempotent, and therefore, while not optimal, not an immediate problem?
Umm, this should only happen when the allocation is being terminated. I guess it can transition through different termination statuses, and for each one the provider may call this handler, but it should be fine: multiple deletions should be idempotent, as you say. We can revisit this in any case if it becomes problematic.
Thanks for your help on this!
Nomad autodiscover, introduced in #16853, seems to completely miss some new Nomad tasks, and therefore never ships any logs for them. The cause seems to be that it uses the event's `AllocModifyIndex` to check for changes, when it should actually use `ModifyIndex`.

Test setup
1) Let's start a local Nomad dev instance. I'm using sudo here because my tasks run with the exec driver. This also requires sudo for the alloc tasks later.
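Something like this (a minimal sketch; any flags beyond `-dev` are up to your setup):

```sh
# Dev-mode single-node agent; sudo so the exec driver can isolate tasks.
sudo nomad agent -dev
```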
2) Let's run filebeat with a minimal Nomad autodiscovery config outputting to `/tmp`:
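A minimal sketch of the kind of config used, not the exact one; the alloc log path is an assumption (the dev agent prints its data dir on startup):

```yaml
# Sketch only: provider options and paths are assumptions for a dev-mode agent.
filebeat.autodiscover:
  providers:
    - type: nomad
      scope: node
      templates:
        - config:
            - type: log
              paths:
                - "/var/nomad/alloc/${data.nomad.alloc_id}/alloc/logs/${data.nomad.task.name}.*"

output.file:
  path: "/tmp"
  filename: filebeat

logging.level: debug
```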
3) Let's save Nomad allocation events for later comparison:
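A sketch using the Nomad event stream API; the output file name is an assumption, and the stream stays open, so stop it with Ctrl-C once the test jobs are done:

```sh
# Stream Allocation-topic events from the local dev agent and save them.
curl -s "http://localhost:4646/v1/event/stream?topic=Allocation" > /tmp/nomad-events.json
```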
4) Let's run some test jobs

I created two test jobs with tasks that are just `exec` scripts which produce a test log by printing out their name and a counter from 1 to 10 at intervals of 1 second, after which they idle silently. There should be 10 lines of log per task at the end.

Let's run two test jobs. The first one consists of two groups of one task each. The second one has one group with three tasks. It's not that important, but this setup reproduced the problem for me. Optionally wait some time in between.
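A sketch of the first job under these assumptions (job, group, and task names are made up; the second job is the same idea with one group of three tasks):

```hcl
# Sketch only: names and the datacenter are assumptions.
job "test1" {
  datacenters = ["dc1"]

  group "group1" {
    task "task1" {
      driver = "exec"
      config {
        command = "/bin/sh"
        args    = ["-c", "for i in $(seq 1 10); do echo task1 $i; sleep 1; done; while true; do sleep 3600; done"]
      }
    }
  }

  group "group2" {
    task "task2" {
      driver = "exec"
      config {
        command = "/bin/sh"
        args    = ["-c", "for i in $(seq 1 10); do echo task2 $i; sleep 1; done; while true; do sleep 3600; done"]
      }
    }
  }
}
```

Run both with `nomad job run`, optionally waiting a bit between them.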
Inspecting the results
We should have five allocations:
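For example, with the job names assumed above:

```sh
# The allocation lists of both jobs together should show five allocations.
nomad job status test1
nomad job status test2
```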
The allocations should have produced 50 lines of log (5 times 10 lines):
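One way to check, using allocation IDs from the status output (the ID and task name below are placeholders):

```sh
# Each task should have written exactly 10 lines to its stdout log.
nomad alloc logs 1a7c549f task1 | wc -l
```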
We should have 50 lines of filebeat log, but in my case I only got 30. This will vary between the runs, but most of the time some tasks are never picked up.
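With the file output from the config sketch in step 2, counting shipped events is just:

```sh
# One JSON event per line in the file output.
wc -l /tmp/filebeat
```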
Analyzing the problem
The watcher prints out allocation info when it receives an event for an allocation that is running:
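Those lines can be pulled out of the Filebeat log the same way as earlier in this thread:

```sh
grep watcher\.go /tmp/filebeat.log
```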
I only got 3 of the 5 allocations here, which matches the 30 log lines shipped. Now let's inspect the Nomad allocation event stream that was saved in the beginning:
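Using the `filter-alloc-events-2.jq` filter shown earlier in this thread, with the file name assumed from step 3:

```sh
jq -r -f filter-alloc-events-2.jq /tmp/nomad-events.json
```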
As we can see, Filebeat completely missed allocations `a02e35fa-c0bd-bf0e-107f-0f6a045664e2` and `1a7c549f-9895-ea29-9ad0-630e95ddaf4f`. Furthermore, we can see that `AllocModifyIndex` remains the same in consecutive events for a given allocation, while `ModifyIndex` grows for each event per allocation. Also, `ModifyIndex` only ever advances, whereas `AllocModifyIndex` sometimes chronologically goes backwards, yet it is compared against an advancing counter.

Now looking at the `nomad/watcher.go` code:
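The condition in question is the one shown in the patch earlier in this thread. Here is a self-contained sketch, not the real watcher code, with illustrative index values shaped like the event stream output above, of why the choice of index matters:

```go
package main

import "fmt"

type alloc struct {
	id               string
	allocModifyIndex uint64
	modifyIndex      uint64
}

func main() {
	waitIndex := uint64(15) // the watcher's stored index after an earlier sync (illustrative)

	// A later update to an existing allocation: ModifyIndex has advanced past
	// waitIndex, but AllocModifyIndex has not, matching the event stream output
	// above, where AllocModifyIndex stays at its earlier value.
	a := alloc{id: "1a7c549f", allocModifyIndex: 11, modifyIndex: 20}

	fmt.Println("buggy skip:", waitIndex > a.allocModifyIndex) // true: the update is wrongly dropped
	fmt.Println("fixed skip:", waitIndex > a.modifyIndex)      // false: the update is processed
}
```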
This should actually compare against `ModifyIndex` instead. I'd maybe also consider adding a debug message for the skip, since that's what helped me while debugging this.

I've built and tested this earlier, but don't have it on hand now.