fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Expose more information about unresponsive hosts in live query #211

Open anelshaer opened 3 years ago

anelshaer commented 3 years ago

When running a query on a large number of hosts, some of them are sometimes offline.

The query keeps waiting for them to reply. Would it be a good solution to have a configurable timeout and return two outcomes: the results, and a list of the offline hosts that did not reply to the query?

What do you think?
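
As a rough illustration of the idea (a sketch only, not how Fleet implements live queries): each host gets a fixed amount of time to answer, and anything that has not replied by then is reported separately. The names `run_live_query` and `fake_query_host`, the hostnames, and the timeout value are all hypothetical.

```python
import asyncio


async def run_live_query(hosts, query_host, timeout_s):
    """Query all hosts concurrently; return (results, unresponsive).

    `query_host` is a hypothetical coroutine that returns rows for one host.
    """

    async def query_one(host):
        try:
            rows = await asyncio.wait_for(query_host(host), timeout_s)
            return host, rows
        except asyncio.TimeoutError:
            # The host did not answer within the configured timeout.
            return host, None

    outcomes = await asyncio.gather(*(query_one(h) for h in hosts))
    results = {host: rows for host, rows in outcomes if rows is not None}
    unresponsive = [host for host, rows in outcomes if rows is None]
    return results, unresponsive


async def fake_query_host(host):
    # Simulated hosts: "db-03" never answers in time, the others reply quickly.
    await asyncio.sleep(5 if host == "db-03" else 0.01)
    return [{"host": host, "package_version": "1.2.3"}]


if __name__ == "__main__":
    results, unresponsive = asyncio.run(
        run_live_query(["web-01", "web-02", "db-03"], fake_query_host, timeout_s=0.2)
    )
    print("replied:", sorted(results))
    print("no reply:", unresponsive)
```

The caller gets both outcomes in one call: the rows from hosts that replied, and the list of hosts that did not reply before the timeout.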

noahtalerman commented 3 years ago

Hi @anelshaer. In the upcoming 3.7.0 release, we plan to reveal more information when running a live query in the Fleet UI.

This information will include the error message(s) osquery returns if a host fails and the number of offline hosts targeted during a live query. More discussion can be found in the comments in issue #192.

In the coming months, we'd like to improve the live query experience beyond the immediate changes above. Your suggestions and use cases are very helpful in determining these additional improvements. How many hosts do you typically run one live query against? Why would you like to see a list of the offline hosts that did not reply?

anelshaer commented 3 years ago

Hi @noahtalerman

Apologies for the late reply, I totally missed it. The last time I ran a live query on a large number of hosts it was 15K, but I expect that to more than double in the coming month.

We were trying to scope which machines didn't get an update or a specific version of a package. It's critical to know which machines did not respond so we can target them using a different tool, maybe fix osquery, or check the machine's health, etc.

noahtalerman commented 3 years ago

> It's critical to know which machines did not respond so we can target them using a different tool, maybe fix osquery, or check the machine's health, etc.

This makes sense. Thank you for explaining some of the motivation for identifying which of the targeted hosts are offline.

When you'd like to see which hosts are offline while running a live query, are you targeting groups of hosts with labels or with a different grouping mechanism?

What identifying information (e.g. hostname) is helpful for the next step you described, when you'd like to target the offline hosts using a different tool, fix osquery, etc.?

anelshaer commented 3 years ago

For these cases, I normally target "ALL HOSTS" or labels. I'm not sure if there is another method of grouping you could have or suggest. Information like the hostname (plus, optionally, the IP) is good enough for the next step, I guess.

noahtalerman commented 3 years ago

> For these cases, I normally target "ALL HOSTS" or labels

To see all offline hosts in Fleet, one grouping method is to view the list of hosts on the Hosts page and select the "Offline" filter in the right sidebar. Is this not sufficient for your use case?

anelshaer commented 3 years ago

Thanks for sharing. I've been using Fleet for almost 2 years, so I know this tag exists.

I usually target different labels, for which this tag won't work, I guess. And when targeting "ALL HOSTS" I really don't trust it: some machines may have come back online, or some other situation may have come up. Having offline machines as part of the results would make more sense, I guess.

noahtalerman commented 3 years ago

Understood. I really appreciate your responses. When I ask if the "All Hosts" tag on the Hosts page is sufficient for your use case, I'm referring to this specific use case you mentioned:

> It's critical to know which machines did not respond so we can target them using a different tool, maybe fix osquery, or check the machine's health, etc.

I'm most curious about how Fleet can best help you achieve the above goal of targeting offline hosts with a different tool, fixing osquery, etc.

Maybe the best way Fleet can help is to include offline machines as part of the results like you mentioned. Or maybe another solution is the ability to export a list of all offline machines from the Hosts page.

anelshaer commented 3 years ago

I really appreciate all the discussion you've brought since you took over looking into these submissions, Noah. I guess having two downloadable lists, one of offline hosts and one of hosts that were able to respond (online hosts) even if they didn't return results, would be very useful.

noahtalerman commented 3 years ago

I'm making the assumption that you could use this list of offline hosts to target the offline hosts using a different tool, maybe fix osquery, or check machine's health.

After collecting information on which hosts are currently offline (the list), what steps do you normally take to complete the above process (using a different tool to target, fixing osquery, checking the machine's health)?

anelshaer commented 3 years ago

The offline hosts are a bit tricky to use because, you know, it's normal that some machines get decommissioned. But smaller lists would be better: when you run a scan, you get one list of offline hosts and one of online hosts.

Then you can compare them and see which machines are decommissioned and which are not. Then you can target the existing servers using Puppet/Ansible/Rundeck to get them into a ready state.
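
A minimal sketch of that comparison, assuming the offline list from a query run and a separately maintained list of decommissioned machines are both available as plain hostnames. The hostnames and the `hosts_to_remediate` helper are made up for illustration.

```python
def hosts_to_remediate(offline_hosts, decommissioned_hosts):
    """Return offline hosts that are NOT decommissioned, i.e. still expected online."""
    return sorted(set(offline_hosts) - set(decommissioned_hosts))


# Hypothetical inputs: the offline list from a scan and a decommission inventory.
offline = ["app-07", "db-02", "old-build-01", "web-11"]
decommissioned = ["old-build-01", "old-build-02"]

targets = hosts_to_remediate(offline, decommissioned)
print(targets)  # ['app-07', 'db-02', 'web-11']
```

The remaining hostnames are the ones worth handing to a configuration management tool to bring back into a ready state.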

noahtalerman commented 3 years ago

Aha, your response is very helpful. I now understand that once you know which offline hosts are actually expected to be online, you can target these hosts with the tools you mentioned to get them back online and ready to be queried.

Right now, returning a list of offline hosts is one solution we've been discussing. I imagine there may be other future solutions to bring these expected hosts back online (ready state).

The information you've included about decommissioned hosts is new to me. Once you know which hosts have been decommissioned, do these decommed machines remain enrolled to Fleet? Are you manually removing them?

anelshaer commented 3 years ago

Decommissioned hosts are no longer needed. Fleet has a setting for offline hosts; I guess I configured it to 3 days. If a host is offline for 3 days then it's considered dead and no longer needed, I guess. I'll have to think about that, really!

anelshaer commented 3 years ago

To share a recent case I just had:

There are 70 offline hosts. When I ran the query, the information shown was confusing: the offline hosts are always included in the total number of hosts being queried, and Fleet keeps waiting for them to reply. There are also failed hosts, which I guess are something different from offline.

This is not a state you want to see, because in a security investigation the idea is to know which hosts failed, which are offline, and which are online, so you can treat each group differently.

[Two screenshots attached]

anelshaer commented 3 years ago

@noahtalerman in the above, the offline tab showed 70 offline, but if you do the calculation you will see 78 hosts that did not respond. 6 are failed, so 2 machines are in an unknown state. I hope we can get better visibility into that data.

noahtalerman commented 3 years ago

Your example use case is very helpful.

Given what you've shared, the Hosts page could be used to achieve the goal of identifying WHICH hosts are offline (the 70 hosts).

For example, you might run a live query and see that 9721 out of 9799 hosts are responding. Then when you visit the Hosts page with the offline filter applied, you're presented with a list of the 70 hosts that are offline.

I agree that the offline filter on the Hosts page is only somewhat helpful because, when you run a live query, Fleet doesn't tell you which hosts failed and which hosts are in this "unknown" state. In your example, there are 8 hosts unaccounted for. I agree that this is inconsistent and confusing.
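
Spelling out the arithmetic from this example: the counts of 9799 targeted, 9721 responding, and 70 offline are from above, and the 6 failed hosts come from the earlier comment. The variable names are only for illustration.

```python
targeted = 9799   # hosts targeted by the live query
responded = 9721  # hosts that returned results
offline = 70      # hosts shown by the offline filter
failed = 6        # hosts reported as failed

not_responded = targeted - responded                 # 78
unaccounted_after_offline = not_responded - offline  # 8
unknown = unaccounted_after_offline - failed         # 2

print(not_responded, unaccounted_after_offline, unknown)  # 78 8 2
```

So 8 hosts are missing from the offline list, and even after subtracting the 6 failures, 2 hosts remain in an unknown state.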

Do you not "trust" the list of offline hosts on the Hosts page because these results might be different from what you see when you run a live query? You mentioned some hosts might come back online in the time between you running a live query and you visiting the Hosts page.

anelshaer commented 3 years ago

First, I don't trust the offline filter because it's always inaccurate when you run a live query.

Second, I don't have a clue whether a machine's state will change during a live query.

That is why having a way to make the offline filter more accurate, and having live query results broken down into offline, online, and failed hosts, is crucial, especially if you are handing results to investigators or auditors.
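
To make the request concrete, here is a small sketch of what such a breakdown could look like, assuming each targeted host ends up with exactly one status. This is not Fleet's data model or export format; the status names, hostnames, and the helpers `bucket_hosts` and `to_csv` are hypothetical.

```python
import csv
import io
from collections import defaultdict


def bucket_hosts(outcomes):
    """Group hostnames by status ("online", "offline", "failed")."""
    buckets = defaultdict(list)
    for hostname, status in outcomes.items():
        buckets[status].append(hostname)
    return {status: sorted(names) for status, names in buckets.items()}


def to_csv(outcomes):
    """Render one row per host so the report can be handed off as a file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["hostname", "status"])
    for hostname in sorted(outcomes):
        writer.writerow([hostname, outcomes[hostname]])
    return buf.getvalue()


# Hypothetical per-host outcome of one live query run.
sample = {"web-01": "online", "web-02": "online", "db-02": "failed", "app-07": "offline"}

print(bucket_hosts(sample))
print(to_csv(sample))
```

Every targeted host appears exactly once in the report, so nothing is left in an unaccounted state.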

noahtalerman commented 3 years ago

> handing results to investigators or auditors

How do you typically hand off live query results to these individuals? Is it by exporting the live query results? Something else?

Related to this discussion: In the release of Fleet 3.7.1, we've made improvements to the live query UI experience which include revealing errors and presenting the number of offline hosts.

Please voice your feedback when you get the chance to try these changes!

dachin11 commented 3 years ago

I agree with the requestor here - having offline hosts separated from hosts with errors is key. I watched the demo from 224 and it looks much better!