alertmanager / alert_manager

Splunk Alert Manager with advanced reporting on alerts, workflows (modify assignee, status, severity) and auto-resolve features

Alert Manager not working at all on Splunk Cloud #250

Open ramirocastillo93 opened 4 years ago

ramirocastillo93 commented 4 years ago

Hi,

I've installed the Alert Manager app and add-on on my Splunk Cloud instance but I can't make it work. I've followed the instructions from the official documentation page (http://docs.alertmanager.info/en/latest/installation_manual/) but still, it isn't working at all.

I don't know what else to do. I've been reading and searching a lot, but I couldn't find anything helpful.

The image below shows errors from the _internal index; I don't know how to interpret them.

[screenshot: errors in the _internal index]

Let me know if you guys need some more information from me.

Thanks in advance!

BeanBagKing commented 4 years ago

I get the same thing, errors appear to be the same "insufficient permission to access this resource" among others. I've tried opening up the permissions on both the app and the alert as much as I am able to, but so far no luck.

BeanBagKing commented 4 years ago

Update:

I dug up a previous email I sent about this error with some redacted logs. The timestamp on these is pretty old, but nothing has changed in the behavior or messages.

We are trying to set up Alert Manager in our environment (Splunk Cloud) but are receiving an error when an alert fires. I believe the most relevant part of the error is “insufficient permission to access this resource”; however, I’ve included all events from Settings -> Alert Actions -> Alert Manager -> View log events in case there is something else we overlooked.

We tried changing the alert settings to global, with Everyone having Read permissions and power/sc_admin having Write permissions. This had no effect, the error still occurred, and we reverted to App sharing. We also came across a Splunk Answers comment indicating the user should have edit_tcp rights; however, this doesn’t appear to be a right we can assign, nor is it clear which user should have it. https://answers.splunk.com/answers/519114/why-am-i-no-longer-seeing-new-alerts-in-the-alerts.html

All of the events below share host = sh-i-xxxx.foobar.splunkcloud.com, source = /opt/splunk/var/log/splunk/splunkd.log, sourcetype = splunkd (newest first, as Splunk lists them):

```
07-23-2019 13:09:33.821 +0000 WARN  sendmodalert - action=alert_manager - Alert action script returned error code=1
07-23-2019 13:09:33.821 +0000 INFO  sendmodalert - action=alert_manager - Alert action script completed in duration=317 ms with exit code=1
07-23-2019 13:09:33.806 +0000 ERROR sendmodalert - action=alert_manager STDERR -  splunk.RESTException: [HTTP 403] ['message type=WARN code=None text=insufficient permission to access this resource;']
07-23-2019 13:09:33.806 +0000 ERROR sendmodalert - action=alert_manager STDERR -      raise splunk.RESTException, (serverResponse.status, msg_text)
07-23-2019 13:09:33.806 +0000 ERROR sendmodalert - action=alert_manager STDERR -    File "/opt/splunk/lib/python2.7/site-packages/splunk/input.py", line 180, in submit
07-23-2019 13:09:33.806 +0000 ERROR sendmodalert - action=alert_manager STDERR -      input.submit(event, hostname = socket.gethostname(), sourcetype = 'incident_change', source = 'alert_handler.py', index=index)
07-23-2019 13:09:33.806 +0000 ERROR sendmodalert - action=alert_manager STDERR -    File "/opt/splunk/etc/apps/alert_manager/bin/alert_manager.py", line 175, in createIncidentChangeEvent
07-23-2019 13:09:33.806 +0000 ERROR sendmodalert - action=alert_manager STDERR -      createIncidentChangeEvent(event, metadata['job_id'], settings.get('index'))
07-23-2019 13:09:33.806 +0000 ERROR sendmodalert - action=alert_manager STDERR -    File "/opt/splunk/etc/apps/alert_manager/bin/alert_manager.py", line 498, in <module>
07-23-2019 13:09:33.806 +0000 ERROR sendmodalert - action=alert_manager STDERR -  Traceback (most recent call last):
07-23-2019 13:09:33.503 +0000 INFO  sendmodalert - Invoking modular alert action=alert_manager for search="foobar - Fraud (Alert)" sid="scheduler_cGV0ZXJtMzBAZXJhdS5lZHU_ZXJhdV9pdHNzX2FwcA__RMD56f377aa7a6baaa0f_at_1563887100_13453" in app="foobar_itss_app" owner="user@foobar.com" type="saved"
```

my2ndhead commented 4 years ago

If you are using a custom index for Alert Manager, you also need to create the same index on the Search Head. Let me know if that works.
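For anyone unsure what that involves: on Splunk Cloud you would ask cloud ops to do it for you, but the underlying change is just an index definition on the search head. A minimal sketch of the indexes.conf stanza, assuming the index is named `alerts` (the name and paths here are examples, substitute your own):

```ini
# indexes.conf on the search head (sketch only; on Splunk Cloud,
# cloud ops create this for you -- the index name "alerts" is an example)
[alerts]
homePath   = $SPLUNK_DB/alerts/db
coldPath   = $SPLUNK_DB/alerts/colddb
thawedPath = $SPLUNK_DB/alerts/thaweddb
```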

my2ndhead commented 4 years ago

I also fixed an issue in the TA that affects Alert Manager 2.2.2.

john0499 commented 4 years ago

I have just set this up in Splunk Cloud and have it working. I had to log a support ticket and get cloud ops to enable the edit_tcp capability in my stack; it is not available by default. Once enabled, the alert manager roles show up.

I also had to ask ops to create my custom index on the search head. Even though you can create indexes in cloud via the search head, they only get configured on the indexers.

kbwin commented 4 years ago

Is the edit_tcp capability a requirement for the app? I've been given the impression that edit_tcp is not allowed at all in Splunk Cloud. We have the app installed but not working due to the security roles not being visible.

john0499 commented 4 years ago

The edit_tcp setting is allowed; you just have to open a case with support. As soon as they enabled it, the alert manager roles showed up. I also needed to get the alerts index created on the search head.

Regards, Lachlan

BeanBagKing commented 4 years ago

I've been continuing to work on this for the past few months. Dealing with Splunk Cloud is like communicating with Mars. At this point we are on Splunk Cloud 8.0.2, TA-alert_manager 2.3.1, and alert_manager 3.0.4.

Going to the Alert Manager Incident Posture page shows the following error for everything: `Error in 'SearchParser': Missing a search command before '('. Error at position '2' of search query '| ((index="main" OR index="alerts")) sourcetype="a'`.

The "alerts" index is present, and does have events going into it.
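For context, my understanding (an assumption on my part, not something I've confirmed against the app's dashboards) is that this error means the macro is being expanded right after a bare `|`, and SPL requires a generating command or a `search` before it. This fails to parse, because nothing generates events before the pipe:

```
| ((index="main" OR index="alerts")) sourcetype=...
```

while this parses fine, with the macro expanding inside the search command:

```
search `alert_manager_index` sourcetype=...
```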

The Health Check page shows three warnings:

- Check that non-admin users have been assigned the admin_manager role
- Check that non-admin users have been assigned the admin_manager_user role
- Check for deprecated TA-alert_manager

There are two problems I'm hoping someone can help me with.

  1. Does the error or warnings on the health check page give anyone an idea of what might be wrong? A direction I can point Splunk support in and ask them to go check on something?

  2. Regarding the Add-on (3365), the installation instructions are rather vague. Step 1 says to download the Add-on; then it says to "Install the App on the Search Head only. Uninstall any existing instances of TA-alert_manager." However, it never says where to install the Add-on, unless "App" there is supposed to mean the Add-on. Under the "Update" section, it says to remove all instances of TA-alert_manager. If that is correct, should the references to it be removed from the Installation section? TL;DR: is the TA/Add-on still used, and if so, exactly where should it be installed?

my2ndhead commented 4 years ago

The TA-alert_manager is no longer needed. I recommend at least installing the most recent version (its config matches that of the Alert Manager app) or, better, removing it entirely.

If events are not going into the index, it probably means edit_tcp is missing.

Can you check your alert_manager_index macro for correctness?

BeanBagKing commented 4 years ago

I removed TA-alert_manager, though that didn't seem to make a difference.

I do believe there are events going into our index. This is a search for "index=alerts": https://i.imgur.com/WRPDhiz.jpg I blurred out anything remotely identifiable, but you can see the index name, the timestamp, and the general format. This is the most recent event/group. Is there something else that should be there that isn't?

Regarding edit_tcp, I got the following feedback from Splunk some time ago:

> In Splunk Cloud, the "edit_tcp" capability is not allowed for customers, as it contains features that could cause server issues if not used correctly. However, this was addressed in Splunk Cloud 8.0.2003 GA (Balfour) by adding the "edit_log_alert_event" capability, which was created specifically for this reason.

We are on a newer version than 8.0.2003, so I -presume- this is enabled (per our request). However, without the ability to check myself, we will have to open a new ticket to make sure. I'm doing that this morning and will report back. I'm also wondering whether those who were told edit_tcp was enabled actually had edit_tcp enabled (inconsistencies with Splunk Support), or whether they were told it was while support was really just enabling the new equivalent.

Lastly, what exactly do you mean by "Can you check your alert_manager_index macro for correctness?". I'm happy to do so, just a little lost here.

BeanBagKing commented 4 years ago

Update on this. It does appear edit_tcp is enabled, if I'm looking in the correct place: Settings -> Roles -> select "alert_manager" -> view "2. Capabilities", where edit_tcp is selected as native.

As the app still wasn't working, we also gave it "edit_log_alert_event" per the Splunk Support feedback.

Edit: forgot to add, under Roles there is also an "alert_manager_user" role. It likewise has both edit_tcp and edit_log_alert_event.
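In case it helps anyone else double-check without clicking through the UI, the roles and their capabilities can also be listed with a search (just a sketch; it assumes your role is allowed to run the `rest` command):

```
| rest /services/authorization/roles splunk_server=local
| search title="alert_manager*"
| table title capabilities imported_capabilities
```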

Any further suggestions?

kbwin commented 4 years ago

I received similar feedback from Splunk Cloud Ops in regards to the "edit_tcp" capability and new "edit_log_alert_event" capability. I was able to get cloud ops to modify the authorize.conf deployed with Alert Manager. I believe they removed the "edit_tcp" capability and added the new "edit_log_alert_event" capability to the two Alert Manager roles in the authorize.conf. As soon as they did, I was able to see the "alert_manager" and "alert_manager_user" roles (they were not visible to me before). I added my account and others to the roles. So far Alert Manager seems to now be working correctly and I'm able to assign and edit incidents.

We are running Splunk Cloud 8.0.2007.1 and Alert Manager 3.0.3. We do not have the TA-alert_manager installed. Everything is green "ok" on the health check.
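I don't have the exact diff cloud ops applied, but my understanding is that it amounts to swapping the capability in the app's authorize.conf, something like this sketch (stanza names follow the standard role_<name> convention; treat the details as my assumption, not the actual change):

```ini
# authorize.conf (sketch of my understanding, not the actual diff)
[role_alert_manager]
# edit_tcp = enabled          <- removed; not allowed on Splunk Cloud
edit_log_alert_event = enabled

[role_alert_manager_user]
edit_log_alert_event = enabled
```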

It took a while to work through this with cloud ops, but I think those were basically the actions taken to get it working for us. I will say, I don't think we ever had the SearchParser error that you @BeanBagKing are receiving on the Alert Manager Incident Posture page.

BeanBagKing commented 4 years ago

When you say you had to get them to modify authorize.conf, does that imply these roles aren't something a user can modify themselves? Like I said, I was looking under Settings -> Roles. Within the alert_manager role in that location, there was an edit_tcp option I could select. If that isn't the right location, then that may be the source of all my issues.

I was hoping you were onto something with the users @kbwin, but after adding the role, I still see the same error.

scoxspau commented 4 years ago

> I received similar feedback from Splunk Cloud Ops in regards to the "edit_tcp" capability and new "edit_log_alert_event" capability. […]

Could you provide your Splunk support ticket number for this? I'm about to do the 8.0 upgrade & this will help the support team.

kbwin commented 4 years ago

Yeah, @BeanBagKing, it sounds like the edit_tcp capability has been enabled for your environment. The two roles were not showing up for me under Settings -> Roles until they made the modifications on their side. I can't even see edit_tcp in the capability list for selection on a role.

Ops was pretty clear with me that edit_tcp would not be allowed and that was why I could not see the roles initially. Fortunately, they eventually redirected me towards the new "edit_log_alert_event" capability. They had to make the changes but it seems to have worked in our case.

Your error on the dashboard seems more likely to be a configuration issue in the app. Any chance any of the app macros have been modified? (Settings -> Advanced Search -> Macros.) You should be able to find the alert_manager_index macro there.

kbwin commented 4 years ago

Yup, @scoxspau, case number 1879281. They might be a little confused reviewing the initial part of the case. Ops had suggested in a previous case that I submit a modified version of the app with the "edit_tcp" capability removed; I quickly realized during this case that that was not going to work.

john0499 commented 4 years ago

We're upgrading our stack from 7.x to 8.x with Alert Manager 3 next week. I've asked ops to make this edit_log_alert_event change during the upgrade. Will report back on how it goes.

BeanBagKing commented 4 years ago

> Your error on the dashboard almost seems more likely to be configuration issue in the app. Any chance any of the app macros have been modified? (Settings -> Advanced Search -> Macros) You should be able to find the alert_manager_index macro there.

Everything there looks correct for alert_manager_index. Definition is `(index="main" OR index="alerts")`, no owner, alert_manager app, sharing is global, status is enabled.
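If anyone else wants to sanity-check theirs the same way, expanding the macro inside a plain search should run without the parser error (just a quick test on my part, nothing app-specific):

```
search `alert_manager_index` | head 5
```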

HOWEVER, I see a lot of what looks like cloned definitions on this page: https://i.imgur.com/FHzxSKl.png

It looks like something wasn't uninstalled correctly. This is with a filter of "alert_manager". Unless this is what it's supposed to look like, any ideas for cleanup? I presume getting Splunk Support to uninstall/reinstall would remove everything and then put it back. Can anyone share a screenshot of working macros?

scoxspau commented 4 years ago

@john0499 did you get a response from ops/support? I asked a similar question and got a pretty unhelpful response...

john0499 commented 4 years ago

They couldn't commit to doing it during our stacks upgrade window, and if they don't I'll have to log another ticket once the upgrade is complete. I won't know until Friday.

Does anyone here use the Splunk usergroup Slack? There's an alertmanager channel if you want to discuss it there.

kbwin commented 4 years ago

@BeanBagKing You may have already figured it out, but the numbers you are seeing on the macro names represent argument counts, not cloned copies. Looking at your screenshot, everything looks correct to me. You can always download the app if you want to review the configuration files it includes.

john0499 commented 4 years ago

Well, that was interesting. We had been planning the Splunk 8 upgrade with support for weeks. I received an email saying it was happening last night; it turns out that email was from "fleetwide" and not related to my support case. Fleetwide enforces stack upgrades on you, apparently, even when you're in the process of carefully planning one.

So since this was an unplanned upgrade, we are now running 8.0.2006, but still with Alert Manager 2.2.2. Everything is working, the roles are visible, and I still have the option to add/remove the edit_tcp capability.

john0499 commented 3 years ago

Upgraded to 3.0.4 last night. Initially, Incident Posture wasn't displaying correctly; it turned out to be a browser cache problem with the .js files. There's no way to do a _bump in Splunk Cloud, so I just had to clear the cache.

Seems to be working OK now, and I still have the edit_tcp capability available and assigned to the alert manager roles.