Azure / Azure-Sentinel

Cloud-native SIEM for intelligent security analytics for your entire enterprise.
https://azure.microsoft.com/en-us/services/azure-sentinel/
MIT License
4.58k stars · 3.01k forks

Logic apps that are triggered by 'Incident Creation' at times get stuck - and do not execute on trigger #10106

Closed Kaloszer closed 6 months ago

Kaloszer commented 8 months ago

Describe the bug Rarely, but it happens regularly: a consumption logic app that has the following trigger:


{
  "type": "ApiConnectionWebhook",
  "inputs": {
    "host": {
      "connection": {
        "referenceName": "azuresentinel"
      }
    },
    "body": {
      "callback_url": "@{listCallbackUrl()}"
    },
    "path": "/incident-creation"
  }
}

gets 'stuck' and no longer fires. The only way to resolve this is to disable and re-enable the logic app. This can cause a lot of damage when a logic app that is supposed to send an escalation does not do so. There is no information that would indicate the trigger was attempted and failed, so no alerting is possible - this is a critical issue that we have had for quite some time but had not reported until now.

(screenshot: logic app run and trigger history)

There are a lot of missed triggers, and only after the logic app was restarted did it start picking up new ones (note the date differences between actions; I restarted it on 07.03, and that is where you can see it start to pick up the slack again).

This seems to be an issue with consumption logic apps themselves, so I'm not sure whether this is the right place to raise it, but it should raise awareness of this critical issue.

To Reproduce Steps to reproduce the behavior:

  1. Have a long-standing recurring logic app that does 'something'
  2. After a few months of working fine it gets 'stuck'
  3. It no longer fires triggers
  4. Restart the logic app and it starts working again

Expected behavior A logic app that has an event trigger should not get stuck :)

Screenshots See above

Additional context This is not a one-time issue; it has happened multiple times over the past year, causing SLA breaches and missed critical incidents. We can't afford to have this be ignored.

v-muuppugund commented 7 months ago

Hi @Kaloszer, thanks for flagging this issue. We will investigate it and get back to you with updates by 13Mar24. Thanks!

v-muuppugund commented 7 months ago

Hi @Kaloszer, we still need some more time for detailed analysis; we will get back to you.

v-muuppugund commented 7 months ago

Hi @Kaloszer, we tried to replicate the issue generically, but we are unable to keep a logic app running for that long. Can you please share the logic app details so we can try to set it up and replicate it on our end?

Kaloszer commented 7 months ago

@v-muuppugund You would need to be running this for months to eventually hit that issue. A week is nothing for this issue.

It eventually happens for pretty much every PB (playbook) that has that particular trigger, so it doesn't matter what you have after it, even if it is just a print-text action. It is, however, critical, as it causes logic apps that would otherwise execute to do nothing at all. And there's no trigger failure either - ergo you can't alert anyone to restart the PB - because, well, nothing is seemingly wrong with it.

PS: As I mentioned, I'm not sure if this is isolated to this particular trigger - as we've only experienced it there. But you can probably reach out to the logic app team and see if they're aware of that issue.

v-muuppugund commented 7 months ago

@Kaloszer, sure. We will check whether there is an existing similar issue on the Logic Apps side and update you.

bisskar commented 7 months ago

If one of your blocks relies on the System Alert ARM ID, e.g. using it in a KQL query inside the logic app, there is a known issue where, when an incident is created, it takes up to 2 minutes to fully "connect" the incident to its alerts, which results in errors. It would be clearer if you shared your logic app blocks. The workaround for this is to put a 2-minute delay right after the incident trigger.

Kaloszer commented 7 months ago

@bisskar But the issue here isn't that it takes 'some time to connect' or invoke. The issue is that it simply refuses to acknowledge that an incident was created - a trigger should fire and invoke the logic app, but it does not. It is as if the logic app hung and never got the message about the trigger.

Only after you restart the logic app in the portal does it acknowledge that it had received a trigger (albeit only the most recent ones).

There is no way of knowing when that happens, as there is no error in the trigger tab; the only way to find out is when operations tells you that something no longer works. The other way is to monitor whether a logic app has been invoked over the past n days (a sketch of such a watchdog query is included at the end of this comment) - but that's absurd - should I just expect a logic app to not work for a day over the span of a few months? Sorry, but that's not a solution.

Especially when those logic apps are used to automate responses to critical security issues such as an account breach, where e.g. MFA needs to be reset, sessions revoked, etc.

Because it is 'untraceable' as to when it stops working - unless you have a trigger every second ;) - this easily breaches the consumption logic app SLA (99.9% iirc), because unless a user does something about it, it stops working indefinitely. In our case the logic app did not work for OVER 8 DAYS until we restarted it manually - which means the SLA was breached, and even if this were the only occurrence this year, availability would stand at 97.7915%.
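
For reference, a minimal sketch of such a watchdog query, assuming the logic apps' WorkflowRuntime diagnostic logs are sent to a Log Analytics workspace. The table, operation name, and columns below follow the usual consumption Logic Apps diagnostic schema but are an assumption, not something confirmed in this thread:

// Watchdog sketch (KQL): flag logic apps that logged completed runs in the last
// 30 days but none in the last 3 days. Column and operation names can vary by
// environment and diagnostic schema, so treat them as placeholders to verify.
let lookback = 30d;
let quiet_window = 3d;
AzureDiagnostics
| where TimeGenerated > ago(lookback)
| where ResourceProvider == "MICROSOFT.LOGIC"
| where Category == "WorkflowRuntime"
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize LastRun = max(TimeGenerated), Runs = count() by Resource
| where LastRun < ago(quiet_window)
| project Resource, LastRun, Runs, SilentFor = now() - LastRun

Alerting on any rows this returns only tells you that a workflow has gone quiet, not why, but it would surface a hung trigger within days instead of after more than a week.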

v-muuppugund commented 7 months ago

Hi @Kaloszer / @bisskar, I am unable to replicate the issue on my end. If you have log details or traces, please share them with me; otherwise we will reach out to the team on this issue and they will check from the backend.

Kaloszer commented 7 months ago

@v-muuppugund What traces? What logs? I'm sorry, but I don't think this is something I can get, as I'm not an MSFT employee -> the issue isn't about the logic app itself but about the trigger. As I mentioned in my earlier comment:

Because it is 'untraceable' as to when it stops working - unless you have a trigger every second ;) - this easily breaches the consumption logic app SLA (99.9% iirc), because unless a user does something about it, it stops working indefinitely. In our case the logic app did not work for OVER 8 DAYS until we restarted it manually - which means the SLA was breached, and even if this were the only occurrence this year, availability would stand at 97.7915%.

We have no way to tell that it has stopped working; we get no errors in the trigger part of the portal, nor was any outage reported.

v-muuppugund commented 7 months ago

Hi @Kaloszer, yes, got you. I mean we can get some Application Insights logs/traces, so we need to reach out to the backend team to get more insight into this issue.

v-sudkharat commented 7 months ago

@Kaloszer, please let us know your response. Thanks!

v-muuppugund commented 7 months ago

@Kaloszer, as we are unable to replicate the issue, we request that you raise a support case in the Azure portal, so we are closing this issue.

Kaloszer commented 7 months ago

@v-muuppugund @v-sudkharat

Not an acceptable solution, re-open. Again, this is a big issue; it breaches the service's SLA.

v-muuppugund commented 7 months ago

@Kaloszer, sure, reopened the issue.

v-muuppugund commented 6 months ago

Hi @Kaloszer, as discussed on Monday, we will be raising an ICM for this issue and will post updates over email, as per our standard operating procedures. If you still need support for this issue (https://github.com/Azure/Azure-Sentinel/issues/10106), feel free to re-open it at any time. Thank you for your cooperation!

Kaloszer commented 6 months ago

@v-muuppugund Before closing the issue, provide a tracking Id.

v-muuppugund commented 6 months ago

Sure @Kaloszer, we need to raise an internal support case so the team will pick it up; we have had a discussion with the CSS team.

JonnyG365 commented 6 months ago

@Kaloszer we are seeing a very similar issue. I think there is some kind of capacity problem in, at the very least, the North Central US region. This article is very helpful: Monitor the health of your Microsoft Sentinel automation rules and playbooks | Microsoft Learn. Once you have it configured, you can query the Sentinel Health logs and see playbooks failing to even get kicked off (a sketch of such a query is included after the screenshot below).

Here is an example of what we are seeing: (screenshot)
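
A minimal sketch of the kind of health query that article enables, assuming the SentinelHealth table has been turned on for Automation as described there; the resource-type and status values are assumptions to check against your own data:

// Health-log sketch (KQL): surface automation rule / playbook health events
// that did not succeed over the last 7 days. Requires the Automation health
// data described in the linked article; exact value spellings such as
// "Automation rule", "Playbook", or "Success" should be verified in your workspace.
SentinelHealth
| where TimeGenerated > ago(7d)
| where SentinelResourceType in ("Automation rule", "Playbook")
| where Status != "Success"
| project TimeGenerated, SentinelResourceName, SentinelResourceType,
          OperationName, Status, Description
| order by TimeGenerated desc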

Kaloszer commented 6 months ago

@JonnyG365 - in my case the runs that should have occurred do not even show up there, as if the trigger never fired :(. So it would seem the Automation Rule was somehow bound to the Logic App's state - and because it had hung, it never fired. Note that when the logic app was restarted, one of the missing triggers did fire, although it shows not as Fired but as null.

(screenshot)

JonnyG365 commented 6 months ago

@Kaloszer oh good so I've found another problem :)

v-muuppugund commented 6 months ago

@Kaloszer, as we are unable to raise an ICM independently, we reached out to the CSS team and they suggested raising a support case from the Azure subscription so they can assist you.

v-sudkharat commented 6 months ago

Shared a response in this comment: https://github.com/Azure/Azure-Sentinel/issues/10107#issuecomment-2074880736