briandelmsft / STAT-Function

Azure Function for the Microsoft Sentinel Triage AssistanT (STAT)
https://aka.ms/mstat
MIT License

[Support] Base Module is Failing #51

Closed mikedizzle closed 7 months ago

mikedizzle commented 7 months ago

Is there a direction you can point me in for the Base Module consistently throwing a 503 error? I have connected to the log stream in the function app, but that doesn't seem to produce any output.

briandelmsft commented 7 months ago

Hi @mikedizzle, a 5xx series error typically indicates a problem with the function itself, or with the custom connector's ability to reach the function.

Here are a few things that I'd recommend looking at:

  1. Is the function app running?
  2. Is the function app's application setting WEBSITE_RUN_FROM_PACKAGE pointed at a version of STAT, and can you download the file successfully?
  3. Have any network access restrictions been set on the function app (Settings -> Networking -> Access restrictions)? Is public access enabled? Is the unmatched rule action set to Allow? Have any IP ranges been added to the list? Access can be restricted here, but some restrictions will break functionality.
  4. Is the custom connector pointed at the right function app? Locate the Logic Apps custom connector object and click Edit in the toolbar on the overview page. On the first page that opens (General), look at the very bottom for the host property. Is the host name the same as that of the function app you deployed?
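
Checks 2 and 4 in the list above can also be scripted. The following is only a rough sketch using the Python standard library: the function names, the `opener` parameter, and the example host names are assumptions made for illustration and are not part of STAT. It also assumes WEBSITE_RUN_FROM_PACKAGE holds a URL to the zip.

```python
import urllib.request
from urllib.parse import urlsplit


def package_is_downloadable(package_url: str, timeout: float = 30.0,
                            opener=urllib.request.urlopen) -> bool:
    """Return True if the URL from WEBSITE_RUN_FROM_PACKAGE answers with HTTP 200."""
    try:
        with opener(package_url, timeout=timeout) as resp:
            resp.read(1)  # pull one byte to confirm the body is actually served
            return resp.status == 200
    except OSError:
        # HTTPError, URLError and socket timeouts are all OSError subclasses
        return False


def same_host(connector_host: str, function_app_url: str) -> bool:
    """Compare the connector's host property with the function app's host name."""
    # urlsplit() only finds a hostname when a scheme is present, so fall back
    # to the raw value when a bare host name was passed in.
    app_host = urlsplit(function_app_url).hostname or function_app_url
    return connector_host.strip().lower() == app_host.strip().lower()


# Example (hypothetical names):
# same_host("myfunctionapp.azurewebsites.net", "https://myfunctionapp.azurewebsites.net")
```

A `False` from `package_is_downloadable` only says the download failed from wherever the script runs; it does not prove the function app itself cannot reach the package.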

mikedizzle commented 7 months ago

  1. Yes
  2. Yes
  3. No
  4. Yes

It does appear that the function app just kinda died. There's no more diagnostic logging being sent (last entry was 11/13 at noon). We aren't aware of anything changing. If you browse to it directly it says 503 as well.

While I could download the release file, I changed the WEBSITE_RUN_FROM_PACKAGE to 1.5.5 instead of 1.5.10 just to see and that didn't help either.

briandelmsft commented 7 months ago

@mikedizzle I am seeing some random 500s in my lab, not 503s though. There may be an issue with the GitHub hosting of the zip. Can you copy the zip into your own Azure blob storage, generate a read-only shared access signature for it, and then use that URL in WEBSITE_RUN_FROM_PACKAGE? When I did that, my random 500s went away.

mikedizzle commented 7 months ago

No dice for me. I'm not sure how I can get some actual logs out of the function app to tell me where it's failing.

briandelmsft commented 7 months ago

@mikedizzle you can add Application Insights to the function. However, given the 503, the function may not be executing at all, so Application Insights may not log much. To add it, click the 'modules' trigger on the overview page, then click Monitor, where you'll see the option to add it.

If you try to invoke the function via the web browser, how long does it take to fail?

https://functionname.azurewebsites.net/api/modules/test?code=<functioncode>

It should return a 400 with a body like this: image

I assume it will throw a 503 for you, but I'm curious how long it takes to fail.
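
To measure how long the request takes to fail without relying on a browser, something like the following sketch could be used. It relies only on the Python standard library; `build_test_url`, `timed_get`, and the `opener` parameter are hypothetical names chosen for illustration, and the app name and function code are placeholders you would supply yourself.

```python
import time
import urllib.error
import urllib.request
from urllib.parse import quote


def build_test_url(app_name: str, function_code: str) -> str:
    """Build the test URL from the thread; both arguments are placeholders."""
    return (f"https://{app_name}.azurewebsites.net/api/modules/test"
            f"?code={quote(function_code, safe='')}")


def timed_get(url: str, timeout: float = 300.0, opener=urllib.request.urlopen):
    """Return (status, seconds); status is None if no HTTP response arrived."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx and 5xx responses (400, 503, ...) land here
    except OSError:
        status = None      # DNS failure, refused connection, or timeout
    return status, time.monotonic() - start


# Usage (performs a real network call):
# status, seconds = timed_get(build_test_url("functionname", "<functioncode>"))
# print(status, round(seconds, 1))
```

Under the behavior described above, a healthy function would be expected to report 400 here, while the failing one would report 503 together with the elapsed time.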

mikedizzle commented 7 months ago

@briandelmsft I have to admit I am not a developer. I wouldn't know what goes in <functioncode>. It's also super weird that it just stopped working. When I try to do a new deployment, it looks like it fails on creating the functions. I get the following error on that deployment step:

{
  "code": "BadRequest",
  "message": "Encountered an error (ServiceUnavailable) from host runtime.",
  "details": [
    { "message": "Encountered an error (ServiceUnavailable) from host runtime." },
    { "code": "BadRequest" },
    {}
  ]
}

I tried both with the files from GitHub and with a copy in my own Azure Storage.

It's almost like there was an Azure update that broke things.

briandelmsft commented 7 months ago

@mikedizzle you can get the function code by going to the function app, under Functions -> App Keys, copy the 'default' value.

That's very strange that you can't create a new deployment either, I just tested one without issue. What datacenter region are you deploying to?

Were you able to get Application Insights enabled? It is helpful in troubleshooting 500 errors, but I'm not sure how much insight it provides into a 503.

Once enabled (or if you haven't, you do it in the same place), go to invocations and more: image

Locate a failing run and click into it, hopefully a 503 shows up: image

and then provide the full details output (feel free to redact anything if there happens to be anything sensitive): image

I've tried a few ways to introduce a 503, by intentionally adding an unhandled exception that would cause the function to terminate without a response, and by removing all the dependent packages from the zip file. Both gave me 500 errors only.

mikedizzle commented 7 months ago

I'll try this later tonight or tomorrow. I am trying to deploy to US West 2. That's where most/all of my Az resources are.

mikedizzle commented 7 months ago

It's so weird. No invocations since it died on 11/13 right around 20:00 UTC. The URL test from the post above gives the same "service unavailable" result. So very weird.

image

briandelmsft commented 7 months ago

@mikedizzle Sure looks like a Functions issue. If there are no invocations, the STAT code isn't even being executed. And if you get the same behavior through the browser, it's not an issue with the Logic Apps connector either.

I would recommend doing a resync on the function trigger. I don't think there's a way to do this via the GUI, it's documented below but I can put something together to make it easier to do tomorrow.

https://github.com/Azure/azure-functions-host/wiki/Admin-API

Also, no issues for me deploying to West US 2.

briandelmsft commented 7 months ago

@mikedizzle The easiest way to do the resync would be with something like curl, but any HTTP client could work; just make sure to specify the POST method. For this you'll need the 'master' function key, not the default one. It's found in the same place as the default key.

curl -X POST "https://{functionappname}.azurewebsites.net/admin/host/synctriggers?code=<master_key>"
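
If curl isn't available, the same POST can be sent from any HTTP client. Here is a minimal sketch with the Python standard library, assuming the same URL shape as the curl command above; `build_sync_request` is a hypothetical helper name, and the app name and master key are placeholders.

```python
import urllib.request
from urllib.parse import quote


def build_sync_request(app_name: str, master_key: str) -> urllib.request.Request:
    """Prepare the sync-triggers call; method='POST' is the important part."""
    url = (f"https://{app_name}.azurewebsites.net/admin/host/synctriggers"
           f"?code={quote(master_key, safe='')}")
    # An empty body is enough here; the method is set explicitly to POST.
    return urllib.request.Request(url, data=b"", method="POST")


# Usage (performs a real network call; a 503 raises urllib.error.HTTPError):
# with urllib.request.urlopen(build_sync_request("functionappname", "<master_key>")) as resp:
#     print(resp.status)
```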

mikedizzle commented 7 months ago

@briandelmsft Ok. That produced the same 503. Sounds like I need to open a ticket. It's so very weird that I cannot even deploy a new instance.

mikedizzle commented 7 months ago

@briandelmsft

Does this show a runtime version in your deployment(s)?

image

briandelmsft commented 7 months ago

> @briandelmsft Ok. That produced the same 503. Sounds like I need to open a ticket. It's so very weird that I cannot even deploy a new instance.

Yes, I think that would be the best route. Something is fundamentally wrong with the HTTP listener, but it's very strange that the problem is not specific to this one instance of the function and affects new deployments as well.

> Does this show a runtime version in your deployment(s)?

Yes. It shows --- for a few seconds while it's loading, but after about 5 seconds I get a version.

image

mikedizzle commented 7 months ago

@briandelmsft Yeah, it's weird the version never shows up for me. I'm going to close this as it seems to be my issue. Thanks for your help.

mikedizzle commented 7 months ago

@briandelmsft - it deploys fine to a region other than West US 2 for me. Go figure. So odd. If I get a resolution from MS support, I'll post back here.

mikedizzle commented 7 months ago

@briandelmsft - So there was a health alert as a result of my ticket. Apparently this had been affecting a handful of subscriptions in a few regions starting on 11/09. Thanks for taking so much time with me. Lesson learned is that I should start with an Azure ticket. See below:

Preliminary Root Cause: We have identified that this issue was caused due to a code regression on a dependent Service, which led to compatibility issue between multiple code versions running on the scale unit.

briandelmsft commented 7 months ago

@mikedizzle interesting, thanks for the update. Glad to hear you got to the bottom of it.