updating timer trigger did not properly reschedule the job

jamesdixon commented 7 years ago

Hi,

I have a job that runs weekly and processes payments. I typically have this job run on Tuesday at 9am UTC, but I had some fixes that I needed to push out and decided to delay it by a day. To do this, I used the debug console in Azure portal to open the function.json file in the online editor and changed Tue to Wed. Here's the resulting file:

{
  "bindings": [
      {
          "type": "timerTrigger",
          "name": "timerInfo",
          "direction": "in",
          "schedule": "0 0 9 * * Wed",
          "runOnStartup": false
      }
  ]
}

To my surprise, I woke up this morning to find that the job had run.

To further complicate things, I'm seeing the following output from the functions dashboard:

Related information

Provide any related information

Running Node.js
Functions runtime v1.0.10690

Update

I just did a test to see if the job "appears" to reload when changing the file and according to the logs, I do see that the functions runtime stopped and then restarted. I'm assuming that this is the job reloading.

Is there something with the way these timer triggers are scheduled that would have caused the previously scheduled job to remain even though I had updated function.json to a new schedule?

Thanks again

brettsam commented 7 years ago

Your job should have been rescheduled.

I can't see the image you posted (looks like Github is having some issues). Would you mind sharing your app name either directly or indirectly? And do you know roughly what time you would have changed your function.json file?

jamesdixon commented 7 years ago

@brettsam thanks for the reply: my site name is scout-api.scm.azurewebsites.net

I made the change around 9pm MDT on Monday, Feb 27th.

Some additional information: it definitely appears the job was cut off in the middle of processing as customers were being charged, but subsequent operations never happened.

jamesdixon commented 7 years ago

Also, the job in question was ProcessInvoices

brettsam commented 7 years ago

This is a Functions app, correct? How did you create it? I'm not seeing any logs in our backend and I'm trying to figure out why...

jamesdixon commented 7 years ago

It’s actually not a straight functions app. It’s the functions runtime running on Azure App Service.

On Mar 1, 2017, 12:32 PM -0700, Brett Samblanet notifications@github.com, wrote:

This is a Functions app, correct? How did you create it? I'm not seeing any logs in our backend and I'm trying to figure out why... — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

brettsam commented 7 years ago

Curious -- did you do anything to explicitly turn off the logs? I thought that we'd still be logging those events even if you deployed it yourself.

But that's not the real issue :-). I have a hunch about what may have happened as I just saw something similar in another bug I was fixing, but I'd like to confirm. Do you have any host logs for this time period? They may be in D:\home\LogFiles\Application\Functions\Host.

Can you verify this was the series of events (times in UTC)?

Function last ran Tue 2/21 @ 9am. Next run scheduled for Tue 2/28 @ 9am
On Tue 2/28 @ 3am, you changed the cron expression to only run on Weds -- expecting that the next invocation would be Wed 2/29 @ 9am
Instead, the timer fired as soon as you changed the value -- on Tue 2/28 @ 3am
This invocation failed (I'm not sure why) -- but it never should have run.
(I'm not sure about this one) -- The next run happened on Wed 3/1 @ 9am

I think the problem here comes from the fact that when the schedule changes, we calculate the next expected invocation based on your last invocation. In that case, we would have calculated it as Wed 2/22 @ 9am (the immediate Weds following your last invocation). When we see that, we think you are "past due" and immediately run the function.

If you have logs around the time you changed the schedule, that'd help verify. But I actually just sent a PR where I called out this behavior and asked if it is by-design. It seems like we need to fix this.

jamesdixon commented 7 years ago

@brettsam sorry for the delayed reply. This turned out to a fairly big deal and required a lot of "manual fixing" to make things right again. There was some very bizarre behavior that I'll go into after I answer your questions.

Your series of events is mostly correct. The timer didn't immediately fire after I made the change, it simply stuck with the existing schedule and fired at 9am UTC on 2/28. After I saw this happened, I reverted my function.json back to Tuesday's to avoid having it run again on Wednesday. Hopefully, this won't screw things up for this upcoming week!

I mentioned some very bizarre behavior that occurred after this job ran. The first thing you are going to think it "user error," which is also what I thought. I figured I must have done something wrong, but all evidence is pointing elsewhere. Let me try and break this down without making it overwhelming:

As you already know, this job that processes payments runs every Tuesday at 9am UTC. In our particular situation, we have two environments setup on Azure App Service -- staging and production. When I do a deployment, I simply swap the environments. We use App Service's application settings (env variables) to handle anything environment specific.

With that in mind, here's what I noticed after this job ran:

Emails receipts were sent to customers indicating that they had paid or they payment had failed. This is normal except for the fact that we noticed that all of the emails went to our test email address and not the customer's email address. What's fishy about this is that email address is set as part of the application settings and ONLY exists for staging. We have an Email job that runs based on a queue trigger. All of the emails we send use this job. When the payment receipt emails were sent, they appeared to be using staging settings, but all other emails sent after this job was run were sent as expected, using production settings.
I checked Braintree, our payment provider, and sure enough, customers HAD been charged! The interesting thing here is we use an application setting to toggle between Braintree production and sandbox. Staging uses sandbox and Production obviously uses production.
Given the above, I started thinking that maybe somehow I accidentally applied the application settings from staging to production. I checked in Azure and all looked good!
The job was marked with a "never finished" status. I'm not sure exactly what this means, but I found things that appeared to be incomplete in the job. One of them is that our database is updated with payment transactions after we process the sale via Braintree. There was no record of this in our production database. However, and I didn't think of this until after I had restored our staging database, but I'm thinking that it may have not been in the production database because it somehow used the application setting for staging, which points to our staging database. I can't verify this, but given the other weirdness above, I think it's likely.
Last but not least, I checked the webjob dashboard on staging and the last time this job was run was back on 2/13, so the job definitely didn't run there.

I suppose one of my questions is there any sort of shared configuration/memory/space used by the functions runtime? Is there anything you can think of that could have caused this bizarre set of circumstances?

Hopefully, I've convinced you or have you at least considering that this wasn't user error. After this job ran, everything ran normally, but it definitely did some damage 😡 I should also mention that this job hasn't changed -- it's been running for 4 weeks with no issues.

Sorry for the novel and thank you!

brettsam commented 7 years ago

Let's try to break this down:

You said that the timer didn't fire immediately after you made the change, but in your screenshot above, it says the timer fired on 2/28 @ 2:57am -- which (I think) corresponds to the 2/27 9pm MDT time that you said you made the change to the schedule. That's assuming MDT is -6 from UTC. Are you saying that it also ran at 9am UTC later that day?

I'll try to think through the rest...

jamesdixon commented 7 years ago

@brettsam you are correct -- it does appear to have fired immediately. The confusing thing is that all of the emails that go out after this job is processed arrived in my inbox shortly after 2am MDT on 2/28, which is what I would expect if the job was processed on it's normal schedule of 9am UTC (MDT is -7). Technically there could have been a delay in processing the emails through my provider, but the timing would have to be very coincidental. Plus, I checked Sendgrid's incident history and there is nothing related to delayed emails on that day.

Long and short of it is a ton of bizarre things happened with this change!

jamesdixon commented 7 years ago

Btw, I forgot to answer your question about disabling logging. I'm not aware of anything I'd have done to disable logging; is there a place I should look?

brettsam commented 7 years ago

It sounds like your staging environment was still configured to run at the same time, which would explain some of what you saw, but not everything. I'll keep thinking through what could have happened -- it's possible that your production job kicked off when you changed the schedule but only partially ran before crashing -- which explains the 'never finished' dashboard log as well as some missing details in your database. But some other things don't really make sense. Do you have any logging going elsewhere that may have captured any exceptions that the 'never finished' job may have thrown?

As for our backend logging -- how did you deploy your function? Did you use these steps: https://github.com/Azure/azure-webjobs-sdk-script/wiki#option-2-no-http-functionwebhook-support? That would explain why I don't see your logs in our backend.

jamesdixon commented 7 years ago

Yeah, I never changed the staging schedule, so could be a possibility. I'll see if I can find any other exceptions that may have occurred.

And yes, that's how the functions are deployed. I'd like to move it over to a separate functions app at some point.

brettsam commented 7 years ago

I've just committed a change for your initial issue (the timer firing when you changed the schedule): https://github.com/Azure/azure-webjobs-sdk-extensions/pull/195.

I'm going to close this issue -- but if you're able to find any more logs that indicate what may have happened with the staging/prod issues, please open a new issue and we'll tackle it there. My current theory:

Your staging environment still had the old schedule, so it ran.
When your schedule changed, we incorrectly ran your function immediately (which is the fix above). Somewhere in the middle of processing, your function died, leaving your backend data in an unexpected state.
That doesn't explain why your staging environment didn't show that the timer had fired in the dashboard, though... that's what still confuses me.

jamesdixon commented 7 years ago

@brettsam sorry I never replied to this. I appreciate you submitting the fix.

A quick question: if I manually run the job, will it affect the scheduling of future jobs? I'm asking because our job didn't fully complete when it ran on its normal schedule last night. It was an error in our code, but I need to run it again. However, I'm a bit gun shy to run it manually as I'm afraid of affecting the future schedule.

Thanks! James

Azure / azure-functions-host

updating timer trigger did not properly reschedule the job #1246

Related information