user persistence in auth database in between `beforeCreate` and `beforeSignIn` is different in emulator and production

DibyodyutiMondal commented 1 year ago

Reference: https://github.com/firebase/firebase-tools/issues/5327#issuecomment-1370172541

Current behaviour: Suppose I set some persistent claims in beforeCreate blocking function. After that, inside beforeSignIn blocking function, if I use getAuth().getUser(user.uid), it will work in the emulator, but not in production. As the referenced comment notes: In the emulator, the user is persisted to the db in-between the function calls; whereas in production, auth waits for all the blocking functions to complete before it persists the user to the db.

IMO, it would be best that the production behavior matches the emulator behavior and makes writes in-between the function calls. But in case it's not a feasible option to implement that, this difference should at least be mentioned clearly in the documentation of the blocking functions. Because there is no way to catch this bug a dev can ignorantly introduce until after the code is running in production - which is what happened with me. And even after watching the cloud logs, and redeploying multiple times, I was left with the question - "but it works fine on my local pc, so where's the problem?"

DibyodyutiMondal commented 1 year ago

I must say that this issue is turning out to be deeper than expected.

I set up 3 functions for testing:

beforeCreate : BLOCKING sets a persistent custom claim on the user received from the function arguments
beforeSignIn : BLOCKING logs the user and context function arguments. Also sends a request to the auth database to see if the user is added. catches and logs the error so the function completes and user is not blocked.
onCreate: NON BLOCKING does the same thing as beforeSignIn

For the purpose of testing, I created a website and implemented google login. Just an empty page with 2 buttons that call firebase auth's google sign in, and firebase auth signout respectively. This allowed me to clear the token and test multiple times quickly. The website was then deployed to firebase hosting for the production tests. This website was required for the emulator tests as well, because using the emulator ui to create a user only creates a user; which mean that the beforeSignIn function is not called.

After deploying, I set the triggers in the firebase console, then signed in and created a new user, and checked the logs. Here is what I found:

beforeSignIn: a. The user argument does contain the custom claims, both in production and the emulator b. In production, the user does not exist in the auth database. Calling getAuth().getUser(id) results in an error. But in the emulator, the user can be retrieved. And the retrieved user does contain the custom claims.
beforeCreate: a. The user argument does not contain the custom claims in production. However, if we call getAuth().getUser(id), the result will contain the custom claims. But in the emulator, both the user argument and the retrieved user from the db contain the custom claims.

\ It is not possible for me to discern which of these differences in behaviour are intended and which are bugs. All I know is that

it is not documented
it cannot be caught in automated testing
the obvious logical path works in the emulator, but not in production (I have a rather lengthy explanation for this in the next comment in the thread), which means developers will very likely make the mistakes I made.
Having to delegate even a part of the setup process to onCreate completely defeats the existence of the blocking functions, because the client still needs to pick and choose which id tokens it can accept because the user initialization is pending.

\ Following is the code for 3 functions (For anyone copying this: remember to change the region):

import { getAuth } from 'firebase-admin/auth';
import * as functions from 'firebase-functions';

const cloud = {
  functions: functions.region('asia-south1'),
  error: functions.https.HttpsError,
  logger: functions.logger
} as const;

export const authBugTest = {

  beforeCreate: cloud.functions.auth.user().beforeCreate(async user => {
    return {
      customClaims: {
        someClaim: true
      }
    }
  }),

  beforeSignIn: cloud.functions.auth.user().beforeSignIn(async (user, context) => {
    cloud.logger.log('ARGS', 'USER', user);
    cloud.logger.log('ARGS', 'CONTEXT', context);

    const db = await getAuth().getUser(user.uid)
      .catch(e => cloud.logger.error(e));
    cloud.logger.log('DB', 'USER', db)
  }),

  onCreate: cloud.functions.auth.user().onCreate(async (user, context) => {
    cloud.logger.log('ARGS', 'USER', user);
    cloud.logger.log('ARGS', 'CONTEXT', context);

    const db = await getAuth().getUser(user.uid)
      .catch(e => cloud.logger.error(e));
    cloud.logger.log('DB', 'USER', db)
  })
};

DibyodyutiMondal commented 1 year ago

Consider my use case: Assume a brand new user comes in. And depending on some business logic, I must persist some claims on the user. Additionally, I may need to change properties of the user, and not just the ones allowed in the blocking functions. Several of the changes which are part of the entire account setup process require the use of getAuth()... (for example, setting a phone number), but those functions will only work on users already created. In other words - we can't use them for user changes inside the blocking functions.

Assuming we have no idea of the emulator's difference in behaviour, how do we implement this?

beforeCreate: brand new user has come in. You can set some claims, but you can't make any major changes because the user has not been created. So, we set a special accountSettingUp: true claim in addition to the other claims. The documentation says that any claims returned by this function are "persisted" to the user after the function.
beforeSignIn: since I'm signing in, I'm assuming that the user has already been created by this point. Additionally, since the documentation for beforeCreate uses the word "persists", it further encourages the idea that the user has been added to the database by this point. The emulator's behaviour also supports this idea. So we check for the setup claim we set before. If it exists we do the rest of the setup inside beforeSignin, and then remove the claim by overwriting the customClaims returned from the function. It's the perfect solution. All of the setup was completed before the token was returned to client. Amazing!
Except until we deploy to production and realize that this function cannot help with setup at all.
client: We now know we must move the logic to onCreate, but this means we also have to make changes to the client. The client must check for the setup claim in the token received after sign in and ignore it if it exists. It must also poll for the id token until the setup claim disappears, and then continue as normal.
onCreate: Since we already get the user as part of the arguments, no developer would think that we need to call getAuth().getUser(id) to get a fully updated user. In the emulator, the user argument also includes the custom claims, so that encourages the idea that the user passed into the function is updated. So we check for the setup claim, and if it exists, we carry out setup. It works!
Except until we deploy into production, and we realize that setup never runs. Because in production, the user argument in onCreate does not contain the claims that were set in the blocking functions, so the setup flag does not exist. So, even though we get the user as a parameter, we must make a network call to make things work.

It was quite a journey investigating this. A product presentation nearly got delayed hours before the presentation itself - just because the auth feature would not work in production, and that meant, despite being feature-complete for the milestone, the entire app was essentially in the bin and we couldn't figure out why.

Coffee flowed through the corridors, wars were fought, and tears were shed, over completely unrelated lines of code. But the look on people's faces when we found out it wasn't us that was the problem? And that the fix - after hours of mad debugging just before a presentation - was just 1 line of extra code? Priceless! 😂

So...

This was done intentionally as a means to simplify the emulator logic, and we accepted that this would cause a divergence between production and emulator environment. If that is causing any issues for you, feel free to request a feature change / bug for that. Thanks!

Yep, it definitely caused issues for us.

lisajian commented 1 year ago

Thanks for raising your concern. The documentation for the Auth Emulator has been updated to describe this difference in behavior. I'll raise your suggestion about how production behavior should match emulator behavior with the respective team. For other folks viewing this thread and that also want this feature request, including a 👍 on this issue can help us prioritize adding this to the roadmap.

DibyodyutiMondal commented 1 year ago

@lisajian The documentation should also mention that the user argument passed into onCreate will not contain claims set in the beforeCreate/beforeSignIn functions - they MUST query getAuth().getUser() for the real picture.

(in production)

onra2 commented 1 year ago

my blocking functions don't even trigger in emulator... onCreate works: export const onUserCreated = functions.region("europe-west1").auth.user().onCreate(async (user) => {}); but beforeCreate doesn't trigger/enter. Is there something that has to be done for it to work in emulator?

EDIT: found my bug, when i add "region("europe-west1")" it's not triggering in emulator (can't test in prod right now). If i remove it, it works fine. Should i also make an issue for this?

firebase / firebase-tools

user persistence in auth database in between `beforeCreate` and `beforeSignIn` is different in emulator and production #5426