ProjectSidewalk / SidewalkWebpage

Project Sidewalk web page
http://projectsidewalk.org
MIT License
84 stars 24 forks source link

Uniform login across cities #1381

Open jonfroehlich opened 5 years ago

jonfroehlich commented 5 years ago

would be nice to have:

Essentially, to the user, everything should look unified even if the backend is modularized and split.

jonfroehlich commented 5 years ago

This is now really important since we have links on each deployment city that links to each other city and when you do switch cities you have to register a new account (which is very strange)

misaugstad commented 5 years ago

Yeah this is one of those large projects we could assign to a talented undergrad.

andrew4699 commented 5 years ago

A simple solution that avoids the mess of making a proper distributed system is to just modify the code for sign in/admin panel so that they connect to all city databases and query all of them. This also has the added benefit of letting us keep our current infrastructure, keeping it easy to deploy different versions of the site to different cities, and keeping inter-city data physically independent.

Another solution that I like less is to have a single database that all cities use. For the tables that need to, add a city column to keep track of the city the user is auditing (although as I'm writing this I'm realizing that we may not even need to keep track of that because things like pano IDs should contain that information). This would sting more as time goes on because this single database will grow too big and/or our throughput will be too large to handle it.

tongning commented 5 years ago

Just gonna toss in my two cents here, I like the solution 1 idea of having each server utilize the other servers to authenticate. While I'm not super familiar with modern authentication mechanisms, I think this might be a good place to utilize JSON Web Tokens (JWT). If my understanding of JWT is correct, it lets us store user details (e.g. username, email address, logged-in status) in a cookie on the client side, but with an added signature from the server so that the user cannot tamper with it. If we use the same signing key across all sites, then I think every site will be able to read/verify a JWT cookie set by any of the others. This would let us avoid having to set up cross-site database access.

The JWT idea was briefly brought up #1347, but I think it might be very relevant to this issue as well.

misaugstad commented 5 years ago

I think the first solution sounds decent... And I don't think that checking all other servers for authentication would take too much time..? Some things that we should keep in mind while implementing:

  1. We need to make sure that if one server goes down, it doesn't prevent all the others from authenticating?
  2. What do we do about usernames or email addresses that currently have an account on multiple servers right now?
  3. Lots of data that we record in the database has to be tied to a user_id... So I think they have to be tied to some credentials in their own database.

Building off that last point, what if instead of checking all the servers for credentials when someone logs in, we just create an account with the same credentials on every server when someone signs up?

andrew4699 commented 5 years ago
  1. What do we do about usernames or email addresses that currently have an account on multiple servers right now?

I imagine the number of people with duplicates isn't that high if you exclude the Sidewalk team.

We could do this in 2 steps where in step 1 we prevent people from making duplicates across the servers and send people with current duplicates an email with instructions on changing their email/username, and in step 2 we change the email/username of some duplicate users and give them instructions to recover their account.

  1. Lots of data that we record in the database has to be tied to a user_id... So I think they have to be tied to some credentials in their own database.

Building off that last point, what if instead of checking all the servers for credentials when someone logs in, we just create an account with the same credentials on every server when someone signs up?

That's a good point. I like the idea of replicating their accounts on all the databases. That would solve this user_id issue and make logging in faster since they would only need to check 1 database. Something to consider is that now our change password/make user an admin/delete account/other user-updating actions have to update all the servers. We would also have to coordinate unique account IDs across all of our servers.


Point 3 also brings up a lot of concurrency issues.

It sounds like if we let ourselves have 1 master server, we could solve all of these by making it generate unique IDs and act as a "lock manager" for accounts. I'm not great with distributed systems though so that might be an awful solution.

Perhaps Postgres has some fancy replication features that enforce consistency?

misaugstad commented 5 years ago

I imagine the number of people with duplicates isn't that high if you exclude the Sidewalk team.

Probably right. I just did a quick check and found 3 non-researcher email addresses that are present in both the DC and Seattle databases.

It sounds like if we let ourselves have 1 master server, we could solve all of these by making it generate unique IDs and act as a "lock manager" for accounts. I'm not great with distributed systems though so that might be an awful solution.

This sounds the most promising to me. I'm also not an expert on distributed systems, so I think it would be best to research what others are doing in this regard (stack exchange, for example?). I feel like a dedicated authentication server (that then propagates the info out) may be our best bet though. Although the single point of failure would prevent users from authenticating everywhere, hopefully the authentication server wouldn't be doing too much and would be unlikely to fail frequently :) It seems like a central server for authentication would solve a lot of potential concurrency issues.

Again, I think researching what others do in our situation is really important. If you can't find answers online, don't hesitate to post on StackOverflow, etc.

Also would love to know if @athersharif has any thoughts about this thread?

andrew4699 commented 5 years ago

Gonna take a break from this one since the quarter is almost over. I think when we continue it, we should try to keep it simple since it only is affecting a small number of users. We could also do a simple version first and then transition to a properly distributed system once we need to handle more users.

jonfroehlich commented 1 year ago

This came up again for our Taiwan deployments where it particularly makes sense to have a unified login:

Btw, we noticed that each server requires creating an account, plus the accounts are separate between the test server and the production. Is it possible for you guys to set it up so we don't need to create separate accounts for each?

misaugstad commented 10 months ago

After our most recent PR (#3429) and the accompanying server-side changes, this should soon be possible! We are now going to have a single database, where each city is in it's own schema. Since all cities will be sharing a database, we should be able to add an additional schema for user authentication, and every city should be able to share it!

There will be some pain as we attempt to merge accounts for the same user that were created in multiple cities (possibly with different usernames/emails/passwords), but the pain will be worth it!!

davphan commented 5 months ago

A couple clarifying questions:

I feel like a dedicated authentication server (that then propagates the info out) may be our best bet though. Although the single point of failure would prevent users from authenticating everywhere, hopefully the authentication server wouldn't be doing too much and would be unlikely to fail frequently :) It seems like a central server for authentication would solve a lot of potential concurrency issues.

  1. Now that everything has been relocated to a single database, could I approach this as having a single authentication schema which stores unified login data and references existing logins from each city schema? Then, the sign-in/sign-up pages will all use a centralized login system which references this new authentication schema?

uniform backend dashboard (to track stats across cities)

  1. Would it be better to modify the existing user dashboard to include all city data? Or a new dashboard so users will have a per-city dashboard AND a global dashboard to view?

  2. How much of the backend data will be affected by this PR? For example, will leaderboard stats also need to be combined for each user across cities? Or will it be limited to just login and user dashboard data being unified?

misaugstad commented 5 months ago

could I approach this as having a single authentication schema which stores unified login data and references existing logins from each city schema? Then, the sign-in/sign-up pages will all use a centralized login system which references this new authentication schema?

Totally agree that we should have a single authentication schema within the same database. The question is whether this authentication system references existing logins from each city's schema... I think that what we ultimately want to do is to transfer the logins to a central schema, and then fully remove the authentication data from the schemas for the individual cities.

Would it be better to modify the existing user dashboard to include all city data? ... How much of the backend data will be affected by this PR?

Let's keep the PR as small as possible. We can create new issues for things like unified view of user data, unified admin data, Gallery, etc. I think that this PR should focus solely on the authentication:

  1. Signing up
  2. Logging in
  3. Resetting passwords
  4. Automatically merging existing accounts from different cities with the same email address.
davphan commented 5 months ago

@misaugstad Here's some notes on what I've learned about the authentication system with Play Silhouette that seem relevant to this ticket:

SilhouetteModule.scala

This file defines the Silhouette Environments, Services, and Providers in the authentication system.

Environment

The Silhouette environment is an object which defines the active user, authenticator service, credentials provider, and event bus.

AuthenticatorService

Uses the SessionAuthenticator implementation which generates a session with an active user after the credentials of the user are verified. This session is used to track all subsequent actions by the user and stored in a Play Session Cookie. (to me it almost more like an authorization tool rather than authentication?)

CredentialsProvider

Checks the credentials given by the user and authenticates it by checking against the credentials stored in database.

The diagram below illustrates how the system connects. The green blocks are the database tables that the authentication system interacts with, and the orange blocks are the implementations of the DAOs.

image

Some things I couldn't figure out about that seem relevant:

Silhouette framework documentation Silhouette Framework Implementation

davphan commented 5 months ago

Here's a plan for tackling this ticket and some questions:

  1. Create a new authentication/login schema with the following tables using the same table layout as the other server schemas:
    • user_password_info
    • login_info
    • user_login_info
    • sidewalk_user
  2. Change all references to these tables within the authentication scheme to refer to this new schema rather than the current server-specific schemas. This should ensure that signup, login, and forget password functions the same as before, just in a new schema.
  3. Migrate data from server-specific schemas to new login schema.
    • One approach could be to gradually move data once people sign in, like have a check on each sign-in along the lines of:
      if (login not found) {
      check all schemas for login info
      if (login info found) {
      move to new schema
      delete from old schema
      } else {
      throw InvalidLoginException
      }
      } else {
      login
      }

      However, this seems slow and adds unnecessary runtime on every failed login attempt, and also keeps old unused account data in the old schemas.

    • Another approach could be to have a script similar to the label clustering code where it runs periodically in the background, checking if there is login data in the old schemas and moving that data to the new schema?

      Automatically merging existing accounts from different cities with the same email address.

    • Does this process have to be automatic/repeated? Or can we do a one-time transfer of all login data to the new schema and then drop the tables from the old database?
misaugstad commented 5 months ago

Here's some notes on what I've learned about the authentication system with Play Silhouette that seem relevant to this ticket:

Thank you so much!! Will take a look later!

Create a new authentication/login schema with the following tables using the same table layout as the other server schemas

I like it!

Change all references to these tables within the authentication scheme to refer to this new schema rather than the current server-specific schemas

I also like it!

Automatically merging existing accounts from different cities with the same email address.

Does this process have to be automatic/repeated? Or can we do a one-time transfer of all login data to the new schema and then drop the tables from the old database?

I imagine that this is a one-time thing! I think that I would make a database dump as a backup for every city, then we do the one-time transfer and drop the tables from the city-specific schemas. Ideally the application should function like it's had unified login from the very beginning, so doing a one-time transfer and not having to continually check for data in the old schemas is ideal!

davphan commented 4 months ago

Offboarding soon so here's an update on where this ticket is at:

Commit with most recent changes: https://github.com/ProjectSidewalk/SidewalkWebpage/commit/9999808e80d7d8dccfd14c23dbcc8e63fbbdc5fe

Done so far:

To-do

  1. Debug current set-up of moving login data from ONE database to the Login Schema and seeing if the webpage still works.
  2. Move MULTIPLE city's login data to the Login Schema a. Duplicate users (duplicate emails) in cities should use login info from most recently logged into user b. Check that login works the same in each city c. Check that "Forget Password" works the same in each city
  3. Reset database dumps and test running PS with multiple databases to make sure that data is moved properly

Additional Notes

misaugstad commented 4 months ago

Thank you for this, this is super helpful!! Some other stuff I thought of while reading this:

@davphan one question: Why are the evolutions all split into different files? Is that just to make it easier to code and understand what's going on? Or was there a more practical reason?

davphan commented 4 months ago

Specifically, we have an automated process that saves a token when users try to reset their pw, and then the token gets deleted after some amount of time. This is happening in every city, and we should figure out how we want to consolidate them.

That makes much more sense now, there's an auth_tokens table accessed in the same DBTableDefinitions.scala file which I tried playing around with and couldn't figure out how/when it was accessed/modified, so I assumed it was an artifact of the Silhouette framework. That table is probably where'd I'd start looking then to consolidate that token information, and that table definition:

https://github.com/ProjectSidewalk/SidewalkWebpage/blob/7efb0e96a82f5caef99a977fde8aa97de5dab4c1/app/models/daos/slick/DBTableDefinitions.scala#L47

will need to be modified, on top of adding that table to the Login Schema and any other plain SQL queries that reference that table.

Why are the evolutions all split into different files? Is that just to make it easier to code and understand what's going on? Or was there a more practical reason?

I started off having everything in one SQL file, but there were issues when starting the website that seemed like the queries weren't being run in order (like the table creations would fail saying the schema had not been created, etc.). When I split queries up so that the necessary preceding queries were in previous SQL files, those errors disappeared. I'm not super familiar with how evolution files are compiled in scala so I'm not sure if that fixed it or if there was something else involved that I'm not aware of.

misaugstad commented 1 week ago

Thank you so much for all of your work on this @davphan! You really were 90% of the way there in terms of getting the schemas set up and such! I've got it working on my local dev environment when moving the data from a single city into this schema. Now it's all about merging the existing data!

Just to continue to remind myself: I haven't tested password resets yet. Signing in/up/out is working correctly though!

Unfortunately, I've come to the conclusion that we're not going to be able to do this automagically through an evolutions file. We're going to have a short downtime when I will migrate the data to the new centralized authentication schema. Assuming that nothing goes wrong, the downtime should be incredibly short!

The plan right now is to write all auth data from each city to a CSV, write a Python script that will merge the data and output a new CSV, then import that data into the new schema.

Thankfully, I can essentially do that whole process locally using the production data as it exists now, and I can figure out every edge cases that we currently have. New edge cases don't pop up frequently, so I should be able to fully test everything before we actually try it out on prod. I've already started working through some edge cases and cleaning up some anomalies in the data.

And then we can of course do the migration of the data on the test servers as a dry run of prod. It won't have all the data from prod (though I suppose that we could copy all the prod dbs over if we really wanted to :thinking:), but it will give us the opportunity to debug any issues related to database permissions, etc. that I wouldn't be able to test on my local machine.

I'm going to be incredibly careful with this, because we're talking about users being able to login! I plan to ensure that absolutely nothing breaks during this migration :)

misaugstad commented 1 week ago

Lots of progress today! I believe that I've fixed all the anomalies in the data, and have the script set up to clean and prep the data. What's left:

  1. Finalize the script: it either needs to write the data to a CSV and then we import the CSVs into the db using psql, or the Python script could write the data into the dbs directly. Haven't really decided which is easier.
  2. Test out password resets
  3. Do a full dry run locally with most of the cities' data
  4. Set things up so that I don't break everyone's dev env through this
  5. Push to test and do a full dry run there
  6. Do the migration on prod