firebase / firebase-admin-java

Firebase Admin Java SDK
https://firebase.google.com/docs/admin/setup
Apache License 2.0
543 stars 269 forks source link

Performance Issue - API calls are taking 3-5 secs #430

Closed deepfriedbrain closed 4 years ago

deepfriedbrain commented 4 years ago

Operating System version: MacOS 10.15.5 Firebase SDK version: 6.12.2 Library version: 6.12.2 Firebase Product: auth

I've migrated from Google Identity Toolkit to Firebase Auth. I'm using this library on my server, which is hosted on Google App Engine (Java).

I've observed that very often, a simple call auth.verifyIdToken has been taking 3-5 seconds (yes 3000-5000ms)!

Usually it would take less than 10ms, but very frequently, it would take 3-5 seconds to complete. It's happening way too frequently to ignore especially in the past few hours.

auth.getUser consistently takes even longer. The minimum that this API takes is about 700ms, which itself is a lot for the best case scenario.

Are you guys aware of this problem? I have the logs with timestamps captured. I can provide specific details you need.

hiranya911 commented 4 years ago

Did some rudimentary tests and all seems fine as far as I can tell. Tests were run on a laptop on my home wifi so the rpc latencies are a bit on the high side, but still looks quite reasonable.

Test 1

Following test calls verifyIdToken() 1000 times, and reports a performance summary.

  @Test
  public void testVerifyIdTokenPerformance() throws Exception {
    String customToken = auth.createCustomToken("testuser");
    final String idToken = signInWithCustomToken(customToken);
    DescriptiveStatistics stats = new DescriptiveStatistics();

    for (int i = 0; i < 1000; i++) {
      long time = timeIt(new Callable<FirebaseToken>() {
        @Override
        public FirebaseToken call() throws Exception {
          return auth.verifyIdToken(idToken);
        }
      });
      stats.addValue(time);
    }

    System.out.println("Min: " + stats.getMin());
    System.out.println("Max: " + stats.getMax());
    System.out.println("Average: " + stats.getMean());
    System.out.println("99p: " + stats.getPercentile(99));
  }

  private long timeIt(Callable<?> callable) {
    long start = System.nanoTime();
    try {
      callable.call();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }

    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

And the results I get look quite good (all values are in milliseconds):

Min: 0.0
Max: 72.0
Average: 0.2569999999999967
99p: 3.0

The max value of 72ms is from the very first call to verifyIdToken(). This is the slowest since it needs to fetch public keys and cache them. All subsequent calls completed under 12ms, as they don't require any RPC calls.

Test 2

Following test calls getUser() 35 times and simply reports the time each call took:

  @Test
  public void testGetUserPerformance() {
    for (int i = 0; i < 35; i++) {
      long time = timeIt(new Callable<UserRecord>() {
        @Override
        public UserRecord call() throws Exception {
          return auth.getUser("testuser");
        }
      });
      System.out.println(time);
    }
  }

And the times reported are:

1111 242 224 236 222 779 277
205 207 239 241 213 214 191
205 243 229 264 212 195 227
211 220 229 215 218 221 221
270 197 212 209 207 226 214

Again the first call is the slowest (TCP handshake, SSL handshake, OAuth2 credentials fetch etc). But most of the subsequent calls completed under 250ms, except for one which took 779ms. That seems acceptable for a test conducted over wifi.

hiranya911 commented 4 years ago

Are you initializing a new FirebaseApp instance for each request? Or may be your traffic pattern is sporadic so that App Engine tears down your app between requests? If that's the case, most of your requests are going to see the initial load latency observed in the above tests.

deepfriedbrain commented 4 years ago

@hiranya911 First of all, thanks for investigating into this.

It is not always slow. It is slow intermittently and quite often. Your first test lasts about 2-3 secs, and second for about 30 secs. These are not indicative of real world usage.

I ran the same tests on my home wifi network. For the first test my numbers are better than yours, but for the second, they are much worse.

For verifyIdToken I ignored the first run number (considering it to be part of the warmup), but second run's results were better than yours. However, from third run onwards, all four numbers started to show as 0 all the time. Probably they were rounding off to 0ms.

For getUser, my results are not as flattering as yours. Median is around 700ms as I mentioned in my first post.

2192 715 1304 709 746 749 907 
688 859 742 682 684 678 722 
707 719 705 699 692 699 704 
690 693 1446 701 706 677 677 
688 680 695 688 682 691 714

I've started logging the request stats from production into my cloud datastore. The real picture will emerge soon. Based on the early results, I can tell that getUser is definitely slow!

In production, I've seen some requests taking abnormally long on a warmed up and running instance. Here's a summary of my "unscientific" observations for an app running on F1 instance when there's very little traffic.

FirebaseApp.initializeApp: 250ms (one-time at new instance startup)
FirebaseAuth.getInstance: 200ms for first call; ~0ms subsequently
verifyIdToken: usually <5ms, but intermittently hits 3000-5000ms
getUser: ranges from 250 - 1500ms (the best I've seen is around 230ms)

Let me now address the questions that you've posted.

Are you initializing a new FirebaseApp instance for each request?

No, I'm not. I have a login endpoint, which is defined as a @Singleton and it only gets initialized once. Inside the constructor of that servlet, I initialize the FirebaseApp. Here's my initialization code.

try {
    FileInputStream serviceAccount = new FileInputStream("<path to json>");
    FirebaseOptions options = new FirebaseOptions.Builder()
        .setCredentials(GoogleCredentials.fromStream(serviceAccount))
        .setDatabaseUrl("<firebase database url>").build();

    if (FirebaseApp.getApps().isEmpty()) {
        FirebaseApp.initializeApp(options);
    }
} catch (FileNotFoundException e) {
    // handle error
}

The initialization takes about 250 ms in production, and I've confirmed that it happens only once when a new instance of the app is started.

Or may be your traffic pattern is sporadic so that App Engine tears down your app between requests?

No, this is not the case because I was continuously monitoring the instance and also I have configured some minimum instances to keep running. I had confirmed that the instance was running and I was monitoring the logs when I was seeing the slowness.

If that's the case, most of your requests are going to see the initial load latency observed in the above tests.

I'm aware of this. In my case, after a new instance startup, it takes about 5000 ms to call just verifyIdToken, and subsequently it takes about 5 ms on average.

But the slowness is sporadic, which is why I'm concerned.

hiranya911 commented 4 years ago

These are not indicative of real world usage.

Yes, but we cannot really have a discussion about performance without establishing some sort of a baseline. These tests demonstrate how the SDK normally works.

getUser()

For getUser, my results are not as flattering as yours. Median is around 700ms as I mentioned in my first post.

That's just the rpc roundtrip time from you machine to the Google datacenter. The SDK doesn't add any extra overhead apart from the required json marshaling/unmarshalling time. Is your Firebase project located in a region close to you? You can further verify this by making some direct calls to the REST endpoint.

  @Test
  public void testGetUserRestPerformance() throws Exception {
    InputStream file = new FileInputStream(SERVICE_ACCT_FILE);
    final GoogleCredentials credentials = GoogleCredentials.fromStream(file)
        .createScoped(ImmutableList.of("https://www.googleapis.com/auth/cloud-platform"));
    DescriptiveStatistics stats = new DescriptiveStatistics();
    for (int i = 0; i < 35; i++) {
      long time = timeIt(new Callable<String>() {
        @Override
        public String call() throws Exception {
          return makeRestCall(credentials);
        }
      });
      System.out.print(time + " ");
      if (i % 7 == 6) {
        System.out.println();
      }
      stats.addValue(time);
    }

    System.out.println();
    System.out.println("Average: " + stats.getMean());
    System.out.println("90p: " + stats.getPercentile(90));
    System.out.println("95p: " + stats.getPercentile(95));
    System.out.println("99p: " + stats.getPercentile(99));
  }

  private String makeRestCall(GoogleCredentials credentials) throws Exception {
    HttpTransport transport = Utils.getDefaultTransport();
    HttpRequestFactory requestFactory = transport.createRequestFactory(
        new HttpCredentialsAdapter(credentials));
    GenericUrl url = new GenericUrl("https://identitytoolkit.googleapis" +
        ".com/v1/projects/" + PROJECT_ID + "/accounts:lookup");
    HttpContent content = new JsonHttpContent(
        Utils.getDefaultJsonFactory(),
        ImmutableMap.of("localId", ImmutableList.of("testuser")));
    HttpRequest request = requestFactory.buildPostRequest(url, content);
    HttpResponse response = request.execute();
    byte[] data = ByteStreams.toByteArray(response.getContent());
    String dataString = new String(data);
    if (response.getStatusCode() != 200) {
      throw new Exception("Error: " + dataString);
    }

    return dataString;
  }

In my test environment this results in numbers pretty close to my previous experiment with the SDK:

1065 305 193 224 227 207 246 
216 229 231 228 197 242 226 
227 200 292 311 210 231 198 
226 266 230 221 208 212 233 
728 226 383 531 223 190 212 

Average: 279.8285714285714
90p: 442.1999999999998
95p: 795.3999999999985
99p: 1065.0

Can you run the same test and see if the numbers agree with what you get when calling via the SDK? Feel free to try a different library for making REST calls. I just used google-http-client as it was already in my classpath. I'll try a different library at some point to see if it makes any difference.

getUser: ranges from 250 - 1500ms (the best I've seen is around 230ms)

That somehow already looks better than the numbers from your local machine. Minimum values from your machine are in the 600ms range. Can you check the 90th, 95th and 99th percentiles of this distribution from GAE? It sounds like they are what we'd expect to see with this API.

verifyIdToken()

verifyIdToken: usually <5ms, but intermittently hits 3000-5000ms

How often is "intermittent"? Note that verifyIdToken() caches public keys up to several hours. So every now and then you will see a request taking longer even if everything else was working perfectly. In a dynamic environment like GAE you might see additional slow requests due to other reasons like new instances warming up and the JVM running major gc cycles. Again you should look at the distribution and the tail latencies (90p, 95p, 99p) to be sure. But chances are this is the expected pattern for your environment and the workload.

hiranya911 commented 4 years ago

Also tried making REST calls with the Apache HTTP Client 4.x. It's about 100ms slower but the overall pattern holds.

  @Test
  public void testGetUserRestPerformance() throws Exception {
    InputStream file = new FileInputStream(SERVICE_ACCT_FILE);
    final GoogleCredentials credentials = GoogleCredentials.fromStream(file)
        .createScoped(ImmutableList.of("https://www.googleapis.com/auth/cloud-platform"));
    final CloseableHttpClient client = HttpClients.createDefault();
    DescriptiveStatistics stats = new DescriptiveStatistics();
    for (int i = 0; i < 35; i++) {
      long time = timeIt(new Callable<String>() {
        @Override
        public String call() throws Exception {
          return makeRestCall2(client, credentials);
        }
      });
      System.out.print(time + " ");
      if (i % 7 == 6) {
        System.out.println();
      }
      stats.addValue(time);
    }
    client.close();

    System.out.println();
    System.out.println("Average: " + stats.getMean());
    System.out.println("90p: " + stats.getPercentile(90));
    System.out.println("95p: " + stats.getPercentile(95));
    System.out.println("99p: " + stats.getPercentile(99));
  }

  private String makeRestCall2(CloseableHttpClient client, GoogleCredentials credentials) throws Exception {
    AccessToken token = credentials.getAccessToken();
    if (token == null) {
      token = credentials.refreshAccessToken();
    }

    HttpPost post = new HttpPost(
        "https://identitytoolkit.googleapis.com/v1/projects/" + PROJECT_ID + "/accounts:lookup");
    post.setHeader("Authorization", "Bearer " + token.getTokenValue());
    post.setEntity(new StringEntity("{\"localId\": [\"testuser\"]}", ContentType.APPLICATION_JSON));
    try (CloseableHttpResponse response = client.execute(post)) {
      byte[] data = ByteStreams.toByteArray(response.getEntity().getContent());
      String dataString = new String(data);
      if (response.getStatusLine().getStatusCode() != 200) {
        throw new Exception("Error: " + dataString);
      }

      return dataString;
    }
  }

Results:

1010 314 304 326 376 291 283 
308 310 297 299 299 311 800 
303 314 310 395 409 409 320 
702 321 330 300 301 341 305 
317 310 289 458 305 320 307 

Average: 368.4
90p: 555.5999999999997
95p: 841.9999999999991
99p: 1010.0
hiranya911 commented 4 years ago

I've also tried benchmarking the REST endpoint directly with Apache Bench, and the results are very similar to what we get from the SDK. I've hardcoded an ID token in my script, and therefore the credential fetch overhead is not included in the results.

Script:

ab -n 35 -c 1 -p getuser.json -T application/json -k -H "Authorization: Bearer ${IDTOKEN}" \
  "https://identitytoolkit.googleapis.com/v1/projects/${PROJECT_ID}/accounts:lookup"

Results:

Benchmarking identitytoolkit.googleapis.com (be patient).....done

Server Software:        ESF
Server Hostname:        identitytoolkit.googleapis.com
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-CHACHA20-POLY1305,256,256
TLS Server Name:        identitytoolkit.googleapis.com

Document Path:          /v1/projects/********/accounts:lookup
Document Length:        330 bytes

Concurrency Level:      1
Time taken for tests:   10.640 seconds
Complete requests:      35
Failed requests:        0
Keep-Alive requests:    0
Total transferred:      34615 bytes
Total body sent:        16695
HTML transferred:       11550 bytes
Requests per second:    3.29 [#/sec] (mean)
Time per request:       304.010 [ms] (mean)
Time per request:       304.010 [ms] (mean, across all concurrent requests)
Transfer rate:          3.18 [Kbytes/sec] received
                        1.53 kb/s sent
                        4.71 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       75   81   5.8     79      96
Processing:   154  223 127.1    186     626
Waiting:      154  221 127.3    184     626
Total:        235  304 128.6    268     717

Percentage of the requests served within a certain time (ms)
  50%    268
  66%    271
  75%    276
  80%    287
  90%    371
  95%    709
  98%    717
  99%    717
 100%    717 (longest request)

These numbers (especially the mean, 90p and 95p) are very close to the corresponding numbers measured from the SDK. That's further evidence that the SDK is not adding any unusual overhead on top of the REST call.

deepfriedbrain commented 4 years ago

@hiranya911 Thanks again for your sincere effort. I really appreciate it.

My app engine region is 'us-central'.

Based on about 1000 data points I've collected so far, a few things are clear (though I've not run them through analytics).

  1. verifyIdToken takes about 4500ms on average for the first request on a new instance on GAE. Subsequent requests take less than 5ms and the mode is 2ms. About 2% of the requests take between 50-100ms, but this is not a problem.

Why does it take so long for the first request? Do you consider this normal? The latency is just too high. Is there any way to make it better?

  1. getUser I don't have much data points on this yet as it's only called for unverified users to get the current verification status from the user object. But so far, more than half the requests take between 1000-1500ms on a warmed up and running instance. Average is around 700ms as on my local but the standard deviation is higher.

It may be possible that you don't see the same latency from your location, but I'm clearly seeing a high latency from a Google Cloud server in the US Central region.

I'll continue to monitor and collect more data points for the next 7 days.

hiranya911 commented 4 years ago

Hi @deepfriedbrain. My test project is also in us-central (nam5), and I'm running my tests from US West Coast. I'd expect the results to be similar for any client accessing these endpoints from US.

Why does it take so long for the first request? Do you consider this normal? The latency is just too high. Is there any way to make it better?

I certainly do expect the 1st request to be slower. But the numbers you're reporting is at least an order of magnitude larger than what I'd expect. I don't see why, and I cannot repro such numbers either.

To get another step closer to your setup, I've now deployed a test app in GAE Java 11 standard runtime at https://admin-java-integration.uc.r.appspot.com. This has 2 simple servlets:

@WebServlet(name = "VerifyIdToken", value = "/idtoken")
public class VerifyIdToken extends HttpServlet {

  private static final JsonFactory JSON_FACTORY = Utils.getDefaultJsonFactory();

  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {

    String idToken = req.getParameter("idToken");
    long start = System.nanoTime();
    try {
      FirebaseAuth.getInstance().verifyIdToken(idToken);
    } catch (FirebaseAuthException e) {
      throw new ServletException(e);
    }
    long duration = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

    resp.setContentType("application/json");
    resp.getWriter().println(JSON_FACTORY.toString(ImmutableMap.of("duration", duration)));
  }
}
@WebServlet(name = "GetUser", value = "/user")
public class GetUser extends HttpServlet {

  private static final JsonFactory JSON_FACTORY = Utils.getDefaultJsonFactory();

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    String uid = req.getParameter("uid");
    long start = System.nanoTime();
    try {
      UserRecord user = FirebaseAuth.getInstance().getUser(uid);
      long duration = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

      resp.setContentType("application/json");
      resp.getWriter().println(JSON_FACTORY.toString(ImmutableMap.of(
          "uid", user.getUid(),
          "duration", duration)));
    } catch (FirebaseAuthException e) {
      throw new ServletException(e);
    }
  }
}

verifyIdToken()

Here are some consecutive readings from the 1st servlet. This is just the time spent on calling verifyIdToken():

1093, 5, 3, 3, 10, 7, 4, 98, 2, 3, 3, 2, 4, 2, 2, 9, 1387, 8, 2, 2

Key observations:

  1. My cold start latency is just over 1 second, which I would consider to be acceptable for what it does. But yours is over 4 seconds, which is problematic and I doubt all that time is spent in the SDK.
  2. Notice that my 17th request also took over 1s. But I found the following log entry in App Engine for this request:
    This request caused a new process to be started for your application, and thus caused
    your application code to be loaded for the first time. This request may thus take longer
    and use more CPU than a typical request for your application.

I don't know why GAE decided to spawn a new instance at this particular moment. May be my sequence of requests caused the auto scalar to kick in and start a new instance. But at least there's a clear and rational explanation as to why that request took longer to execute. Also note that the full end-to-end time to process this particular request was nearly 17 seconds, which is absurdly high! But it's also beyond our control, and more of a question for the GAE team.

getUser()

Here are some consecutive readings from the second servlet:

281, 161, 241, 138, 132, 134, 233, 151, 193, 162

These numbers are again what I'd expect to see. Since both the app and the Firebase project are in the same cloud region, their round trip latency should only be a few hundred milliseconds.

I think your app is unusually slow for some reason but I highly doubt the extra overhead has anything to do with the SDK. I think you will have to perform some careful benchmarking and profiling (using test apps similar to mine) to figure out what's exactly going on.

hiranya911 commented 4 years ago

For completeness here I'm also sharing some test results obtained using Apache Bench. These results contain the full end-to-end latency as measured from my test client in US West.

Server Software:        Google
Server Hostname:        admin-java-integration.uc.r.appspot.com
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-CHACHA20-POLY1305,256,256
TLS Server Name:        admin-java-integration.uc.r.appspot.com

Document Path:          /idtoken
Document Length:        15 bytes

Concurrency Level:      1
Time taken for tests:   7.412 seconds
Complete requests:      100
Failed requests:        2
   (Connect: 0, Receive: 0, Length: 2, Exceptions: 0)
Keep-Alive requests:    100
Total transferred:      48606 bytes
Total body sent:        99200
HTML transferred:       1502 bytes
Requests per second:    13.49 [#/sec] (mean)
Time per request:       74.120 [ms] (mean)
Time per request:       74.120 [ms] (mean, across all concurrent requests)
Transfer rate:          6.40 [Kbytes/sec] received
                        13.07 kb/s sent
                        19.47 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   7.2      0      72
Processing:    60   73  11.4     71     136
Waiting:       60   73  11.4     71     136
Total:         60   74  12.8     71     136

Percentage of the requests served within a certain time (ms)
  50%     71
  66%     75
  75%     78
  80%     80
  90%     86
  95%     93
  98%    133
  99%    136
 100%    136 (longest request)
Server Software:        Google
Server Hostname:        admin-java-integration.uc.r.appspot.com
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-CHACHA20-POLY1305,256,256
TLS Server Name:        admin-java-integration.uc.r.appspot.com

Document Path:          /user?uid=testuser
Document Length:        34 bytes

Concurrency Level:      1
Time taken for tests:   27.917 seconds
Complete requests:      100
Failed requests:        2
   (Connect: 0, Receive: 0, Length: 2, Exceptions: 0)
Keep-Alive requests:    100
Total transferred:      52810 bytes
HTML transferred:       3398 bytes
Requests per second:    3.58 [#/sec] (mean)
Time per request:       279.169 [ms] (mean)
Time per request:       279.169 [ms] (mean, across all concurrent requests)
Transfer rate:          1.85 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   8.3      0      83
Processing:   164  278  99.6    231     615
Waiting:      163  278  99.6    231     615
Total:        164  279 101.1    231     615

Percentage of the requests served within a certain time (ms)
  50%    231
  66%    308
  75%    352
  80%    399
  90%    410
  95%    429
  98%    614
  99%    615
 100%    615 (longest request)
hiranya911 commented 4 years ago

One thing to think about is the number of users in the project. My test project only has about 350 users. I wonder if the performance of the getUser() REST call degrades with the number of users in the project. If you suspect this to be the case, I'd recommend filing a bug with Firebase Support and have the Auth team chime in on that issue.

PS: Out of curiosity, around how many user accounts do you have in the project?

hiranya911 commented 4 years ago

Here are my cold start request latencies captured from GAE logs (these numbers are in seconds):

14.7, 16.9, 15, 14.6, 19.8, 8.1, 8.7, 7.8, 14,
18.8, 20.3, 7.3, 7.8, 14.4, 9.4, 31.6, 16, 8.8

These results were recorded over a period of 4 hours. In most of these cases the SDK latency was still in 1-2 second range. I did capture one instance where the SDK latency shot up to 5s. But that seems to be an extremely rare case, and certainly not the average case behavior.

hiranya911 commented 4 years ago

Digging a little bit deeper into one of the cold start requests. This one took 11.7 seconds in total, with verifyIdToken() taking 1271ms.

I 2020-06-02T01:23:56.229947Z POST 200 146 B 11.5 s curl/7.54.0 /idtoken POST 200 146 B 11.5 s curl/7.54.0 5ed5aa2c00ff03823b7433bf6f0001737e61646d696e2d6a6176612d696e746567726174696f6e000168656c6c6f776f726c64000100
A 2020-06-02T01:23:58.651047Z 2020-06-02 01:23:58.645:INFO::main: Logging initialized @1708ms to org.eclipse.jetty.util.log.StdErrLog 
A 2020-06-02T01:23:58.651047Z 2020-06-02 01:23:58.645:INFO::main: Logging initialized @1708ms to org.eclipse.jetty.util.log.StdErrLog 
A 2020-06-02T01:23:59.259396Z 2020-06-02 01:23:59.257:INFO:oejs.Server:main: jetty-9.4.29.v20200521; built: 2020-05-21T17:20:40.598Z; git: 77c232aed8a45c818fd27232278d9f95a021095e; jvm 11.0.7+10-post-Ubuntu-2ubuntu218.04 
A 2020-06-02T01:24:04.747636Z 2020-06-02 01:24:04.747:INFO:oeja.AnnotationConfiguration:main: Scanning elapsed time=2719ms 
A 2020-06-02T01:24:04.963535Z 2020-06-02 01:24:04.962:INFO:oejs.session:main: DefaultSessionIdManager workerName=node0 
A 2020-06-02T01:24:04.963560Z 2020-06-02 01:24:04.962:INFO:oejs.session:main: No SessionScavenger set, using defaults 
A 2020-06-02T01:24:04.969501Z 2020-06-02 01:24:04.967:INFO:oejs.session:main: node0 Scavenging every 600000ms 
A 2020-06-02T01:24:05.007690Z Initializing servlet app 
A 2020-06-02T01:24:05.303933Z SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 
A 2020-06-02T01:24:05.303956Z SLF4J: Defaulting to no-operation (NOP) logger implementation 
A 2020-06-02T01:24:05.303968Z SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
A 2020-06-02T01:24:06.005288Z 2020-06-02 01:24:06.004:INFO:oejsh.ContextHandler:main: Started o.e.j.w.WebAppContext@4149c063{/,file:///tmp/jetty-0_0_0_0-8081-helloworld_war-_-any-16137155675089915255.dir/webapp/,AVAILABLE}{helloworld.war} 
A 2020-06-02T01:24:06.070240Z 2020-06-02 01:24:06.061:INFO:oejs.AbstractConnector:main: Started ServerConnector@59a6e353{HTTP/1.1, (http/1.1)}{0.0.0.0:8081} 
A 2020-06-02T01:24:06.070293Z 2020-06-02 01:24:06.068:INFO:oejs.Server:main: Started @9151ms 

We see that the request was first logged at 01:23:56. This is probably when it was received by GAE. The we see the ServletContextListener.contextInitialized() running at 01:24:05. So it took 9s just for the servlet container to bootstrap.

A 2020-06-02T01:24:06.070293Z 2020-06-02 01:24:06.068:INFO:oejs.Server:main: Started @9151ms 

After that it only took 2.7s more to process the request, of which 1.2 was spent in the SDK. Could it be that the extra overhead you're seeing is simply the time to bootstrap the servlet container?

deepfriedbrain commented 4 years ago

@hiranya911

I would take 1.5s for the first call any day.

My GAE app is on Java 8 Standard Runtime (F1).

My Firebase app has a few tens of a thousand users. So to eliminate the possibility of the number of users causing this issue, I used another test Firebase project that had just 10 users. I got the exact same latency on the first request to verifyIdToken() averaging about 4500ms, and getUser() took about 1500ms. So it's not related to the number of users in the Firebase project.

My app receives a warmup request that takes 6-7s usually and up to 12s occasionally. This initializes all my Singleton classes (about 25 of them) and one of them also initializes the FirebaseApp. The FirebaseApp initialization takes about 200ms.

My app has several EndPoints. Every EP calls a method to verify the Id Token first. That method calls FirebaseAuth.getInstance(), and it takes about 200ms or so (for the first request). Then comes verifyIdToken(), which takes about 4500ms. So these 2 calls take up the first 4.5-5s of the very first request to 'any' EP in my application. Some EPs are simple get or put calls and they usually take less than 100ms, but because of the FirebaseAuth calls, if the first request goes to such EPs, the response time becomes 5s+.

I have a very rudimentary approach to time the call. There's really no bootstrap time included in this as far as I can tell. I'm picking the time from the logs and also from the datastore.

startTime = new Date().getTime();
decodedToken = auth.verifyIdToken(idToken);
endTime = new Date().getTime();
logger.info("Time to verifyIdToken : " + (endTime - startTime));

// Log the stats in the datastore as well

Could there be something wrong with the way that I initialize my FirebaseApp? I've shared my initialization code in my second comment from the top.

I also went back to look at the old logs from an older version of my app, which was using Google Identity Toolkit token verification. For the first request, it used to take about 1 - 1.2s. So comparatively also for the very same app, I'm seeing a huge increase in latency for the first request. There's no material change other than replacing Identity Toolkit SDK with FirebaseAuth Admin SDK in the method that verifies the Id Token.

Is there a way to do the Id Token verification using REST API? I can try it and see if it makes any difference.

hiranya911 commented 4 years ago

My GAE app is on Java 8 Standard Runtime (F1).

GAE Java8 runtime is a very different stack so I've now deployed my test app on Java8 (their network stack is particularly unusual and has caused issues in the past). But I'm still not seeing the high latencies you're reporting.

First request to /user: 1086ms First request to /idtoken: 865ms (This is probably under reported, since the request to /user has already warmed up parts of the JVM, I'll try with a fresh request later.)

The end-to-end time for the first request was close to 11 seconds:

W 2020-06-02T18:38:48.553360Z GET 200 104 B 10.9 s curl/7.54.0 /user?uid=testuser GET 200 104 B 10.9 s curl/7.54.0 5ed69cb800ff087190a5166b170001737e61646d696e2d6a6176612d696e746567726174696f6e0001323032303036303274313133373239000100

You can also try the /user endpoint by sending a request to https://admin-java-integration.uc.r.appspot.com/user?uid=testuser. I'll keep the endpoint up for a few days.

Post warmup benchmark results:

Server Software:        Google
Server Hostname:        admin-java-integration.uc.r.appspot.com
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-CHACHA20-POLY1305,256,256
TLS Server Name:        admin-java-integration.uc.r.appspot.com

Document Path:          /idtoken
Document Length:        15 bytes

Concurrency Level:      1
Time taken for tests:   7.484 seconds
Complete requests:      100
Failed requests:        1
   (Connect: 0, Receive: 0, Length: 1, Exceptions: 0)
Keep-Alive requests:    100
Total transferred:      48605 bytes
Total body sent:        99200
HTML transferred:       1501 bytes
Requests per second:    13.36 [#/sec] (mean)
Time per request:       74.836 [ms] (mean)
Time per request:       74.836 [ms] (mean, across all concurrent requests)
Transfer rate:          6.34 [Kbytes/sec] received
                        12.95 kb/s sent
                        19.29 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   9.3      0      93
Processing:    60   74  23.7     68     196
Waiting:       60   74  23.7     67     196
Total:         60   75  25.1     68     196

Percentage of the requests served within a certain time (ms)
  50%     68
  66%     69
  75%     70
  80%     72
  90%    100
  95%    105
  98%    196
  99%    196
 100%    196 (longest request)
Server Software:        Google
Server Hostname:        admin-java-integration.uc.r.appspot.com
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-CHACHA20-POLY1305,256,256
TLS Server Name:        admin-java-integration.uc.r.appspot.com

Document Path:          /user?uid=testuser
Document Length:        34 bytes

Concurrency Level:      1
Time taken for tests:   19.249 seconds
Complete requests:      100
Failed requests:        19
   (Connect: 0, Receive: 0, Length: 19, Exceptions: 0)
Keep-Alive requests:    100
Total transferred:      50489 bytes
HTML transferred:       3381 bytes
Requests per second:    5.19 [#/sec] (mean)
Time per request:       192.495 [ms] (mean)
Time per request:       192.495 [ms] (mean, across all concurrent requests)
Transfer rate:          2.56 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1  10.1      0     101
Processing:   146  191  38.8    184     453
Waiting:      146  191  38.8    184     453
Total:        146  192  40.5    184     453

Percentage of the requests served within a certain time (ms)
  50%    184
  66%    195
  75%    204
  80%    206
  90%    220
  95%    294
  98%    317
  99%    453
 100%    453 (longest request)

That method calls FirebaseAuth.getInstance(), and it takes about 200ms or so (for the first request).

Even that's quite large. It's purely an in-memory operation (no RPCs), and shouldn't take more than 10-20ms. On my local machine I got following readings for the first execution of this method: 7, 10, 8, 12, 10, 9.

Could there be something wrong with the way that I initialize my FirebaseApp?

I don't think so. Looks reasonable to me. I use the following code inside a ServerletContextListener to init the SDK.

  @Override
  public void contextInitialized(ServletContextEvent servletContextEvent) {
    System.out.println("Initializing servlet app");
    try {
      FirebaseOptions options = FirebaseOptions.builder()
          .setCredentials(GoogleCredentials.getApplicationDefault())
          .setProjectId("admin-java-integration")
          .build();
      FirebaseApp.initializeApp(options);
    } catch (IOException e) {
      throw new RuntimeException("Failed to initialize app", e);
    }
  }

Is there a way to do the Id Token verification using REST API?

I'm not aware of such a way. But you can try to implement your own using the guidance provided at https://firebase.google.com/docs/auth/admin/verify-id-tokens#verify_id_tokens_using_a_third-party_jwt_library

hiranya911 commented 4 years ago

I think at this point I'm reasonably convinced that there are no glaring performance issues in the SDK itself. The results from the 3 environments I've tested this on (local machine, gae java8, gae java11) look pretty good. I'm assuming you can't repro these long latency readings for verifyIdToken() either when running it locally.

I've also gathered some cold start latency times for verifyIdToken() on gae-java8, which are pretty large, but the time spent on the SDK is still around 1-2s:

sdk_time_ms, e2e_time_ms
1227, 10826
2623, 24163
1046, 13767
1235, 14778

Also from what I can see, once an instance has started up it seems to stick around for about 15 minutes before its shutdown. So unless the requests are quite far apart in time, I don't see how one can encounter such slow requests frequently.

At this point I'd advise you to contact GAE support and see if they can provide some guidance. Since this is a case of an API running slow only when deployed to GAE, they might be able to provide some assistance. Or please provide an mcve that we can deploy to App Engine and run tests on.

hiranya911 commented 4 years ago

I'm finally able to repro some of those high latency requests, but only under very specific conditions:

  1. Code must be running in GAE java8 prod environment (doesn't happen anywhere else, including the local devappserver).
  2. App must be already loaded via a warmup request or by some other means (it is not an app cold start issue as I initially thought).
  3. No other calls to verifyIdToken() should have already been made.

When all these conditions are met, I was able to get the following readings:

sdk_time_ms, e2e_time_ms
3074, 3548
4078, 4750

It's very tricky to repro, since after each reading I have to wait around 20 minutes for the app to get unloaded before I can test again.

This clearly seems to be some GAE-specific weirdness. The same code when run locally (using the local devappserver), takes only 500-700ms for that initial call:

sdk_time_ms, e2e_time_ms
544, 581
575, 621

So it's nearly a 10x slowdown in the prod environment. I'd recommend taking this information to GAE support and see if they can provide an explanation.

I can think of one optimization we can possibly make in the SDK, but I'm hesitant to implement it because: a. I'm not a big fan of making optimizations targeting specific environments (especially the old java8 environment). b. Without knowing what exactly causes this slowdown, there's no guarantee it would even work.

deepfriedbrain commented 4 years ago

@hiranya911 All 3 conditions apply in my case.

I'm actually not sure why a warmup request would make it slow. If anything, it's supposed to make it faster. I will remove my setting for warmup request today and see how it goes.

I'll also look into upgrading from Java 8 to Java 11. It's overdue anyway, but I didn't have a compelling reason to do so.

I'm on the Basic plan with GAE and so far I have not had a lot of luck getting any of my issues addressed. It might be actually easier to upgrade to Java 11.

Regarding the optimization in the SDK, if it's quick enough for you to do, you can send me the updated Jar and I can try to test it with my app.

I'll also in the middle of doing some investigation and will share my findings soon.

Thank you for your awesome work!

hiranya911 commented 4 years ago

I'm actually not sure why a warmup request would make it slow. If anything, it's supposed to make it faster. I will remove my setting for warmup request today and see how it goes.

It is indeed very strange. My theory (which is very flimsy at the moment) is that when you run verifyIdToken() as part of a cold start request, some of the initialization cost gets amortized. But if you run it for the first time outside of a cold start request, you see the full latency of 4-5s.

Regarding the optimization in the SDK, if it's quick enough for you to do, you can send me the updated Jar and I can try to test it with my app.

I can try to get an experimental jar file uploaded tomorrow.

Can you also try using the UrlFetchTransport in FirebaseOptions? I don't have a ton of data points to verify, but the few I have suggests it's slightly faster in App Engine.

import com.google.api.client.extensions.appengine.http.UrlFetchTransport;

FirebaseOptions options = FirebaseOptions.builder()
          .setHttpTransport(UrlFetchTransport.getDefaultInstance()) // <--- Use this
          .setCredentials(GoogleCredentials.getApplicationDefault())
          .setProjectId("...")
          .build();
sdk_time_ms, e2e_time_ms
2192, 2659
2190, 2471
deepfriedbrain commented 4 years ago

Sure, I'll try it out and let you know.

BTW, on your earlier point about waiting 20 minutes for an instance to shutdown, you can manually Delete an instance from the App Engine console.

deepfriedbrain commented 4 years ago

@hiranya911

The warmup request does seem to impact the time for verifyIdToken() in my tests as well. For cold start, you first need to disable warmup requests in the appengine-web.xml and then Delete your instance from the console, or redeploy your app. In both cases, you'll not see any warmup requests anymore.

  <warmup-requests-enabled>false</warmup-requests-enabled>

Using UrlFetchTransport seems to make a big difference with verifyIdToken(). These are my request timings with 'warmup':

2196 3018 3006 2831 2624 2893 2685 2803 2616

These are the timings with cold start:

2113 1588 1903 1688

These are pretty much in-line with the few data points that you shared. Most numbers for warmup are between 2.5-3s and the cold start are about full second less. In my production, I would still use warmup because that saves about 5-6s of upfront startup time. So even with 1sec loss, it's still saves 4-5s.

This is a huge win!

Just to be very sure, I removed UrlFetchTransport and the number went back up to 3.5-4.5s. Then I put it back and the numbers went down again. So it makes a difference for sure.

The time for FirebaseAuth.getInstance() has improved as well. There are the numbers corresponding to the above 9 requests:

For warmed up server:

101 93 83 85 192 13 10 88 100

For cold start server:

69 15 9 8

Previously, these were in the 200-250ms range.

Now the bummer - the first call to getUser() seems to get slower when using UrlFetchTransport. I don't have too many data points for this, but here's what I got:

getUser first request with UrlFetchTransport with warmed up server:

3610 4189 4491 3861

Cold start numbers are:

4000 3942 3805 3450

getUser first request without UrlFetchTransport with warmed up server:

3005 3097 3532 2504

But the subsequent requests for getUser() are now much faster and averaging just about 150ms. This is another big win. Previously, this was around 700ms.

NOTE: In all my previous tests getUser() was not getting called in the first request because my code was only calling it for unverified users and my test user was already verified. I don't have any data point from production where an unverified users made the first request to a warmed up server or cold started a server.

Do you have any explanation for why the first request to getUser() gets slower and why it would take this long? Also, it doesn't get any faster with cold start either. I thought that the preceding verifyIdToken() would have done all the caching stuff.

BTW, is UrlFetchTransport the optimization that you alluded to earlier?

My next target is to upgrade to Java 11. It might take me a few days. I really have no clue what it would take, but I'll get there soon. I'll post my updates after the upgrade.

hiranya911 commented 4 years ago

Now the bummer - the first call to getUser() seems to get slower when using UrlFetchTransport.

This API uses credentials. So try passing UrlFetchTransport to your GoogleCredentials instance as well.

GoogleCredentials credentials = GoogleCredentials.fromStream(in, new HttpTransportFactory() {
    @Override
    public HttpTransport create() {
        return UrlFetchTransport.getDefaultInstance();
    }
});

Do you have any explanation for why the first request to getUser() gets slower and why it would take this long? Also, it doesn't get any faster with cold start either. I thought that the preceding verifyIdToken() would have done all the caching stuff.

These 2 APIs are very different and don't share anything. verifyIdToken() only makes a network call on the first request, and the subsequent invocations are handled entirely locally. getUser() makes 2 network calls on the first request, and the subsequent invocations do one network call each. The network calls are to entirely different endpoints so calling verifyIdToken() doesn't do anything to help the latency of getUser().

Hopefully using UrlFetchTransport in the credentials addresses this slow down to a certain degree.

BTW, is UrlFetchTransport the optimization that you alluded to earlier?

No I was thinking about a different code change in the SDK. Then it will require another code change in your end to use it too.

My next target is to upgrade to Java 11. It might take me a few days. I really have no clue what it would take, but I'll get there soon. I'll post my updates after the upgrade.

I'll wait for your results from the java11 environment before making any changes in the SDK. Hopefully using the new environment will address most if not all of the performance issues. Note that UrlFetchTransport cannot be used in java11.

deepfriedbrain commented 4 years ago

@hiranya911

Yes with UrlFetchTransport in the credential, getUser() is down to 1.2s on average for the first request. This is a huge improvement (1/3rd) over the previous result.

The upgrade of my GAE app from Java 8 to 11 is not as straight-forward as I thought. I read through the migration guide and found that several key functions, which I'm currently using, are removed and need to be replaced with third-party solutions or other alternatives.

Datastore would need to move to Firestore. Memcache, Mail, Logging, etc. need to move to alternative solutions. There could be other changes too.

I'm not sure whether I'll be upgrading in near future. I need some time to research and plan first. I'm fine to close this issue now, as the issue is resolved to a good extent. I'll definitely come back to post my updates after upgrading to Java 11. Also if you do work on the experimental fix in the SDK and want me to try it, let me know.

I really appreciate your help. You went way beyond my expectations, to help. I learned a lot as well through the interactions on this issue. Thank you so much. Keep up the great work!

hiranya911 commented 4 years ago

Thanks @deepfriedbrain. I will close this based on the progress made so far. We can revisit this issue in the future if further improvements are necessary.

deepfriedbrain commented 4 years ago

@hiranya911 I've developed a workaround to shift the latency to the warmup request.

Inside the ServletContextListener, for a test user account, I generate a custom token , then use it to exchange for an Id Token, and finally call verifyIdToken(). Using this approach, the latency gets shifted to the warmup request and the first user request to a new instance goes through much faster.

However, these are still several requests (1. Generate custom token, 2. REST API call to get Id Token by passing custom token, 3. verify Id Token). I was wondering if there's a shortcut approach where I can get verifyIdToken() to do the caching that it would normally do without this 3 step process. It seems like it always needs a real valid Id Token to do the caching that it would do for the first request.

hiranya911 commented 4 years ago

I was wondering if there's a shortcut approach where I can get verifyIdToken() to do the caching that it would normally do without this 3 step process.

There's currently no other way. You need a valid ID token to warm up the underlying public key manager.