AzureAD / microsoft-authentication-library-for-python


Guarding against perf regression for acquire_token_for_client() #580

Closed rayluo closed 10 months ago

rayluo commented 11 months ago

Goals defined in early July 2023

This PR may eventually accomplish the following goals.

  1. Scenarios: a multi-tenant S2S (service-to-service) app with an in-memory token cache that is either (1) initially empty, so every incoming request generates a cache miss, or (2) prepopulated with 1000 tenants and 10 tokens per tenant, so every incoming request generates a cache hit. (A rough sketch of scenario (1) follows this list.)
  2. Metrics: latency in ms.
  3. How is data collected and stored (aim for one DB schema)? TBD
  4. How do we read the historical data into the build (from the central DB described in item 3)? (The intention is to have a baseline to compare against, and to break the build on regression.)
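
For concreteness, here is a rough sketch (not code from this PR) of how scenario (1) above could be driven against MSAL's public API; the client credentials, tenant list, and scope are placeholders, and the actual benchmark keeps this traffic off the network via the simulator described in the implementation details below.

```python
import time

import msal

SCOPE = ["https://graph.microsoft.com/.default"]  # placeholder scope


def measure_cache_miss_latencies(client_id, client_secret, tenant_ids):
    """Scenario (1): a shared in-memory cache that starts empty, so the
    first acquire_token_for_client() call for each tenant is a cache miss."""
    cache = msal.TokenCache()  # empty cache shared across all tenants
    latencies_ms = []
    for tenant in tenant_ids:
        app = msal.ConfidentialClientApplication(
            client_id,
            client_credential=client_secret,
            authority="https://login.microsoftonline.com/" + tenant,
            token_cache=cache,
        )
        start = time.perf_counter()
        app.acquire_token_for_client(scopes=SCOPE)  # cache miss -> token request
        latencies_ms.append((time.perf_counter() - start) * 1000)  # goal 2: ms
    return latencies_ms
```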

UPDATE in late July 2023

We followed the spirit of the goals above and pivoted along the way.

  1. This PR populates the token cache for test cases in the matrix below, and during the benchmark the token cache remains unchanged, so its size does not grow. The matrix crosses two dimensions: whether a new token request results in a cache hit or a cache miss, and how the cache is populated (struck-through layouts were dropped along the way):
     • ~1 tenant, 1 token in cache~
     • 1 tenant, 10 tokens in cache
     • ~1 tenant, 1000 tokens in cache~
     • 1000 tenants, 10 tokens per tenant in cache
  2. Metrics. Our chosen benchmark tool happens to report "operations per second", while internally we use "time elapsed during a test case". The two are mathematically equivalent, since one is simply the reciprocal of the other.

  3. Collecting data into a central database? The purpose was to render historical data into a nice chart. We found a lightweight GitHub Benchmark Action that can render charts without depending on a central database. See the diagram generated by this PR.

  4. How do we get baseline data against which to detect regression?

    There are two mechanisms available. Neither of them requires a central DB. We may choose either or both of them.

    • The GitHub Benchmark Action we use stores historical benchmark data in a dedicated git branch.

    • ~We could also dynamically benchmark a stable, well-known computation as a reference load, and then compare MSAL's performance against it. For example, the reference could be "the time needed to compute the 20th Fibonacci number", or simply a benchmark of the built-in dictionary implementation of the language we are using. We would then benchmark MSAL's scenarios (see goal 1 above), which might turn out to be, say, 1.234 times slower than the reference, and define the threshold as "performance shall not exceed 1.234 * (1 + threshold)".~ This experiment was abandoned, because the reference workload has different characteristics than the real workloads, and the ratio between the real workload and the reference differs across test machines.

    In either approach, run-to-run variation is in the double-digit percentage range. The GitHub Benchmark Action uses 200% as its default threshold. (A simplified sketch of such a threshold check follows this list.)
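
To make the regression gate concrete, here is a simplified illustration of the kind of threshold comparison involved. This is not the Benchmark Action's actual code; the JSON shape is assumed to follow pytest-benchmark's --benchmark-json output, where each test case reports its throughput as stats["ops"] (operations per second, i.e. the reciprocal of elapsed time per call).

```python
import json


def check_regression(baseline_path, current_path, threshold=2.0):
    """Flag test cases that became slower than `threshold` times the baseline.
    threshold=2.0 mirrors the 200% default mentioned above."""
    def ops_by_name(path):
        with open(path) as f:
            data = json.load(f)
        # pytest-benchmark lists each test under "benchmarks", with
        # throughput available as stats["ops"] (operations per second).
        return {b["name"]: b["stats"]["ops"] for b in data["benchmarks"]}

    baseline = ops_by_name(baseline_path)
    current = ops_by_name(current_path)
    regressions = []
    for name, base_ops in baseline.items():
        cur_ops = current.get(name)
        if cur_ops and base_ops / cur_ops > threshold:  # fewer ops/s => slower
            regressions.append((name, base_ops, cur_ops))
    return regressions
```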

Implementation Details (useful when you want to review this PR's implementation)

A perf test solution contains 3 high-level components:

(a) a tailored simulator that can generate load for our test subject;
(b) a perf tester that can run those simulators many times, in a deterministic amount of time, producing some statistics, preferably with a small standard deviation;
(c) a mechanism to store the perf tester's historical results, in order to detect regressions and render nice diagrams.

We had to build "(a) the simulator" ourselves, which we did, in the file tests/simulator.py.
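
To give reviewers a feel for the idea, below is a minimal sketch of a load generator along these lines. It relies on MSAL's documented http_client and instance_discovery constructor parameters and serves canned responses; the actual tests/simulator.py may shape its responses and populate the cache differently.

```python
import json

import msal


class FakeResponse:
    """The minimal response surface MSAL reads: status_code, text, headers."""
    def __init__(self, payload, status_code=200):
        self.status_code = status_code
        self.text = json.dumps(payload)
        self.headers = {}

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError("simulated HTTP error %d" % self.status_code)


class FakeHttpClient:
    """Serves canned responses so the benchmark never touches the network."""
    def get(self, url, **kwargs):
        # Canned OpenID Connect discovery document, just enough for MSAL
        # to locate its (fake) authorization and token endpoints.
        return FakeResponse({
            "authorization_endpoint":
                "https://login.microsoftonline.com/common/oauth2/v2.0/authorize",
            "token_endpoint":
                "https://login.microsoftonline.com/common/oauth2/v2.0/token",
        })

    def post(self, url, **kwargs):
        # Canned token-endpoint response for acquire_token_for_client().
        return FakeResponse({
            "access_token": "a_fake_token",
            "token_type": "Bearer",
            "expires_in": 3600,
        })

    def close(self):
        pass


def build_app(tenant="common"):
    """A confidential client app whose HTTP traffic is entirely simulated."""
    return msal.ConfidentialClientApplication(
        "fake_client_id",
        client_credential="fake_secret",
        authority="https://login.microsoftonline.com/" + tenant,
        http_client=FakeHttpClient(),  # inject the offline transport
        instance_discovery=False,      # skip the instance-discovery round trip
    )
```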

We chose the "Continuous Benchmark" GitHub Action, which is mainly a tool for "(c) detect and render", but it cleverly consumes the output of many other tools that operate in the "(b) perf tester" space, one per language. In this PR we use pytest-benchmark as our "(b) perf tester", with test cases defined in test_benchmark.py.
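
With a simulator like the one sketched above, a pytest-benchmark test case could look roughly like the following; build_app is the hypothetical helper from that sketch, not necessarily what test_benchmark.py actually does.

```python
def test_acquire_token_for_client_cache_hit(benchmark):
    app = build_app()  # hypothetical helper from the simulator sketch above
    # The first call populates the cache through the fake transport; the
    # remaining calls are served from the cache. pytest-benchmark's
    # `benchmark` fixture invokes the target repeatedly and reports
    # ops/second, mean, stddev, etc., which Continuous Benchmark consumes.
    result = benchmark(
        app.acquire_token_for_client,
        scopes=["https://graph.microsoft.com/.default"],  # placeholder scope
    )
    assert "access_token" in result
```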

The configuration for Continuous Benchmark is in python-package.yml.