cesium-ml / baselayer

Fully customizable (scientific, compute-intensive) web application template
http://cesium-ml.org/baselayer/
31 stars 18 forks source link

scope database sessions to requests instead of threads #214

Closed dannygoldstein closed 3 years ago

dannygoldstein commented 3 years ago

Prerequisite reading: https://docs.sqlalchemy.org/en/14/orm/session_basics.html#session-faq-whentocreate

At the moment, we maintain a single sqlalchemy database session per thread and use this session for all requests handled by the thread. Said another way, the scope of our sessions currently is the entire thread, rather than the request. This model is unsafe. Consider the following scenario:

  1. Web worker 1 receives a request that routes to an asynchronous handler. The handler launches into a non-blocking coroutine.
  2. Web worker 1, free to handle new requests, receives another request that is routed to a blocking handler.
  3. Because both of these requests are being processed simultaneously in the same thread, there is a critical information leak: all of the objects brought into the thread's database session by request 1 are mixed with the objects from request 2

This breaks the permission checking framework, which assumes that all of the objects in the current session are in the scope of the current request (rather than the current thread).

This is a known issue with SQLalchemy and tornado. From the Sqlalchemy docs:

https://docs.sqlalchemy.org/en/14/orm/contextual.html#using-thread-local-scope-with-web-applications

As it turns out, most Python web frameworks, with notable exceptions such as the asynchronous frameworks Twisted and Tornado, use threads in a simple way, such that a particular web request is received, processed, and completed within the scope of a single worker thread.

Since this is not the case for Tornado, we need to explicitly specify the scope of our sessions:

https://docs.sqlalchemy.org/en/14/orm/contextual.html#using-custom-created-scopes

This PR implements a get_request_id function as a contextvar:

https://docs.python.org/3/library/contextvars.html

This module provides APIs to manage, store, and access context-local state. The ContextVar class is used to declare and work with Context Variables. The copy_context() function and the Context class should be used to manage the current context in asynchronous frameworks.

Context managers that have state should use Context Variables instead of threading.local() to prevent their state from bleeding to other code unexpectedly, when used in concurrent code.

The context var keeps track of the current request ID, which I pass to the handlers from NGINX, following this guide:

https://quanttype.net/posts/2020-02-05-request-id-logging.html

This ensures that each request context maintains its own database session, scoped to the request.

Related to #213.

pep8speaks commented 3 years ago

Hello @dannygoldstein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2021-03-24 17:51:59 UTC
stefanv commented 3 years ago

I will read in detail tomorrow, but relevant: we do not have threads that run async at the moment, afaik.

dannygoldstein commented 3 years ago

@stefanv Perhaps thread-safe is not the most accurate term for this PR - thread-safe refers to collisions between multiple threads trying to access the same address space. The problem that this PR addresses is somewhat different-- when one single thread is concurrently handling multiple requests, the different requests both share a session, causing the permissions checking infrastructure (which assumes sessions are scoped to requests) to fail. This pr simply enforces that database sessions are scoped to requests instead of threads

acrellin commented 3 years ago

I believe Dima has implemented some async handler methods in F extensions

On Tue, Mar 23, 2021, 5:39 PM Danny Goldstein @.***> wrote:

@stefanv https://github.com/stefanv Perhaps thread-safe is not the most accurate term for this PR - thread-safe refers to collisions between multiple threads trying to access the same address space. The problem that this PR addresses is somewhat different-- when one single thread is concurrently handling multiple requests, there is currently a bleed through between the different requests' database sessions.

— You are receiving this because your review was requested. Reply to this email directly, view it on GitHub https://github.com/cesium-ml/baselayer/pull/214#issuecomment-805381473, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABXFGTIXAQG4N35UEJIFVMDTFEYCVANCNFSM4ZWBBKZQ .

dannygoldstein commented 3 years ago

The offsets handler is async, as is the finding chart generator

stefanv commented 3 years ago

@dannygoldstein This is perfect. The only thing I don't follow is where you set the HTTP header. Do you mind point that out?

stefanv commented 3 years ago

Second question: why do we need a context variable, instead of an instance UUID?

acrellin commented 3 years ago

@dannygoldstein @stefanv It looks like it's being set by nginx here https://github.com/cesium-ml/baselayer/pull/214/files#diff-374ca41bad5bb33fc44ec617aef88cd102525dfe18eb1c89a75cea68e8f83c85R56 , but not all clients provide that header, right? How do we handle those cases?

stefanv commented 3 years ago

That's just the proxy forwarding the header, isn't it?

acrellin commented 3 years ago

Right, so it looks like this depends on whether the client provides that header in the request?

acrellin commented 3 years ago

nginx can set a header to UUID, if desired, though

dannygoldstein commented 3 years ago

$request_id is an embedded variable that is generated by NGINX for each incoming request -

http://nginx.org/en/docs/http/ngx_http_core_module.html - grep for $request_id (no permalink).

On Wed, Mar 24, 2021 at 10:25 AM Ari Crellin-Quick @.***> wrote:

Right, so it looks like this depends on whether the client provides that header in the request?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cesium-ml/baselayer/pull/214#issuecomment-806016355, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAVEFYFFZMCQHGVIAJJPZUTTFIN7XANCNFSM4ZWBBKZQ .

-- Danny Goldstein http://astro.caltech.edu/~danny

acrellin commented 3 years ago

ah, perfect

acrellin commented 3 years ago

@dannygoldstein but where does the "request-id" header get set? Does the X-Request-ID header get set with that key?

dannygoldstein commented 3 years ago

I followed this guide to set the request-id header using the nginx embedded variable request_id, perhaps it should be request_id instead of X-Request-Id for consistency with how I am accessing it in the handler? or perhaps we can simply use a uuid4 and there's no need to let nginx assign the ID.

https://stackoverflow.com/questions/17748735/setting-a-trace-id-in-nginx-load-balancer/38447872#38447872

On Wed, Mar 24, 2021 at 10:31 AM Ari Crellin-Quick @.***> wrote:

@dannygoldstein https://github.com/dannygoldstein but where does the "request-id" header get set? Does the X-Request-ID header get set with that key?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cesium-ml/baselayer/pull/214#issuecomment-806021064, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAVEFYFIZCE6IX26AJGFM3TTFIOYLANCNFSM4ZWBBKZQ .

-- Danny Goldstein http://astro.caltech.edu/~danny

dannygoldstein commented 3 years ago

Second question: why do we need a context variable, instead of an instance UUID?

The context variable ensures that this variable can take on different values in different contexts (requests) running asynchronously within the same thread.

What do you mean by instance uuid?

acrellin commented 3 years ago

@dannygoldstein no that part looks fine, but my question is whether the header would actually be "X-Request-ID" rather than the "request-id" you're accessing from within the handler

dannygoldstein commented 3 years ago

@acrellin I've now sidestepped the issue altogether and simply assigned the request_id inside of Tornado instead of forwarding the value from nginx.

stefanv commented 3 years ago

@dannygoldstein What I'm trying to understand is why we need the contexts, and not simply set an attribute on the request instance.

dannygoldstein commented 3 years ago

@dannygoldstein What I'm trying to understand is why we need the contexts, and not simply set an attribute on the request instance.

Can you show an example?

stefanv commented 3 years ago

Each request instance handles a specific HTTP connection, if I'm not mistaken. And you cannot have multiple threads that access the database within one request?

dannygoldstein commented 3 years ago

@stefanv The database session is initialized at the baselayer/app/models.py module level. What I think you are suggesting is to do something like

class BaseHandler(PSAHandler):

    def prepare(self):
        self.session = scoped_session(sessionfactory())

    def get(self):
        self.session.query(...)

    def on_finish(self):
        self.session.remove()

Problem with this is we would have to go and refactor hundreds of lines of handler code to use self.session instead of DBSession. We'd also have to create a separate sessionfactory for sessions that are used outside of request handlers (e.g., in the IPython terminal). It would be a huge change.

dannygoldstein commented 3 years ago

We'd also have to refactor the event handlers, test suite, fixtures, etc. It would end being thousands of lines of code.

dannygoldstein commented 3 years ago

This way we have a single session factory, consistent scoping both in and out of request handlers, and no need to refactor any other code in the entire code base.

dannygoldstein commented 3 years ago

@dannygoldstein What I'm trying to understand is why we need the contexts, and not simply set an attribute on the request instance.

The reason is that DBSession is initialized at the top level of app/models.py, and can't access any variables local to handler methods in its scopefunc. That's what the contextvar allows us to get around.

stefanv commented 3 years ago

@profjsb "but it's only 6 lines of code"! :joy:

Turns out these 6 (now 5) lines of code are... tricky. I spent a lot of time trying to figure out exactly how Tornado navigates contexts. Take a look at this Tornado issue, e.g.:

https://github.com/tornadoweb/tornado/pull/2938

(This patch was introduced after the 6.0.4 that we depend on.)

Long and short of it, I think the reason any of this works is because a) Tornado 5 and up builds on asyncio and b) Tornado now has some context awareness itself.

What is not clear to me yet is exactly when contexts switch out. Ideally, we'd want a context switch per request handler, but we may get now get contexts depending on... well, I don't know and I need help here. I cannot find any Tornado documentation that discusses contexts.

The asyncio documentation states:

When a Task is created it copies the current context and later runs its coroutine in the copied context.

The Task in this case would be a method of our handler, I suspect.

What I would like to test next is the following:

  1. Instantiate two handlers, a and b
  2. Run a.prepare, then run b.prepare
  3. Execute a method of a then b. Verify whether those methods have access to different contexts.
dannygoldstein commented 3 years ago

@stefanv https://groups.google.com/g/python-tornado/c/8mzGNnhUxtU?pli=1

from Ben Darnell:

"It will work automatically for native (async def) coroutines"

stefanv commented 3 years ago

@dannygoldstein Right, but what does that mean?

import contextvars
import uuid

ctx = contextvars.ContextVar('request_id', default=None)

class MyHandler:
    def __init__(self, name):
        self._name = name
        ctx.set(uuid.uuid4().hex)

    def get(self):
        print(self._name, ' -> ', ctx.get())

print('Default context:', ctx.get())
print('Initializing handler A')
a = MyHandler('a')
print('Global context value:', ctx.get())
print('Initializing handler B')
b = MyHandler('b')
print('Global context value:', ctx.get())
print('Executing a.get')
a.get()
print('Executing b.get')
b.get()
Default context: None
Initializing handler A
Global context value: d9890d36505049c0b44160aa7d0f40e6
Initializing handler B
Global context value: 3071db7f53b64203a51e2795ee0f2bc0
Executing a.get
a  ->  3071db7f53b64203a51e2795ee0f2bc0
Executing b.get
b  ->  3071db7f53b64203a51e2795ee0f2bc0
dannygoldstein commented 3 years ago

@stefanv I think the issue you linked do above is only relevant to coroutines that are created using the tornado.gen.coroutine module and not the native python ones that are created using async def. We have only been using the latter so far in SP and F

stefanv commented 3 years ago

@stefanv I think the issue you linked do above is only relevant to coroutines that are created using the tornado.gen.coroutine module and not the native python ones that are created using async def. We have only been using the latter so far in SP and F

Yes, not sure exactly how those differ other than https://www.tornadoweb.org/en/stable/guide/coroutines.html#native-vs-decorated-coroutines

dannygoldstein commented 3 years ago

@stefanv I think there is only one "context" in your example (the global context), whereas in our situations there would be 2 separate contexts, one for each request -- I think a closer analog to what would happen in tornado would be:

import contextvars
import asyncio
import uuid

ctx = contextvars.ContextVar('request_id', default=None)

class MyHandler:
    def __init__(self, name):
        self._name = name
        ctx.set(uuid.uuid4().hex)

    def get(self):
        print(self._name, ' -> ', ctx.get())

async def create_handler_and_set_context(name):
    handler = MyHandler(name)
    print(f'Initializing handler {name}')
    print('Global context value:', ctx.get())
    print(f'Executing {name}.get')
    handler.get()

async def main():
    await create_handler_and_set_context('a')
    await create_handler_and_set_context('b')

asyncio.run(main())
print('Global context value:', ctx.get())
(skyportal2) Dannys-MBP:baselayer danny$ python resp.py
Initializing handler a
Global context value: 5e451d05d07548b68c45196197e7b721
Executing a.get
a  ->  5e451d05d07548b68c45196197e7b721
Initializing handler b
Global context value: f322233198eb426da64ddfb9fbc6b99e
Executing b.get
b  ->  f322233198eb426da64ddfb9fbc6b99e
Global context value: None
stefanv commented 3 years ago

@dannygoldstein I was wondering more about what would happen to all the "standard" calls (i.e., non-async). I presume we'd prefer for them to also get their own context?

I think we can run an experiment with real Tornado to see whether what you're saying above holds (we can use get handlers that sleep for a while).

dannygoldstein commented 3 years ago

@stefanv Want to give that a shot? I tried implementing with asyncio and I think it would look something like this:

import contextvars
import asyncio
import uuid
import time

ctx = contextvars.ContextVar('request_id', default=None)

class MyHandler:
    def __init__(self, name):
        self._name = name
        ctx.set(uuid.uuid4().hex)

    def get(self):
        print(self._name, ' -> ', ctx.get())

def _body(name):
    handler = MyHandler(name)
    print(f'Initializing handler {name}')
    print('Global context value:', ctx.get())
    print(f'Executing {name}.get')
    handler.get()

async def create_handler_and_set_context(name):
    _body(name)
    time.sleep(3)    

def blocking_create_handler_and_set_context(name):
    _body(name)
    time.sleep(1.5)

async def main():
    asyncio.create_task(create_handler_and_set_context('async a'))
    asyncio.create_task(create_handler_and_set_context('async b'))
    blocking_create_handler_and_set_context('block c')
    blocking_create_handler_and_set_context('block d')
    asyncio.create_task(create_handler_and_set_context('async e'))

asyncio.run(main())
print('Global context value:', ctx.get())
(skyportal2) Dannys-MBP:baselayer danny$ python resp.py
Initializing handler block c
Global context value: c1ac0cbc2db24259809a9c45411e72ea
Executing block c.get
block c  ->  c1ac0cbc2db24259809a9c45411e72ea
Initializing handler block d
Global context value: 40231eea993d4060995e7131a8971e7e
Executing block d.get
block d  ->  40231eea993d4060995e7131a8971e7e
Initializing handler async a
Global context value: 254b3adfa6a74c04b9fd4c1ca2873007
Executing async a.get
async a  ->  254b3adfa6a74c04b9fd4c1ca2873007
Initializing handler async b
Global context value: 7077808e0fd84fddae4781171489c712
Executing async b.get
async b  ->  7077808e0fd84fddae4781171489c712
Initializing handler async e
Global context value: 9ec747e303324365b22a6692828a4dc7
Executing async e.get
async e  ->  9ec747e303324365b22a6692828a4dc7
Global context value: None
dannygoldstein commented 3 years ago

From tornado docs:

https://www.tornadoweb.org/en/stable/index.html#asyncio-integration

asyncio Integration Tornado is integrated with the standard library asyncio module and shares the same event loop (by default since Tornado 5.0). In general, libraries designed for use with asyncio can be mixed freely with Tornado.

dannygoldstein commented 3 years ago

@stefanv here is a tornado version,

from tornado import web, ioloop
import time
import numpy as np
import contextvars
import asyncio
import uuid

request_id = contextvars.ContextVar('request_id', default=None)

class BaseHandler(web.RequestHandler):

    def prepare(self):
        request_id.set(uuid.uuid4().hex)

class BlockingHandler(BaseHandler):

    def get(self):
        time.sleep(np.random.rand())
        self.write(request_id.get())

class AsyncHandler(BaseHandler):

    async def get(self):
        await asyncio.sleep(np.random.rand())
        self.write(request_id.get())

def make_app():
    return web.Application([
        (r"/blocking", BlockingHandler),
        (r'/async', AsyncHandler)
    ])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)
    ioloop.IOLoop.current().start()

I sent 50 requests to this at once using aiohttp (49 to the async handler and 1 to the blocking handler), the whole thing finished in ~1s (indicating that the async stuff on the server worked). I printed the request_ids returned by the server on the client side and verified that all 50 were unique, so I think this works fine in tornado

dannygoldstein commented 3 years ago

Here is my client code


# Send 50 requests to the server at once, 49 async, 1 blocking, then 
# verify that they all had separate request contexts on the server

import asyncio
import aiohttp

async def run(url_list):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in url_list:
            task = asyncio.ensure_future(session.get(url))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses
    return responses

async def main():
    url_list = ['http://localhost:8888/async'] * 50

    # send one blocking request
    url_list[25] = url_list[25].replace('async', 'blocking')

    task = run(url_list)
    future = await task
    context_ids = set()
    for r in future.result():
        context_ids.add(await r.text())
    assert len(context_ids) == 50
    print(context_ids)

asyncio.run(main())
(skyportal2) Dannys-MBP:baselayer danny$ time python client.py
{'27fc9539924043adac41c85afb533a2b', 'a867b251fdc54fd3a2de279bb3d66443', 'e22a3d6eb0124db89c5fbbe8f2612615', '5573d8ecd18a48c7bd448f6411291c9f', '64e0fd3daeb14a8a9a14a0e2ea10ce4a', 'e39841b3423d4d96bc6438739ae06e0f', '2262acdc09934ba3b270dd5abbfcb255', 'fbc6a204c3864c98badf08bfd0e59096', '130337ac1c214ba6a958d175f13d5d75', '48b0d98dbd664b35a629424dd2f15a7a', 'aa19e5cf270e4d7982d47657f2993173', '02dfc23ac8664f88b3a2a416637a63a9', '33964eb38b464d159257af4038fbec9d', 'b1bfc42e807f43a0be7c1f528f625818', 'e612464b672241509c8dd9fec34508cc', '2e3b3fb708b14b8ca808e25237a01c45', 'cc8b2750def14b13bda52159617d5faf', 'db3ed5124de745ccaf1cc54ee486531f', '4b5c370d19ad44b7b7bf76d223ef114b', 'a188bec587ed4b5ead8eb42157cc6170', '4a9b479144f14bef943d4940885196a8', '7c33d0fcbcb743138defe5b3b0ad9edc', 'dc939d777581426797cd683d7e8cf0b1', 'e97a2646c08342f98c3e7ad64e7e6777', 'fa6e0d303bbc4482b8fd8831a8f5e5ea', 'a915abefe4ae41e4ad81415cda0a893e', 'c7badc7ed9914767b40d7968de37ec3f', '5d8295d2a7fc402f8720cfbe3f8d20b7', 'b8d1733b1c9547e9ae56f3cca2213d30', '9678c52906984df19760de07c7a15732', 'e0e9d114959f469299747c0d011f05e3', '1defaf9261094c2ba1a31439177ad23c', '94177c00f027487eae2f45179e790f9e', '450d3a461a544ffb9d2fbb3a652aec9e', '906bf7ceaa6f43b88519879d9f0633b5', '3eb41aa7d9c74223934db7a763ff2023', '38a72829dd9145439aa3f43b37bc013d', 'b3f8d359601147b78e81e3d0393c5c20', '9beecbe7b8f34ae79349f61a4fc0f0ff', '5821fd67fcd9410d95bc8a0a30c1474c', '6fb1b16dd32a4dfd8a56ce49f8898b3d', 'be953628b8f446bcb7d892c1fa42f01e', '734fa054f11942e2937ee42f6ad27d7b', 'f89ea2e8df944933855dc39697e4bc99', 'f214da61a1814b2fb0c0ac7ddac1326e', '5f2f9dcb507b4262b557e1fdf3b6270a', '7b6286c19a444f83add35de132cbbc65', '9b41cfe14516455bb3cd715145308649', 'cd45200564a2426e8b0e4aeafbd6a524', '39cd38fe12ee41459f44bce44474ee63'}

real    0m1.650s
user    0m0.223s
sys 0m0.049s
stefanv commented 3 years ago

We have two requirements:

  1. Every new request must execute in its own context
  2. Every async def block must execute in the same context

Danny confirmed (1) above, and I've now also confirmed (2), so this is good to go.

stefanv commented 3 years ago

Thanks, @dannygoldstein!