biocore / microsetta-public-api

A public microservice to support The Microsetta Initiative
BSD 3-Clause "New" or "Revised" License

The Taxonomy model is initialized on each GET and it's very expensive #26

Closed wasade closed 4 years ago

wasade commented 4 years ago

Each GET seems to initialize the Taxonomy model, which takes about 2 minutes on my laptop. The interaction with this object is read-only -- is there a reason it needs to be re-initialized on each call rather than being done once at server start up?

wasade commented 4 years ago

I hacked a mechanism in to allow for this to be initialized a single time. It's still not apparent to me why initialization is so expensive. Subsequent GETs against an initialized object were ~50ms or so, which is okay. But I do think it's important that a strategy come in where this object is not re-initialized for every GET against it.

gwarmstrong commented 4 years ago

Got it. I noticed the slowness before; it makes sense that it's happening when the model is initialized.

It should be a quick fix to cache the model in microsetta_public_api.resources.resources.
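Something along these lines is what I have in mind -- a module-level cache so the expensive construction only runs on the first request. This is just a sketch; `get_taxonomy_model` and `_build_taxonomy_model` are hypothetical names, not the current resources API:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_taxonomy_model(resource_key):
    """Build the Taxonomy model once per resource and reuse it.

    Later calls with the same key return the cached instance instead of
    re-running the expensive construction on every GET.
    """
    return _build_taxonomy_model(resource_key)


def _build_taxonomy_model(resource_key):
    # Placeholder for the real construction (loading the table, building
    # the Taxonomy object, etc.); it only runs on a cache miss.
    raise NotImplementedError
```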

Let me put some thought into benchmarking. I have used asv in the past, which numpy uses and which has a lot of nice features, but it could be overkill here.

wasade commented 4 years ago

I don't think we're at a stage yet for benchmarking, although some debug print statements w/ real data go a long way :)

However, one thing that could be insightful would be to set up before/after request decorators with Flask (see https://flask.palletsprojects.com/en/1.1.x/api/#flask.Flask.before_request). The idea would be to track how long each request takes to satisfy. This could be done by storing a request identifier (I don't recall what it's called), the URL used, the number of samples if "sample_ids" is in the request body, and a timestamp into a Redis (or SQLite?) database. An after_request could store the same request identifier and a timestamp. This would allow for retroactive assessment of how long it takes to field a request, broken down by the type of request.

I don't think this needs to be overthought -- just something simple to capture timings. Redis is really simple to set up (conda install redis redis-py), and storing the above could be as simple as LPUSH requestid-start url number_of_samples start_time (where number of samples is zero if not relevant), and for the after_request, HSET requestid-end end_time. SQLite is even easier to set up since it's just there, but a table structure would need to be created.
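A rough sketch of what I mean, using Flask's hooks and redis-py; the key names and fields are illustrative rather than a settled schema, and it assumes a local Redis server on the default port:

```python
import time
import uuid

import redis
from flask import Flask, g, request

app = Flask(__name__)
r = redis.Redis()  # assumes Redis is running locally on the default port


@app.before_request
def start_timer():
    # Tag the request with an identifier we can match up in after_request.
    g.request_id = uuid.uuid4().hex
    body = request.get_json(silent=True) or {}
    n_samples = len(body.get('sample_ids', []))  # zero if not relevant
    # Record the URL, sample count, and start time for this request.
    r.lpush('%s-start' % g.request_id,
            request.path, n_samples, time.time())


@app.after_request
def stop_timer(response):
    # Record the end time under the same request identifier so elapsed
    # time per request (and per request type) can be computed later.
    r.hset('%s-end' % g.request_id, 'end_time', time.time())
    return response
```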

All that said, I think URL response times are a lower priority than doing the initialization a single time and exposing beta diversity, PCoAs, and other results-oriented, value-added items.