Approach 1: query the models table with a `where` clause:
- read the input CSV
- build a comma-delimited list of SMILES
- compose a simple query: `select * from {model} where input in ({smiles})`
- execute and return the result as a DataFrame (sketched below)

This works for up to 1k inputs; at 10k it breaks down due to throttling, since the query text itself becomes far too big.
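A minimal sketch of this approach with awswrangler; the `precalc` database name and the `smiles` CSV column are assumptions, not the actual config:

```python
# Sketch of the where-clause approach; "precalc" database and
# "smiles" CSV column names are assumptions.
import awswrangler as wr
import pandas as pd

def query_in_clause(model: str, input_csv: str) -> pd.DataFrame:
    inputs = pd.read_csv(input_csv)
    # Comma-delimited, single-quoted SMILES for the IN clause
    smiles = ", ".join(f"'{s}'" for s in inputs["smiles"])
    query = f"select * from {model} where input in ({smiles})"
    # Athena rejects very large query strings, hence the ~10k input ceiling
    return wr.athena.read_sql_query(query, database="precalc")
```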
Approach 2: upload the input request to S3 with awswrangler into a "data lake":
- read and format the input CSV
- add supplementary information (model ID and request ID)
- write back to S3 as partitioned Parquet (partitioned by model and request)
- query the precalc database with this query (an end-to-end sketch follows it):
```python
query = f"""
select
    p.key,
    p.input,
    p.mw
from
    {model_id} p
inner join requests r
    on p.input = r.smiles
where
    r.model = '{model_id}'
    and r.request = '{request_id}';
"""
```
This way we can scale up our inputs much further. If we have one large precalcs table, we can add a model ID column and specify it as a join key, so we can effectively run multi-model queries as well. Testing with 10k inputs still took under 30s for the entire process.
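As a hypothetical sketch (the `precalcs` table name and `model` column are assumptions, not something we've built yet), the multi-model variant could look like:

```python
# Hypothetical multi-model query against one large precalcs table,
# with "model" added as both a column and a join key (names assumed)
query = f"""
select
    p.model,
    p.key,
    p.input,
    p.mw
from
    precalcs p
inner join requests r
    on p.input = r.smiles
    and p.model = r.model
where
    r.request = '{request_id}';
"""
```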
POC notebook done here: https://github.com/ersilia-os/model-inference-pipeline/pull/23
We want to use Athena to fetch sets of predictions for users instead of having to hit DynamoDB. This will allow us to retrieve larger sets of data and run more complex queries over the cloud prediction store (now acting as more of a data lake).
Since we don't have a strong low-latency requirement, and fetching a model and generating predictions already takes several minutes, returning many predictions in 20-30s still makes for a really nice user experience.