ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

S3 or Dynamo DB to store prediction? #301

Closed honeyankit closed 1 year ago

honeyankit commented 1 year ago

Feature: Ersilia wants to implement a new feature in which the outputs (key-value pairs) that models produce for a given molecule (input) are kept in some storage. When a new model is submitted to Ersilia, it should first check whether prediction values already exist for that input (molecule) in the storage and pull those values; otherwise, the model should compute on the input, generate the prediction values, and store them back to the storage.
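The check-then-compute flow described above is essentially a cache-aside pattern. A minimal, storage-agnostic sketch follows, assuming a hypothetical `predict_with_cache` helper, a plain dict standing in for the real storage, and a made-up `toy_model`; none of these names come from the Ersilia codebase.

```python
from typing import Callable, Dict, List, MutableMapping

Prediction = Dict[str, float]


def predict_with_cache(
    molecule: str,
    model: Callable[[str], Prediction],
    store: MutableMapping[str, Prediction],
) -> Prediction:
    """Return the stored prediction for `molecule`; compute and store it on a miss."""
    cached = store.get(molecule)
    if cached is not None:
        return cached  # hit: the model is not called
    prediction = model(molecule)  # miss: run the model
    store[molecule] = prediction  # write back for future requests
    return prediction


# Hypothetical stand-ins, for illustration only.
calls: List[str] = []


def toy_model(smiles: str) -> Prediction:
    calls.append(smiles)  # record each real model run
    return {"length": float(len(smiles))}


store: Dict[str, Prediction] = {}
first = predict_with_cache("CCO", toy_model, store)   # computed and stored
second = predict_with_cache("CCO", toy_model, store)  # served from the store
```

Because predictions differ from model to model, a real key would probably need to combine a model identifier with the molecule; that detail is left out here.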

Exit Criteria: Select the storage option (S3 or DynamoDB) that best suits this Ersilia feature.

honeyankit commented 1 year ago

When choosing the storage, we should keep in mind that the cost will depend on two factors:

  1. Number of API calls to the storage.
  2. Amount of storage.
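As a rough way to weigh these two factors, the monthly bill can be approximated as a request charge plus a storage charge. The sketch below assumes a simplified linear model with a single request price (in practice reads and writes are usually priced differently); the function name is hypothetical and the prices in the usage lines are placeholders, not real AWS rates. Actual rates should be taken from the pricing pages.

```python
def estimate_monthly_cost(
    requests: int,
    stored_gb: float,
    price_per_million_requests: float,
    price_per_gb_month: float,
) -> float:
    """Approximate monthly cost as request charges plus storage charges."""
    request_cost = requests / 1_000_000 * price_per_million_requests
    storage_cost = stored_gb * price_per_gb_month
    return request_cost + storage_cost


# Placeholder numbers only: 2 million requests and 10 GB stored,
# at a made-up 1.0 per million requests and 0.5 per GB-month.
example = estimate_monthly_cost(
    requests=2_000_000,
    stored_gb=10.0,
    price_per_million_requests=1.0,
    price_per_gb_month=0.5,
)
```

Running the same function with each service's real rates and the expected request volume would give a like-for-like comparison.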

S3 pricing: https://aws.amazon.com/s3/pricing/

DynamoDB pricing: https://aws.amazon.com/dynamodb/pricing/on-demand/

honeyankit commented 1 year ago

@miquelduranfrigola : What is Ersilia's budgetary allotment for AWS storage? This will have a significant influence on whether S3 or Dynamo DB is chosen.

GemmaTuron commented 1 year ago

Hey @honeyankit,

Clearly, S3 is much cheaper than DynamoDB, so if we have the money, what are the advantages of paying for DynamoDB? From the little I know about AWS, better query functionality?

FaithKovi commented 1 year ago

Hello @honeyankit and @GemmaTuron. I think that, aside from pricing, the type of data being stored is very important too. Based on your description, the data is in key-value pairs. DynamoDB and S3 both have great features:

Price: S3 is cheaper than DynamoDB, so the budget allotment will of course be a major determining factor.

Management: In the long run, managing bucket policies on S3 can become cumbersome as the application scales, while DynamoDB is a fully managed serverless service.

Data: DynamoDB supports key-value and document data models, while S3 is an object store that can hold data for virtually any use case.

Functionality: S3 supports parallel requests, while DynamoDB is a NoSQL database built for highly concurrent read/write requests, with throughput that scales with demand.

Note: DynamoDB and S3 can be integrated too.

I would recommend DynamoDB, though, if the budget allocation for AWS storage can cover its pricing.

muskansawa commented 1 year ago

> Hey @honeyankit,
>
> Clearly, S3 is much cheaper than DynamoDB, so if we have the money, what are the advantages of paying for DynamoDB? From the little I know about AWS, better query functionality?

@GemmaTuron DynamoDB is suitable for storing tabular or textual data in the form of key-value pairs, where the key is the field name and the value is its value.

as in:

{
'model' : 'x',
'accuracy': 90
}

S3, by contrast, is better suited to storing files such as PDFs or CSVs. Say we want to keep the predicted output.csv file for future reference: we would store the CSV file in S3, and we might store the link to its S3 bucket location and the metadata about the file in DynamoDB.

I would suggest that the choice of tool should also depend on the functionality required:

miquelduranfrigola commented 1 year ago

Thanks all for this interesting discussion.

We had a meeting with one of our best contributors, who is quite familiar with AWS.

Based on budget and usage needs, we decided to go for DynamoDB. We will set this up promptly. We hope it won't be a blocker.

@honeyankit - is this blocking us at this stage? If so, please let us know and we will try to have a working solution ASAP.

honeyankit commented 1 year ago

Thank you for all your feedback.

@miquelduranfrigola: DynamoDB would be the preferred choice, as it will store each molecule and its prediction as a key-value pair, and the logic for checking and retrieving a key/value can be implemented simply, in a single call.

I know we have already selected DynamoDB, but I am still sharing my initial thoughts on implementing this feature with both storage options.

DynamoDB Implementation
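A minimal sketch of how the lookup could work with DynamoDB, offered as an illustration rather than the project's actual design. It assumes a table keyed by hypothetical `model_id` and `molecule` attributes, with the result in a hypothetical `prediction` attribute. The `get_item(Key=...)` and `put_item(Item=...)` calls mirror the boto3 `Table` resource methods; `FakeTable` is an in-memory stand-in so the example runs without AWS access.

```python
from decimal import Decimal
from typing import Any, Callable, Dict, List, Tuple

Prediction = Dict[str, Decimal]  # boto3's resource API expects Decimal, not float


def get_or_compute(
    table: Any,
    model_id: str,
    molecule: str,
    compute: Callable[[str], Prediction],
) -> Prediction:
    """Look up a stored prediction with one get_item call; compute and store on a miss."""
    key = {"model_id": model_id, "molecule": molecule}  # assumed key schema
    response = table.get_item(Key=key)
    item = response.get("Item")  # "Item" is absent when the key does not exist
    if item is not None:
        return item["prediction"]
    prediction = compute(molecule)
    table.put_item(Item={**key, "prediction": prediction})
    return prediction


class FakeTable:
    """In-memory stand-in exposing only the two Table methods used above."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def get_item(self, Key: Dict[str, str]) -> Dict[str, Any]:
        item = self._items.get((Key["model_id"], Key["molecule"]))
        return {"Item": item} if item is not None else {}

    def put_item(self, Item: Dict[str, Any]) -> Dict[str, Any]:
        self._items[(Item["model_id"], Item["molecule"])] = Item
        return {}


runs: List[str] = []


def toy_compute(smiles: str) -> Prediction:
    runs.append(smiles)  # record each real computation
    return {"score": Decimal("0.5")}


table = FakeTable()
a = get_or_compute(table, "model-x", "CCO", toy_compute)  # miss: computed and stored
b = get_or_compute(table, "model-x", "CCO", toy_compute)  # hit: read from the table
```

With real AWS access, `table` would instead come from something like `boto3.resource("dynamodb").Table(...)`, with a table name and key schema chosen by the project.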

S3 implementation
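A comparable sketch for S3, again only an illustration under stated assumptions: one JSON object per (model, molecule) pair, stored under a hypothetical key of the form `<model_id>/<sha256 of molecule>.json` (hashing avoids awkward characters in object keys). The `get_object`, `put_object`, and `exceptions.NoSuchKey` names mirror the boto3 S3 client; `FakeS3` is an in-memory stand-in so the example runs without AWS access.

```python
import hashlib
import io
import json
from typing import Any, Callable, Dict, List, Tuple

Prediction = Dict[str, float]


def get_or_compute_s3(
    s3: Any,
    bucket: str,
    model_id: str,
    molecule: str,
    compute: Callable[[str], Prediction],
) -> Prediction:
    """Fetch a stored prediction object; compute and upload it if the key is missing."""
    digest = hashlib.sha256(molecule.encode("utf-8")).hexdigest()
    object_key = f"{model_id}/{digest}.json"  # assumed key layout
    try:
        response = s3.get_object(Bucket=bucket, Key=object_key)
    except s3.exceptions.NoSuchKey:
        prediction = compute(molecule)
        body = json.dumps(prediction).encode("utf-8")
        s3.put_object(Bucket=bucket, Key=object_key, Body=body)
        return prediction
    return json.loads(response["Body"].read())


class FakeS3:
    """In-memory stand-in exposing only the client members used above."""

    class exceptions:
        class NoSuchKey(Exception):
            pass

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def get_object(self, Bucket: str, Key: str) -> Dict[str, Any]:
        if (Bucket, Key) not in self._objects:
            raise self.exceptions.NoSuchKey(Key)
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}

    def put_object(self, Bucket: str, Key: str, Body: bytes) -> Dict[str, Any]:
        self._objects[(Bucket, Key)] = Body
        return {}


s3_runs: List[str] = []


def toy_predict(smiles: str) -> Prediction:
    s3_runs.append(smiles)  # record each real computation
    return {"score": 0.5}


s3 = FakeS3()
p1 = get_or_compute_s3(s3, "example-bucket", "model-x", "CCO", toy_predict)  # miss
p2 = get_or_compute_s3(s3, "example-bucket", "model-x", "CCO", toy_predict)  # hit
```

Compared with the key-value lookup in DynamoDB, this approach handles a missing entry through an exception and needs serialization to and from an object body, which is part of why the key-value store reads as the simpler fit here.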

honeyankit commented 1 year ago

> is this blocking us at this stage?

This is not blocking at the moment; I just wanted to reach a conclusion on which storage to select.

honeyankit commented 1 year ago

> Based on budget and usage needs, we decided to go for DynamoDB.

Based on @miquelduranfrigola's comment, we are going with DynamoDB.