S3 or Dynamo DB to store prediction?

honeyankit commented 1 year ago

Feature: Ersilia wants to implement a new feature in which the outputs (key-value pairs) that models produce for a molecule (the input) are kept in some storage. When a new model is submitted to Ersilia, it should first check whether prediction values already exist in the storage for that input (molecule) and pull those values; otherwise, the model should compute the predictions for the input and store them back to the storage.

Exit Criteria: To select the appropriate storage (S3 or Dynamo DB) which will suit the Ersilia feature.
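The check-then-compute flow described above can be sketched independently of the storage choice. This is only an illustration: `predict_with_cache`, the `store` mapping, and the use of the molecule string alone as the key are assumptions, not existing Ersilia code.

```python
from typing import Callable, Dict, Iterable, MutableMapping

Prediction = Dict[str, float]


def predict_with_cache(
    molecules: Iterable[str],
    model: Callable[[str], Prediction],
    store: MutableMapping[str, Prediction],
) -> Dict[str, Prediction]:
    """Return one prediction per molecule, running the model only on misses."""
    results: Dict[str, Prediction] = {}
    for molecule in molecules:
        cached = store.get(molecule)
        if cached is None:
            # Not in storage yet: compute the prediction and write it back.
            cached = model(molecule)
            store[molecule] = cached
        results[molecule] = cached
    return results
```

Here `store` is a plain in-memory mapping standing in for whichever backend is chosen (S3 or DynamoDB).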

honeyankit commented 1 year ago

When choosing the storage, we should keep in mind that the price will rely on two factors:

Number of API calls to the storage.
Amount of storage.

S3 pricing: https://aws.amazon.com/s3/pricing/
DynamoDB pricing: https://aws.amazon.com/dynamodb/pricing/on-demand/
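As a rough way to compare the two factors, the cost can be written as a request term plus a storage term. The function name and the rates in the usage lines are placeholders chosen for illustration; they are not real AWS prices, which should be taken from the pricing pages above.

```python
def estimate_monthly_cost(
    n_requests: int,
    stored_gb: float,
    price_per_request: float,
    price_per_gb_month: float,
) -> float:
    """Request cost plus storage cost for one month, in the currency of the rates."""
    return n_requests * price_per_request + stored_gb * price_per_gb_month


# Placeholder rates only (NOT actual AWS prices).
example = estimate_monthly_cost(
    n_requests=1000, stored_gb=2.0, price_per_request=0.001, price_per_gb_month=0.5
)
```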

honeyankit commented 1 year ago

@miquelduranfrigola : What is Ersilia's budgetary allotment for AWS storage? This will have a significant influence on whether S3 or Dynamo DB is chosen.

GemmaTuron commented 1 year ago

Hey @honeyankit,

clearly, S3 is much cheaper than DynamoDB, so if we have the money, what are the advantages of paying for DynamoDB? From the little I know about AWS, better query functionalities?

FaithKovi commented 1 year ago

Hello, @honeyankit and @GemmaTuron. I think that, aside from the pricing, the type of data being stored is very important too. Based on your description, the data is in key-value pairs. DynamoDB and S3 both have great features:

Price: S3 is cheaper than DynamoDB (the budget allotment would of course be a great determining factor).
Management: In the long run, managing bucket policies on S3 can become cumbersome as the application scales, while DynamoDB is a fully managed serverless service.
Data: DynamoDB supports key-value and document data models, while S3 can store objects for virtually any use case.
Functionality: S3 supports parallel requests, while DynamoDB is a NoSQL database with high concurrency for read/write requests and virtually unlimited throughput.

Note: DynamoDB and S3 can be integrated too.

I would recommend DynamoDB though if the budget allocation for AWS storage meets the pricing.

muskansawa commented 1 year ago

Hey @honeyankit,

clearly, S3 is much cheaper than DynamoDB, so if we have the money, what are the advantages of paying for DynamoDB? From the little I know about AWS, better query functionalities?

@GemmaTuron DynamoDB is suitable for storing tabular or textual data in the form of key-value pairs, where the key is the field and the value is its value.

as in:

{
'model' : 'x',
'accuracy': 90
}

S3, on the other hand, is more suitable for storing files such as PDFs or CSVs. Let's say we want to keep the predicted output.csv file for future reference: we would store the CSV file in S3, and we might store the link to the S3 object and the metadata about the file in DynamoDB.

I would like to suggest that the choice of tool also depends on the functionality we need:

if we want, we can store each JSON output directly in a DynamoDB table as separate rows; storing it in DynamoDB will enable us to query it later.
or we can store the output CSV file in the S3 bucket
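The two options can be sketched as two item shapes. The attribute names (`molecule`, `prediction`, `s3_uri`) and both helper functions are hypothetical, picked only to illustrate the idea.

```python
import json
from typing import Dict


def inline_item(molecule: str, prediction: Dict[str, float]) -> Dict[str, str]:
    """Option 1: keep the JSON output itself in the DynamoDB row."""
    return {"molecule": molecule, "prediction": json.dumps(prediction)}


def pointer_item(molecule: str, bucket: str, key: str) -> Dict[str, str]:
    """Option 2: keep the file in S3 and store only a link to it in DynamoDB."""
    return {"molecule": molecule, "s3_uri": f"s3://{bucket}/{key}"}
```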

miquelduranfrigola commented 1 year ago

Thanks all for this interesting discussion.

We had a meeting with one of our best contributors, who is quite familiar with AWS.

Based on budget and on the usage needs, we decided to go for DynamoDB. We will set this up promptly. We hope it won't be a blocker.

@honeyankit - is this blocking us at this stage? If so, please let us know and we will try to have a working solution ASAP.

honeyankit commented 1 year ago

Thank you for all your feedback.

@miquelduranfrigola : DynamoDB would be the preferred choice, as it will store each molecule and its prediction as a key/value pair, and it will be simple to implement the logic of checking for and retrieving the key/value in a single call.

I know we have already selected DynamoDB, but I am still putting down my initial thoughts on implementing this feature with both storage options.

DynamoDB Implementation

If the model is working on 100 inputs (100 molecules), it will make 100 calls to DynamoDB to check for or retrieve the values.
Each molecule and its prediction value will be stored as a key/value pair in DynamoDB.
Compute time on GitHub Actions will be saved, because checking for and retrieving a key/value pair is done with a single call to DynamoDB per molecule.
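A minimal sketch of the per-molecule lookup, assuming a table whose partition key is named `molecule` (the key name, the `prediction` attribute, and `get_or_compute` are illustrative assumptions). The function only relies on the `get_item` and `put_item` methods that a boto3 DynamoDB `Table` resource provides, so it can be exercised with a stand-in object.

```python
import json
from typing import Any, Callable, Dict

Prediction = Dict[str, float]


def get_or_compute(
    table: Any, molecule: str, model: Callable[[str], Prediction]
) -> Prediction:
    """Look the molecule up with one read; compute and store it on a miss."""
    # `table` is expected to behave like boto3.resource("dynamodb").Table(...).
    response = table.get_item(Key={"molecule": molecule})
    item = response.get("Item")  # "Item" is absent when the key does not exist
    if item is not None:
        return json.loads(item["prediction"])
    prediction = model(molecule)
    # Stored as a JSON string because the boto3 resource API does not accept
    # Python floats as numeric attribute values (it expects Decimal).
    table.put_item(Item={"molecule": molecule, "prediction": json.dumps(prediction)})
    return prediction
```

A cache hit costs one read; a miss costs one read plus one write.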

S3 implementation

This can also be done using S3, but that will add to the Actions compute time.
Each molecule will be stored as a file in the S3 bucket along with its prediction value. The file name will match the molecule name exactly, which makes it easy to retrieve later.
If the model is working on 100 inputs (100 molecules), it will make up to 200 calls to the S3 bucket to check for and retrieve the files. The first call determines whether a file with the molecule name already exists; if it does, a second call downloads the file from the S3 bucket so that the prediction value can be read.
Downloading the file and reading the prediction value will cost some extra compute time on the Actions runner, but not a large amount.
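The call counts for the two designs can be compared with a little arithmetic. The function below is a sketch of that comparison under the assumptions in this comment (one DynamoDB read per molecule; one S3 existence check per molecule plus one download per molecule already stored); the function name and dictionary keys are made up for illustration.

```python
from typing import Dict


def count_storage_calls(n_molecules: int, n_cached: int) -> Dict[str, int]:
    """Count lookup calls per backend, plus the writes needed for misses."""
    if not 0 <= n_cached <= n_molecules:
        raise ValueError("n_cached must be between 0 and n_molecules")
    misses = n_molecules - n_cached
    return {
        # One call both checks for and retrieves the value.
        "dynamodb_reads": n_molecules,
        # One existence check per molecule, one download per stored molecule.
        "s3_reads": n_molecules + n_cached,
        # Either backend needs one write per newly computed prediction.
        "writes": misses,
    }
```

With 100 molecules that are all already stored, this gives 100 DynamoDB reads against 200 S3 calls; with none stored, both designs make 100 lookups and 100 writes.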

honeyankit commented 1 year ago

is this blocking us at this stage?

This is not blocking at the moment; I just wanted to come to a conclusion on selecting the storage.

honeyankit commented 1 year ago

Based on budget and on the usage needs, we decided to go for DynamoDb

Based on @miquelduranfrigola's comment, we are going with DynamoDB.

ersilia-os / ersilia

S3 or Dynamo DB to store prediction? #301