Closed TommasoBendinelli closed 4 months ago
Hi Tommaso,
Thanks for reaching out and for your interest in our connector! I need to look into the cost aspect a bit and will get back to you.
Thanks, Diana.
Hi Tommaso,
Following up on your question: before jumping into costs, let me give you a bit of context on how our connector works for dataset creation. S3 Connector for PyTorch provides two ways of constructing your dataset:

1. passing in a prefix (which, under the hood, will issue LIST request(s));
2. providing an iterable of object URIs, in case you can determine them upfront and save the cost and time of listing. For example, if your keys follow a certain pattern, you can pass in a generator for them, as in the sketch below.
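For illustration, here is a minimal sketch of option 2, assuming a hypothetical bucket and a numbered key pattern (the names are made up for the example):

```python
from s3torchconnector import S3MapDataset

# Hypothetical layout: s3://my-bucket/train/00000.jpg ... s3://my-bucket/train/99999.jpg
def object_uris():
    for i in range(100_000):
        yield f"s3://my-bucket/train/{i:05d}.jpg"

# The keys are generated upfront, so no LIST requests are needed.
dataset = S3MapDataset.from_objects(object_uris(), region="us-east-1")
```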
Let’s dive deep on option 1: with S3MapDataset.from_prefix(...), we will try to list the objects under the prefix independently of the number of epochs, as we cache the result, but the number of listings does depend on the number of dataset copies created for the workers.
For example:
```python
from torch.utils.data import DataLoader
from s3torchconnector import S3MapDataset

dataset = S3MapDataset.from_prefix(s3_uri=prefix, region=region, transform=...)
dataloader = DataLoader(dataset, num_workers=0)
for epoch in range(num_epochs):
    ...  # go through dataloader items
```
- num_workers = 0 -> will list the objects under the prefix once
- num_workers = 2 -> will try to list the objects under the prefix once for the DataLoader copy, then once for each worker copy, so it will try to list them 3 times
For S3MapDataset, the listing will occur (num_workers + 1) times.

For S3IterableDataset.from_prefix(...), the objects under the prefix will be listed (num_workers * num_epochs) times; for example, with num_workers = 3 and num_epochs = 5, the listing will be done 3 * 5 = 15 times.

Please note that in this context, for both dataset types, listing the objects under the prefix once is not necessarily equivalent to one LIST request to S3.
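To make the two counts concrete, here is a tiny plain-Python sketch of the formulas above (no connector API involved):

```python
def map_dataset_listings(num_workers: int) -> int:
    # S3MapDataset: one listing for the DataLoader's copy plus one per worker.
    return num_workers + 1

def iterable_dataset_listings(num_workers: int, num_epochs: int) -> int:
    # S3IterableDataset: the prefix is re-listed by every worker on every epoch.
    return num_workers * num_epochs

assert map_dataset_listings(2) == 3
assert iterable_dataset_listings(3, 5) == 15
```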
Now, let’s see what listing the objects under the prefix once means in terms of requests that actually get to S3. The call we make for listing is ListObjectsV2, which:
Returns some or all (up to 1000) of the objects in a bucket with each request.
In your specific scenario of 100,000 objects, we’ll issue 100,000 / 1,000 = 100 ListObjectsV2 requests to list the objects under the given prefix once.
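If you want to observe the pagination yourself, here is a rough sketch using boto3 directly (the bucket and prefix are hypothetical); each page of the paginator corresponds to one ListObjectsV2 request:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
paginator = s3.get_paginator("list_objects_v2")

num_requests = 0
num_objects = 0
for page in paginator.paginate(Bucket="my-bucket", Prefix="train/"):
    num_requests += 1
    num_objects += page.get("KeyCount", 0)

# With 100,000 objects and up to 1,000 keys per page, num_requests == 100.
print(num_requests, num_objects)
```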
According to the official calculator, listing the objects once in your case, i.e. 100 LIST requests, will cost:
100 LIST requests for S3 Standard Storage x 0.000005 USD per request = 0.0005 USD (S3 Standard LIST requests cost)
Then, depending on your implementation, this will get multiplied as explained above. For a more accurate estimate for your use case, please use the calculator, as you know the details of your implementation best.
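As a back-of-the-envelope sketch of that arithmetic (using the prices quoted above; check the calculator for current rates):

```python
NUM_OBJECTS = 100_000
OBJECTS_PER_LIST_REQUEST = 1_000       # ListObjectsV2 returns up to 1,000 keys
COST_PER_LIST_REQUEST_USD = 0.000005   # S3 Standard LIST price quoted above

requests_per_listing = NUM_OBJECTS // OBJECTS_PER_LIST_REQUEST           # 100
cost_per_listing_usd = requests_per_listing * COST_PER_LIST_REQUEST_USD  # 0.0005

# Example: S3MapDataset with num_workers = 2 -> (2 + 1) = 3 listings.
print(f"{3 * cost_per_listing_usd:.6f} USD")  # 0.001500 USD
```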
Of course, in addition to this, there are also the costs associated with the actual retrieval of the object content when going through the dataloader’s items.
Thank you for your detailed response. I have a couple of follow-up questions:
Is it possible to cache the S3MapDataset before training begins and make this cache persistent? This way, if I need to restart training, I wouldn't have to repeat the caching process every time for each worker.
If the S3 bucket is located in the same region as the training process, then there should be no cost of retrieval of the object content, right?
Hi Tommaso,
Thanks a lot for your interest in using S3 Connector for PyTorch.
Caching is an interesting topic for the PyTorch connectors. Are you interested in caching object keys (to avoid LISTs) or object data? If the former, you can always use the S3MapDataset.from_objects() or S3IterableDataset.from_objects() methods, which let you pass the list of S3 keys you would like to access. We would love to learn more about your use case so we can understand how caching would benefit your workload.
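As a starting point, here is a minimal sketch of caching object keys: list the prefix once with boto3, persist the keys to a local file, and replay them through from_objects() on later runs (the file path and bucket/prefix names are hypothetical):

```python
import os
import boto3
from s3torchconnector import S3MapDataset

KEY_CACHE = "s3_keys.txt"  # hypothetical local cache file

def load_or_list_uris(bucket: str, prefix: str, region: str) -> list[str]:
    # Reuse the cached keys if a previous run already listed the prefix.
    if os.path.exists(KEY_CACHE):
        with open(KEY_CACHE) as f:
            return f.read().splitlines()
    # Otherwise, list once with boto3 and persist the result.
    paginator = boto3.client("s3", region_name=region).get_paginator("list_objects_v2")
    uris = [
        f"s3://{bucket}/{obj['Key']}"
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    ]
    with open(KEY_CACHE, "w") as f:
        f.write("\n".join(uris))
    return uris

uris = load_or_list_uris("my-bucket", "train/", "us-east-1")
dataset = S3MapDataset.from_objects(uris, region="us-east-1")
```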
You would not pay for data transfer if your S3 bucket and your compute are located in the same AWS region. There would still be a request cost, though, which is 0.0004 USD per 1,000 requests for S3 Standard in us-east-1.
Please let us know if you have any follow-up questions.
Fuat.
Hi @TommasoBendinelli, has my answer above addressed your questions? I am going to close this issue for now; please let us know if you have any other questions.
Dear AWS Labs Team,
I'm writing to inquire about the underlying mechanism used by the S3 connector. Specifically, does it rely on LIST requests to access objects within an S3 bucket?
If so, is the following cost estimation associated with iterating over objects for neural network training correct? Let's consider a scenario with a bucket containing 100,000 objects and a standard LIST request cost of 0.0005 USD per 1k requests. If I train a network for 10 epochs (and in each epoch I do a full pass over the dataset), would the total cost for these requests be approximately 100k * 10 * 0.0005 USD / 1k = 0.5 USD? Thank you for your time and assistance.
Best regards,