MIT-LCP / physionet-build

The new PhysioNet platform.
https://physionet.org/
BSD 3-Clause "New" or "Revised" License

S3: usage monitoring #2098

Open bemoody opened 12 months ago

bemoody commented 12 months ago

(Splitting this off from #2093 as this is kind of a separate issue, and affects both public and restricted data.)

As we post data on Amazon S3, we want to be able to gather some metrics of usage. "Metrics" might include things like:

Gathering such metrics isn't essential, nor do I particularly care which metrics we capture, but it would be highly desirable to have some way of measuring a project's usage. We should try to understand the monitoring services that Amazon provides, since their pricing and technical limitations may have a major impact on how we want to structure the data buckets.

As usual, I have zero actual experience or inside knowledge and am trying to guess, based on the buzzword-infested public documentation, how the AWS system actually works and what services might provide what we're looking for.

bemoody commented 12 months ago

CloudWatch (https://docs.aws.amazon.com/AmazonS3/latest/userguide/cloudwatch-monitoring.html)

This provides (https://aws.amazon.com/cloudwatch/pricing/):

These are supposedly provided with one-minute granularity.

I don't know what a "Metric" is - a single number? If we wanted to collect four Metrics per month for each published project, that's already a non-trivial expense.

Also note this (https://docs.aws.amazon.com/AmazonS3/latest/userguide/metrics-configurations.html):

You can have a maximum of 1,000 metrics configurations per bucket.

Is a "Metrics Configuration" the same thing as a "Metric"? I'm guessing not.

If we had 100 buckets, and we wanted to know the number of requests for each bucket, how many Metrics Configurations would be required? How many Metrics would we be charged for?

If we had 100 prefixes within a single bucket, and we wanted to know the number of requests for each prefix, how many Metrics Configurations would be required? How many Metrics would we be charged for?
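To make the per-prefix question concrete, here is a rough sketch. It assumes that each request-metrics configuration carries a single filter, so one configuration is needed per prefix; that reading of the docs is an assumption, not something confirmed in this thread. The helper name `build_prefix_metrics_configs`, the `Id` naming scheme, and the `project-NNN/` prefixes are all made up for illustration. The code only builds the dictionaries and does not talk to AWS, and it says nothing about how many "Metrics" would be billed.

```python
MAX_CONFIGS_PER_BUCKET = 1000  # per-bucket limit quoted from the S3 docs above


def build_prefix_metrics_configs(prefixes):
    """Return one request-metrics configuration dict per prefix.

    Assumption: one configuration (one filter) per prefix.
    The Id naming scheme is hypothetical.
    """
    if len(prefixes) > MAX_CONFIGS_PER_BUCKET:
        raise ValueError(
            f"{len(prefixes)} prefixes exceed the limit of "
            f"{MAX_CONFIGS_PER_BUCKET} metrics configurations per bucket"
        )
    configs = []
    for i, prefix in enumerate(prefixes):
        configs.append({
            "Id": f"prefix-{i:04d}",
            "Filter": {"Prefix": prefix},
        })
    return configs


# Hypothetical layout: 100 projects, each under its own top-level prefix.
prefixes = [f"project-{n:03d}/" for n in range(100)]
configs = build_prefix_metrics_configs(prefixes)
print(len(configs), "configurations needed;",
      MAX_CONFIGS_PER_BUCKET - len(configs), "left under the per-bucket limit")
```

If that assumption holds, each dict could be passed as `MetricsConfiguration` (with `Id=cfg["Id"]`) to boto3's `put_bucket_metrics_configuration`. With one bucket per project instead, each bucket would presumably need only a single unfiltered configuration.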

bemoody commented 12 months ago

S3 Storage Lens (https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage_lens_basics_metrics_recommendations.html)

This provides (https://aws.amazon.com/s3/pricing/):

(FWIW, we currently have about 31 million files on PhysioNet.)

Metrics are supposedly provided with one-day granularity (or at least, the data is exported once per day).

Here they document what things can be measured: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage_lens_metrics_glossary.html

"Advanced metrics" include "Prefix aggregation", which sounds like what we'd want. It's hard to find documentation, but the example JSON file shown here is suggestive: https://docs.aws.amazon.com/AmazonS3/latest/userguide/S3LensCLIExamples.html

bemoody commented 12 months ago

Finally, another possibility would be to store complete request logs (which can be dumped into another S3 bucket) and analyze them ourselves. The storage and transfer costs would likely be considerable.
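As a minimal sketch of what "analyze them ourselves" might look like, the snippet below parses the leading fields of S3 server access log records and totals GET requests and bytes sent per project. It assumes each project lives under its own top-level key prefix, which is a hypothetical layout rather than anything decided here. The function name `summarize_by_project` and the sample log lines are invented for illustration. Lines that don't match the pattern are silently skipped, so a real analysis would need to be more careful.

```python
import re
from collections import defaultdict

# Leading fields of an S3 server access log record, up to "object size".
# Anything after those fields is ignored.
LOG_PATTERN = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request_uri>[^"]*)" (?P<status>\S+) (?P<error>\S+) '
    r'(?P<bytes_sent>\S+) (?P<object_size>\S+)'
)


def summarize_by_project(lines):
    """Count object GET requests and bytes sent per top-level key prefix."""
    totals = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is None or m["operation"] != "REST.GET.OBJECT":
            continue  # skip unparsable lines and non-download operations
        project = m["key"].split("/", 1)[0]
        sent = m["bytes_sent"]  # "-" when no bytes were sent
        totals[project]["requests"] += 1
        totals[project]["bytes"] += int(sent) if sent.isdigit() else 0
    return dict(totals)


# Made-up sample records (not real log data).
SAMPLE = [
    'ownerid example-bucket [06/Feb/2024:00:00:38 +0000] 192.0.2.3 - REQ1 '
    'REST.GET.OBJECT projA/1.0/RECORDS "GET /projA/1.0/RECORDS HTTP/1.1" '
    '200 - 1024 1024 12 10 "-" "curl/8.0" -',
    'ownerid example-bucket [06/Feb/2024:00:01:02 +0000] 192.0.2.4 - REQ2 '
    'REST.GET.OBJECT projA/1.0/data.dat "GET /projA/1.0/data.dat HTTP/1.1" '
    '200 - 4096 4096 30 25 "-" "curl/8.0" -',
    'ownerid example-bucket [06/Feb/2024:00:02:10 +0000] 192.0.2.5 - REQ3 '
    'REST.GET.OBJECT projB/2.1/README "GET /projB/2.1/README HTTP/1.1" '
    '304 - - 300 5 - "-" "curl/8.0" -',
    'ownerid example-bucket [06/Feb/2024:00:03:00 +0000] 192.0.2.6 - REQ4 '
    'REST.PUT.OBJECT projB/2.1/new.dat "PUT /projB/2.1/new.dat HTTP/1.1" '
    '200 - - 2048 40 35 "-" "aws-cli/2.0" -',
]

summary = summarize_by_project(SAMPLE)
for project, stats in sorted(summary.items()):
    print(project, stats["requests"], "requests,", stats["bytes"], "bytes")
```

This kind of per-record pass is simple, but it has to read every log line, which is where the storage and transfer costs mentioned above would come from.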