denoland / deno_registry2

The backend for the deno.land/x service
https://deno.land/x
MIT License

Track download counts #57

Open lucacasonato opened 4 years ago

lucacasonato commented 4 years ago

Goal

We want to have a graph of module download counts like crates.io has. This means we should store a download count per module (or per module version, or per file) per day.

How to implement

Through discussions with @wperron on Discord, we came up with three relatively simple solutions:

  1. SQS + Lambda + MongoDB (a rough Worker-side sketch follows this list)
    • Have the Cloudflare Worker that serves the raw files add an event to an SQS queue every time it serves a file.
    • Have an AWS Lambda take events out of this queue in batches of 500 and persist them into MongoDB.
    • Create an API endpoint that serves the download count per module (or per module version or per file).
  2. Kinesis Firehose + S3 + Lambda + MongoDB
    • Have the Cloudflare Worker that serves the raw files add an event to a Kinesis Firehose stream every time it serves a file.
    • Have Kinesis Firehose persist this data into S3 in batches.
    • Have a cron-triggered Lambda that takes batches out of S3 and persists them into MongoDB.
    • Create an API endpoint that serves the download count per module (or per module version or per file).
  3. Cloudflare Workers + Cloudflare Analytics API + MongoDB
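
For illustration, here's roughly what the Worker side of option 1 could look like. This is only a sketch: the queue URL, secret bindings, and event shape are made up, and it assumes the aws4fetch library for SigV4 request signing.

```typescript
import { AwsClient } from "aws4fetch";

// Assumed Worker secret bindings (hypothetical names).
declare const AWS_ACCESS_KEY_ID: string;
declare const AWS_SECRET_ACCESS_KEY: string;

// Hypothetical queue; aws4fetch infers service/region from the hostname.
const QUEUE_URL =
  "https://sqs.us-east-1.amazonaws.com/123456789012/download-events";

const aws = new AwsClient({
  accessKeyId: AWS_ACCESS_KEY_ID,
  secretAccessKey: AWS_SECRET_ACCESS_KEY,
});

// Called after the Worker has served a raw file. waitUntil lets the
// enqueue finish in the background, so it never delays the response.
function recordDownload(
  event: FetchEvent,
  module: string,
  version: string,
  path: string,
): void {
  const body = new URLSearchParams({
    Action: "SendMessage",
    Version: "2012-11-05",
    MessageBody: JSON.stringify({ module, version, path, ts: Date.now() }),
  }).toString();
  event.waitUntil(aws.fetch(QUEUE_URL, {
    method: "POST",
    headers: { "content-type": "application/x-www-form-urlencoded" },
    body,
  }));
}
```

Serving the file never waits on SQS; if an enqueue fails we just lose one count, which seems acceptable for analytics.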

~I am personally more in favour of solution 1 because I feel it is relatively simple to set up (haven't used Kinesis Firehose before).~

I prefer option 3 if we have access to the Cloudflare Logpull API. You need to be an Enterprise customer to make use of it, though.
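
For illustration, option 3's ingestion could look roughly like this. The zone ID, token variable, and URI pattern are all assumptions; Logpull serves NDJSON (one JSON object per log line) over bounded, recent time windows.

```typescript
// Pull one minute of logs and tally raw-file downloads per module.
const ZONE_ID = "0123456789abcdef0123456789abcdef"; // hypothetical
const url = new URL(
  `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/logs/received`,
);
url.searchParams.set("start", "2020-08-20T00:00:00Z");
url.searchParams.set("end", "2020-08-20T00:01:00Z");
url.searchParams.set("fields", "ClientRequestURI,EdgeResponseStatus");

const res = await fetch(url, {
  headers: { Authorization: `Bearer ${Deno.env.get("CF_API_TOKEN")}` },
});

const counts = new Map<string, number>(); // module -> download count
for (const line of (await res.text()).split("\n")) {
  if (!line) continue;
  const { ClientRequestURI, EdgeResponseStatus } = JSON.parse(line);
  // Assumes raw-file URIs of the form /x/<module>@<version>/<path>.
  const match = ClientRequestURI.match(/^\/x\/([^@/]+)/);
  if (match && EdgeResponseStatus === 200) {
    counts.set(match[1], (counts.get(match[1]) ?? 0) + 1);
  }
}
```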

Decisions to make

wperron commented 4 years ago

There's also the possibility of passing the data straight from Firehose to MongoDB by configuring the Firehose to use an HTTP destination and pointing it at Mongo's HTTP API. It would save on the cost of the S3 bucket and the Lambda execution.
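
If we went that route, the delivery stream config might look something like the sketch below (AWS SDK v3). The endpoint URL, access key, and ARNs are placeholders; note Firehose still requires an S3 config, here only for records that fail to deliver.

```typescript
import {
  CreateDeliveryStreamCommand,
  FirehoseClient,
} from "@aws-sdk/client-firehose";

const firehose = new FirehoseClient({ region: "us-east-1" });

await firehose.send(
  new CreateDeliveryStreamCommand({
    DeliveryStreamName: "download-events", // hypothetical
    HttpEndpointDestinationConfiguration: {
      EndpointConfiguration: {
        Name: "mongodb",
        Url: "https://example.com/ingest/downloads", // placeholder for Mongo's HTTP API
        AccessKey: "<api key>",
      },
      // Keep only failed deliveries in the backup bucket.
      S3BackupMode: "FailedDataOnly",
      S3Configuration: {
        RoleARN: "arn:aws:iam::123456789012:role/firehose-backup", // placeholder
        BucketARN: "arn:aws:s3:::download-events-failed", // placeholder
      },
    },
  }),
);
```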

lucacasonato commented 4 years ago

In this case we would be storing each event in MongoDB though, right? That seems kinda bad because of the number of events we receive on a daily basis. I think we should have a single MongoDB document per module (or module version or file) that tracks the download count per day.
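
For illustration, a sketch of that document shape, with made-up collection and field names: an upserted `$inc` on a computed `days.<date>` key keeps one small document per module.

```typescript
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017"); // placeholder URI
await client.connect();
const counts = client.db("registry").collection("download_counts");

// One document per module, e.g.
//   { _id: "oak", days: { "2020-08-20": 412, "2020-08-21": 389 } }
async function countDownload(module: string, day: string): Promise<void> {
  // Atomic upsert: creates the document and the day's counter on first
  // sight, then increments on every later download.
  await counts.updateOne(
    { _id: module },
    { $inc: { [`days.${day}`]: 1 } },
    { upsert: true },
  );
}

await countDownload("oak", new Date().toISOString().slice(0, 10));
```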

wperron commented 4 years ago

That's a good point; I don't think Firehose allows for aggregating data on the fly. I believe Kinesis Analytics can do that, but I've never used it and I don't know how easy it would be to integrate with MongoDB.

I agree: SQS seems to be the simplest option here; there's no point in throwing data into an S3 bucket as an intermediate location.
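
To keep the writes cheap on the Mongo side, the consuming Lambda could collapse each SQS batch into one `$inc` per (module, day) before writing. A sketch, assuming the message body shape from the Worker snippet above, `@types/aws-lambda` for the handler type, and a hypothetical `MONGODB_URI` env var:

```typescript
import type { SQSEvent } from "aws-lambda";
import { MongoClient } from "mongodb";

const client = new MongoClient(process.env.MONGODB_URI!); // hypothetical env var

export async function handler(event: SQSEvent): Promise<void> {
  // Collapse the batch: "module\u0000day" -> number of downloads.
  const deltas = new Map<string, number>();
  for (const record of event.Records) {
    const { module, ts } = JSON.parse(record.body);
    const day = new Date(ts).toISOString().slice(0, 10);
    const key = `${module}\u0000${day}`;
    deltas.set(key, (deltas.get(key) ?? 0) + 1);
  }

  await client.connect(); // no-op if already connected (recent drivers)
  const counts = client.db("registry").collection("download_counts");
  // One upsert per (module, day) instead of one write per download.
  await counts.bulkWrite(
    [...deltas].map(([key, n]) => {
      const [module, day] = key.split("\u0000");
      return {
        updateOne: {
          filter: { _id: module },
          update: { $inc: { [`days.${day}`]: n } },
          upsert: true,
        },
      };
    }),
  );
}
```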

n10000k commented 4 years ago

  1. CloudFront
  2. Lambda @ Edge
  3. Kinesis stream (aggregate HTTP 206)
  4. Lambda function to update the download count in MongoDB

I also feel like we should set up a Terraform config covering all the AWS services used; open to this, @lucacasonato? That way, if you need to scale, it's a case of changing something on the fly, and the infrastructure is all expressed in code.

n10000k commented 4 years ago

I would also look into: https://docs.aws.amazon.com/AmazonS3/latest/dev/analytics-storage-class.html

lucacasonato commented 4 years ago

@narwy We don't use CloudFront or Lambda @ Edge at the minute. (Also, I quite dislike Lambda @ Edge because of the unreasonable pricing and runtime limitations; I don't want to use Node or Python.) I want to stick to the Cloudflare Worker we have now.

> I also feel like we should set up a Terraform config covering all the AWS services used

We have a CloudFormation config. I don't see the point of moving it to Terraform. I have had enough trouble with CloudFormation (it finally works) and don't really fancy repeating that 'fun' at the minute :-). Maybe in a month or two.

wperron commented 4 years ago

Any updates on the CF analytics for this issue?

lucacasonato commented 4 years ago

> Any updates on the CF analytics for this issue?

Nothing yet.