For those still looking for a (team) project idea, HyperLogLog is an interesting probabilistic data structure that is worth studying.
https://chengweihu.com/hyperloglog/
It would benefit from a large dataset in an "online" setting, that is, either live data with new datapoints arriving daily or in real time, or a large historical dataset that would be expensive to re-run.
Examples:
a web crawl of something too specialized for Google, like "videos of people playing Minecraft"
photos of a large geographical feature like the Grand Canyon
anything that you might think would be useful for a specialized AI training set
ticket sales or visitor stats of a local museum or library
these are the most interesting datasets, and they require going outside of class to talk to the staff
publicly funded institutions, which are supported by tax dollars and have a social-good mission, might be more amenable to collaborating with students on a school project
Your project will be to relate HyperLogLog to the introductory concepts we are learning in DSA, as well as code up an implementation in a programming language of your choice, run it on any dataset you choose, measure its performance and suitability, critique your implementation, and suggest opportunities for improvement.
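To give a feel for the shape of such an implementation, here is a minimal sketch in Python. It is my own illustration, not part of the assignment: the class name, the choice of p = 10, and the use of SHA-1 as the hash function are all arbitrary, and the small- and large-range corrections from the original paper are omitted for brevity.

```python
import hashlib


class HyperLogLog:
    """Minimal HyperLogLog sketch: approximate count of distinct items
    using m = 2**p small registers (constant memory, single pass)."""

    def __init__(self, p=10):
        self.p = p                     # number of bits used to pick a register
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m  # size is fixed, independent of stream length

    def _hash(self, item):
        # 64-bit hash of the item; any well-mixed hash would do
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash(item)
        idx = x >> (64 - self.p)                # first p bits choose the register
        rest = x & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits (1-based)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # raw HLL estimator: alpha_m * m^2 / sum(2^-register)
        # (this alpha formula assumes m >= 128; range corrections omitted)
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / z


if __name__ == "__main__":
    hll = HyperLogLog(p=10)
    for i in range(100_000):        # 100,000 distinct items, added one at a time
        hll.add(f"user-{i}")
    print(round(hll.estimate()))    # roughly 100,000, typically within a few percent
```

Because each item is hashed once and only the fixed-size register array is kept, a sketch like this can consume a live stream or a huge historical file in a single pass, which is why the "online" datasets above are a good fit. The relative error of the estimate is roughly 1.04/sqrt(m), so with p = 10 (m = 1024 registers) you can expect the count to land within a few percent of the true value.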