For those still looking for a (team) project idea, HyperLogLog is an interesting probabilistic data structure that is worth studying.
https://chengweihu.com/hyperloglog/
It would benefit from a large dataset in an "online" setting, that is, either live data with new datapoints arriving daily or in real time, or a large historical dataset that would be expensive to re-run.
Examples:
a web crawl of something too specialized for Google, like "videos of people playing Minecraft"
photos of a large geographical feature like the Grand Canyon
anything that you might think would be useful for a specialized AI training set
ticket sales or visitor stats of a local museum or library
these are the most interesting datasets, and they require going outside of class to talk to the staff
publicly funded institutions, which are supported by tax dollars and have a social-good mission, might be more amenable to collaborating with students on a school project
Your project will be to relate HyperLogLog to the introductory concepts we are learning in DSA, as well as code up an implementation in a programming language of your choice, run it on any dataset you choose, measure its performance and suitability, critique your implementation, and suggest opportunities for improvement.
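To give a feel for the shape of such an implementation, here is a minimal sketch in Python. It is my own illustration, not part of the assignment: the class name, the choice of p = 10, and the use of SHA-1 as the hash function are all arbitrary, and the small- and large-range corrections from the original paper are omitted for brevity.

```python
import hashlib


class HyperLogLog:
    """Minimal HyperLogLog sketch: approximate count of distinct items
    using m = 2**p small registers (constant memory, single pass)."""

    def __init__(self, p=10):
        self.p = p                     # number of bits used to pick a register
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m  # size is fixed, independent of stream length

    def _hash(self, item):
        # 64-bit hash of the item; any well-mixed hash would do
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash(item)
        idx = x >> (64 - self.p)                # first p bits choose the register
        rest = x & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits (1-based)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # raw HLL estimator: alpha_m * m^2 / sum(2^-register)
        # (this alpha formula assumes m >= 128; range corrections omitted)
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / z


if __name__ == "__main__":
    hll = HyperLogLog(p=10)
    for i in range(100_000):        # 100,000 distinct items, added one at a time
        hll.add(f"user-{i}")
    print(round(hll.estimate()))    # roughly 100,000, typically within a few percent
```

Because each item is hashed once and only the fixed-size register array is kept, a sketch like this can consume a live stream or a huge historical file in a single pass, which is why the "online" datasets above are a good fit. The relative error of the estimate is roughly 1.04/sqrt(m), so with p = 10 (m = 1024 registers) you can expect the count to land within a few percent of the true value.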