Santostang / box-office-prediction

Cornell ORIE 4741 Course Project: Machine Learning with Big Messy Data

Final Peer Review - arn39 #15

Open ahaannachane opened 6 years ago

The project aims to predict the opening-weekend box office performance of movies. It uses data collected from IMDb, Box Office Mojo, Metacritic, and Google Trends. These datasets contain information such as the amount grossed by each movie, Google search trends, critic reviews, and details on the director and actors, among others. The authors also imposed further constraints on the sample set, such as the number of theatres. They have done a good job with the predictions, and the report is well written. I have outlined my perspective on the strengths and weaknesses of the project below.

Strengths: The choice of data sources and their description is commendable. The dataset is feature rich and proved useful in making the predictions. The sources are also the most popular movie critic and review websites, so in combination they provide a lot of valuable information that may not have much bias. The models made sense for the task and were explained well; the quadratic loss functions with l1 and l2 regularizers were a good choice. The report was well organized, clear in its thought process, and well written, with good explanations throughout.
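
For reference, the regularized quadratic-loss objectives mentioned above are usually written as below. This is a sketch assuming the standard ridge and lasso forms; the symbols $X$ (feature matrix), $y$ (opening-weekend gross), $w$ (weights), and $\lambda$ (regularization strength) are my own notation and may not match the report's.

$$
\text{l2 (ridge):}\quad \min_{w}\ \|Xw - y\|_2^2 + \lambda \|w\|_2^2
\qquad\qquad
\text{l1 (lasso):}\quad \min_{w}\ \|Xw - y\|_2^2 + \lambda \|w\|_1
$$

With $\lambda \ge 0$, the l2 penalty shrinks all weights toward zero, while the l1 penalty tends to set some weights exactly to zero, which helps when there are many features relative to the number of movies.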

Weaknesses: I felt the sample set was very small to begin with. I know the authors mentioned at the beginning that the data was difficult to scrape, and they did evaluate this, but for me to begin considering this for any potential application it would need some more features. Building on this point, there was little mention of combatting overfitting. Given how small the data set is, this could definitely impact how well the results generalize. I would have liked a more in-depth treatment of the results of all the models that were used. The authors mentioned that Huber performed better because of some distinct outliers, which were blockbuster hits. I would have liked to see what would have happened if those few data points had been taken out and the models tested again. Would that have changed the predictions by a lot? That would have provided some nice insights to explore. I am not sure why the page of pairwise plots is there or how much value it adds to the visualizations. The conclusion seemed a bit vague, and I would have liked more detail on the promise of commercial application, since the project is rooted in the commercial side of the movie industry. How would this apply to the real world, and what would need to be changed or explored further if the authors had more time, data, and resources at their disposal?
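
To make the outlier suggestion above concrete, here is a minimal sketch of the experiment I have in mind. It does not use the project's data: the data is synthetic, the "blockbuster" outliers are planted by hand, and the function names, `delta`, and the single-feature setup are all assumptions for illustration. Huber regression is fit here by iteratively reweighted least squares, which may differ from whatever solver the authors used.

```python
import random


def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def ols_line_fit(xs, ys):
    """Ordinary least squares (quadratic loss): all weights equal."""
    return weighted_line_fit(xs, ys, [1.0] * len(xs))


def huber_line_fit(xs, ys, delta=1.0, iters=100):
    """Huber-loss fit via iteratively reweighted least squares.

    Points with |residual| <= delta keep weight 1 (quadratic regime);
    larger residuals get weight delta / |residual| (linear regime).
    """
    a, b = ols_line_fit(xs, ys)
    for _ in range(iters):
        ws = []
        for x, y in zip(xs, ys):
            r = abs(y - (a + b * x))
            ws.append(1.0 if r <= delta else delta / r)
        a, b = weighted_line_fit(xs, ys, ws)
    return a, b


# Synthetic stand-in data (hypothetical, not the project's dataset).
rng = random.Random(0)
TRUE_SLOPE = 2.0
xs = [0.25 * i for i in range(40)]
ys = [1.0 + TRUE_SLOPE * x + rng.gauss(0.0, 0.5) for x in xs]

# Plant three "blockbuster" outliers with a much larger response.
outlier_idx = {37, 38, 39}
for i in outlier_idx:
    ys[i] += 40.0

# The same data with the planted outliers removed.
xs_trim = [x for i, x in enumerate(xs) if i not in outlier_idx]
ys_trim = [y for i, y in enumerate(ys) if i not in outlier_idx]

_, ols_all = ols_line_fit(xs, ys)
_, huber_all = huber_line_fit(xs, ys)
_, ols_trim = ols_line_fit(xs_trim, ys_trim)
_, huber_trim = huber_line_fit(xs_trim, ys_trim)

print(f"slope with outliers:    OLS={ols_all:.3f}  Huber={huber_all:.3f}")
print(f"slope without outliers: OLS={ols_trim:.3f}  Huber={huber_trim:.3f}")
```

In this toy setup the outliers are known by construction. On the real data the authors would need a stated rule for which movies count as blockbusters, and comparing held-out error rather than a single slope would be the more informative version of the same check.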

Overall, it was a good and interesting choice of topic, with a good analysis and application of concepts that we learned in class. It made for a very interesting read, and I enjoyed the writing.