DSCI-310-2024 / data-analysis-review-2024

Submission: Group 13: Laptop Price Predictor Model #13

Open ttimbers opened 3 months ago

ttimbers commented 3 months ago

Submitting authors: Anna Czarnocka, An Zhou, Yuechang Liu, Daniel Lima

Repository: https://github.com/DSCI-310-2024/Group_13-Laptop-market-price-analysis/releases/tag/3.0.0

Abstract/executive summary:

Our project aims to address the question "How can we predict determinants of laptop market prices?" Utilizing the publicly available Kaggle dataset "Laptop Dataset (2024)", we conducted a comprehensive data analysis in Python. Our approach encompassed tasks ranging from data importing to insights sharing, with a focus on establishing replicable and reliable workflows. Employing regression analysis, specifically Ordinary Least Squares (OLS) regression, we explored the relationship between various laptop features and their prices.

The results of our analysis indicate that several laptop features significantly influence prices. These include the laptop's rating, number of cores, number of threads, RAM memory, primary storage capacity, and display resolution. Each of these factors demonstrated a statistically significant impact on laptop prices, as evidenced by their respective coefficients in the regression model.

However, it's important to note that while our model explains a considerable portion (approximately 74.6%) of the variance in laptop prices, the presence of multicollinearity or omitted variable bias cannot be entirely discounted. Further diagnostics may be necessary to thoroughly assess the assumptions and validity of our model.

In conclusion, our study provides valuable insights into the determinants of laptop market prices, offering a foundation for future research and decision-making within the industry.

Editor: @ttimbers

Reviewers: Selena Shew, Saicharanraj Pusuluri, Sophie Yang, Mikel Ibarra Gallardo

selenashew commented 3 months ago

Data analysis review checklist

Reviewer: @selenashew

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Hi Group 13! A huge round of applause for all of the work that your team has put into this project! I was especially blown away by your Quarto report: at 23 pages, it is very clear how thoroughly the results and conclusions were presented, and it should definitely be very helpful for anyone looking into buying a new laptop. However, I do have a few items of note:

1) Uncertainty as to the amount of data cleaning conducted for the project. In your report, you state that the original dataset lists all prices in Indian Rupees. Later in the report, however, the prices are presented in USD. There is no mention of whether the prices were converted prior to conducting the analysis, nor of what conversion factor was used (currency conversions change day to day), which does lend some uncertainty to the validity of the results. Looking at your cleaned training dataset, an HP Envy x360 is listed at $96,590, which makes it unclear whether, or at what point, the prices were actually converted. A small sketch of how the conversion could be documented is included after this list.

2) The methodology changes three times throughout the Quarto report. In the Introduction, you state that your team will be building a KNN regression model. Then, in the Methodology section, you state that you will actually be building a whole variety of different models: decision trees, gradient boosting, random forests (I see someone in your group has also taken CPSC 330/340 LOL). Finally, the results themselves switch gears again and only cover a linear regression model. It might be beneficial to take some time to go over the Quarto report and make sure that the chosen methodology is stated consistently throughout.

3) There may be some issues with using a Kaggle dataset. I remember my group originally looked into that option as well, but after running the idea by the prof, she mentioned she was hesitant to let us use it for reproducibility reasons and that we would have to set up additional GitHub secrets for account login details, etc., although my memory is admittedly fuzzy on this. The entire analysis has already been completed, so there is not much that can be done now (and as long as the teaching team has approved it, that's all that matters!), but it might be worth bringing this up with the prof and double-checking whether you do need to set up additional account details/secrets. A rough sketch of a scripted, credential-based download is also included after this list.
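
Regarding point 1, here is a minimal sketch of how the conversion could be made explicit in the cleaning script. The rate, file paths, and the raw column name "Price" are my own placeholders, not taken from your repository:

```python
import pandas as pd

# Assumed fixed INR -> USD rate; record the source and the date it was pulled.
INR_TO_USD = 0.012  # placeholder value

# Paths and column name below are placeholders for whatever your cleaning script uses.
raw = pd.read_csv("data/raw/laptops.csv")
raw["price_usd"] = raw["Price"] * INR_TO_USD
raw.to_csv("data/cleaned/laptops.csv", index=False)
```

Even a short comment like this in the cleaning script (or a sentence in the report) would remove the ambiguity about when and how the conversion happened.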
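
And for point 3, if the prof does want a scripted download, something along these lines might work. This is only a sketch assuming the kaggle Python package, with credentials supplied through KAGGLE_USERNAME / KAGGLE_KEY environment variables (which GitHub Actions secrets can populate); the dataset slug below is a placeholder:

```python
from pathlib import Path

from kaggle.api.kaggle_api_extended import KaggleApi

DATASET_SLUG = "someuser/laptop-dataset-2024"  # placeholder slug

def download_raw_data(out_dir: str = "data/raw") -> None:
    """Download and unzip the Kaggle dataset into out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    api = KaggleApi()
    api.authenticate()  # reads KAGGLE_USERNAME / KAGGLE_KEY or ~/.kaggle/kaggle.json
    api.dataset_download_files(DATASET_SLUG, path=out_dir, unzip=True)

if __name__ == "__main__":
    download_raw_data()
```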

All in all, great job and best of luck with the final milestone! :)

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

syang8203 commented 2 months ago

Data analysis review checklist

Reviewer: @sophieyang

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2h

Review Comments:

Hi Group 13, overall great job and a very interesting project topic! Under current economic conditions we are seeing both a cost-cutting market and digital-forward trends, so laptop price prediction can be very informative for individuals and corporations alike! In terms of some areas of improvement:

  1. In the reports folder, it would be very helpful to clean up the files so that only the final versions remain. I initially opened "quarto_report" and found that the section "Creating an OLS model with all variables" is broken: in the PDF and HTML files I can see the headers, but there is no text or code. Upon further checking, I realized there is also a "report" PDF and HTML (which I am assuming is the final version) where this section is completed. Having an organized folder would really help clarity from an audience perspective, as I was initially basing the peer review on the "quarto_report" files.
  2. Throughout the analysis process, it's tricky yet super important to make sure we don't violate the golden rule (also a CPSC 330 concept haha!). I sometimes forget too that this also extends to EDA, as we do not want a "sneak peek" into the testing data in any way. Hence, when exploring the data through visualizations such as the price distribution, it is best to first split the data and then visualize train_df only, as opposed to the entire df. This also applies to summaries, e.g. train_df['Price'].describe() as opposed to df['Price'].describe(). A small sketch of this split-before-EDA pattern is included after this list.
  3. In plotting the correlation matrix, it would have been awesome to go the extra mile and also identify potential cases of multicollinearity. If two predictor variables have a high correlation coefficient (such as num_threads and num_cores), it could suggest multicollinearity, which makes it difficult to isolate the individual effects of the explanatory variables on the target. A quick variance inflation factor (VIF) check (second sketch after this list) is one way to quantify this.
  4. I agree with the previous reviewer Selena on the changes in methodology; when reading through the report I found this was an area of slight confusion for the audience, and aligning the communication would really help to elevate the overall analysis.
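
For point 2, here is a minimal sketch of the split-before-EDA pattern. The file path, test size, and random seed are placeholders; only the Price column name comes from your report:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/cleaned_data.csv")  # placeholder path

# Split first, so that no EDA decision is informed by the test set.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)

# All exploration happens on the training split only.
print(train_df["Price"].describe())
train_df["Price"].plot.hist(bins=30, title="Training-set price distribution")
plt.show()
```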
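
And for point 3, a sketch of a VIF check using statsmodels. Apart from num_cores and num_threads, the predictor names (and the training file path) are guesses at what your cleaned data actually contains:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

train_df = pd.read_csv("data/cleaned_data_train.csv")  # placeholder path

# Predictor names other than num_cores / num_threads are placeholders.
predictors = ["num_cores", "num_threads", "ram_memory", "primary_storage"]
X = add_constant(train_df[predictors])

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # values well above ~5-10 are a common flag for multicollinearity
```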

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

SaiUbc commented 2 months ago

Data analysis review checklist

Reviewer: @SaiUbc

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Hey Group 13! Congratulations on a compelling examination of laptop price dynamics in today's fluctuating economic landscape. Your approach to applying digital trend analytics to forecast pricing is commendable and undoubtedly very relevant to consumers and businesses alike. Here are some of my thoughts and suggestions for enhancement:

  1. Data Exploration Nuances: The exploratory data analysis (EDA) section is insightful, but there is an opportunity to refine the technique further. For instance, comparing distributions via histograms or boxplots might yield additional insights that summary statistics alone may miss. This way, outliers or peculiar trends would be more visually evident and could potentially influence the modelling approach. A small plotting sketch is included after this list.

  2. Folder naming: I noticed that the data folder name is all in uppercase. I know it isn't that significant, but considering your script files are in src (lower case), it might look better to have both folders follow the same convention, with only the README and DOCKERFILE being in uppercase.

  3. Test file and data file naming: I noticed that the test file names in your folder all start with test_, which is very clear and helpful in understanding which function you are testing. We did the same initially in our group, but we realized that naming the test scripts function_test.py rather than test_function.py was much quicker when running individual test files while working on testing (assuming multiple people are working on different tests, some of which might pass/fail). Quicker in the sense of tab completion on the terminal, if that makes sense, but that's just a tip we personally found useful. Also, I'm not sure why the CSV data files have csv twice in the name, e.g. cleaned_data.csv_test.csv, which could be renamed to cleaned-data_test.csv to match the course file-naming guidelines.

  4. Test data placement: If you are testing your model (which you are), you need test data, and if you are testing your functions (which you are), you also need test data. Obviously, at a quick glance there could be confusion as to whether test.csv and cleaned_data.csv_test.csv are being used for the same purpose or not. One way to solve it (and potentially minimize the chance of test functions not finding the test files) is to move your function-testing data into the tests folder by creating a subfolder inside tests called data. We are currently implementing this ourselves after the prof said it's okay to do so; a small sketch of how tests can locate such fixtures is also included after this list.
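
For point 1, a minimal sketch of what those distribution plots could look like on the training split. The file path and the brand column are placeholders; only Price appears in the earlier comments:

```python
import matplotlib.pyplot as plt
import pandas as pd

train_df = pd.read_csv("data/cleaned_data_train.csv")  # placeholder path

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: overall shape of the price distribution (skew, heavy tails).
train_df["Price"].plot.hist(bins=30, ax=axes[0], title="Price distribution")

# Boxplots by a categorical feature make outliers and group differences visible.
train_df.boxplot(column="Price", by="brand", ax=axes[1], rot=90)  # "brand" is a placeholder

fig.tight_layout()
plt.show()
```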
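
And for point 4, here is a hedged sketch of how a test file inside tests/ could resolve fixtures from a tests/data/ subfolder regardless of where pytest is invoked. The file name and column name are placeholders:

```python
# tests/test_clean_data.py (hypothetical file name)
from pathlib import Path

import pandas as pd

# Resolve fixtures relative to this test file, not the current working directory.
TEST_DATA_DIR = Path(__file__).parent / "data"

def test_cleaned_test_data_has_price_column():
    df = pd.read_csv(TEST_DATA_DIR / "cleaned-data_test.csv")  # placeholder fixture
    assert "Price" in df.columns
```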

Although I haven't taken CPSC 330, I agree with the previous reviewer Sophie on extending the correlation matrix plot to check for multicollinearity. Overall, it was amazing to review your project, and I wish you the very best of luck for the next milestone and final exams.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.