Closed Jarrus00 closed 1 year ago
Results and Findings
After a full semester of cleaning, planning, analyzing, and implementing algorithms, it is safe to say that while we learned a lot, we could not turn as much of what we learned into working results as we would have liked. Our data was an Amazon co-purchasing text file containing metadata and review information for 548,552 products scraped from Amazon's website. Each product had multiple reviews, which extended the data set to millions of records. While data sets of this size are what we should expect to work with in industry, for many of us it was our first time working with one this large, and it came with many complications, the main one being scalability. The main problem we wanted to take on was implementing a co-purchasing data analytics engine that uses SQL-like queries to examine searchable and non-searchable (indirect) attributes, supports inequality operators, finds the first k entries with minimized evaluation cost, and detects user patterns via supervised learning.
With these goals in mind, we decided to implement three algorithms: collaborative filtering, bi-directional search, and backward search. Of these three, we were only able to get collaborative filtering working and fully integrated into our user interface, and even that had its fair share of challenges. We quickly realized that collaborative filtering on a matrix covering more than 500,000 products takes too much time, so we created randomized training sets that used smaller matrices to complete the filtering. That was not the only problem we ran into: our matrices were too sparse to provide accurate recommendations. We needed customers and products with enough ratings to support a recommendation, or the code returned zero products. To address this, we added another filter requiring every product to have a minimum of three ratings, which decreased the sparsity and ultimately led to viable product predictions.
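The sampling, minimum-ratings filter, and prediction steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy data, the function names (`sample_customers`, `filter_min_ratings`, `predict`), and the choice of user-based cosine similarity are all assumptions, since the issue does not say which collaborative filtering variant was used.

```python
import random
from collections import defaultdict
from math import sqrt

# Hypothetical toy data: (customer_id, product_id, rating on a 1-5 scale).
ratings = [
    ("c1", "p1", 5), ("c1", "p2", 4), ("c1", "p3", 1),
    ("c2", "p1", 4), ("c2", "p2", 5),
    ("c3", "p1", 2), ("c3", "p2", 1), ("c3", "p3", 5),
    ("c4", "p3", 4), ("c4", "p4", 3),
    ("c5", "p4", 2),  # p4 has only two ratings and will be filtered out
]

def sample_customers(rows, n, seed=0):
    """Build a smaller randomized training set by keeping n random customers."""
    rng = random.Random(seed)
    customers = sorted({customer for customer, _, _ in rows})
    keep = set(rng.sample(customers, min(n, len(customers))))
    return [row for row in rows if row[0] in keep]

def filter_min_ratings(rows, min_ratings=3):
    """Keep only rows whose product has at least min_ratings ratings."""
    counts = defaultdict(int)
    for _, product, _ in rows:
        counts[product] += 1
    return [row for row in rows if counts[row[1]] >= min_ratings]

def build_user_matrix(rows):
    """Sparse customer -> {product: rating} mapping."""
    by_user = defaultdict(dict)
    for customer, product, rating in rows:
        by_user[customer][product] = rating
    return dict(by_user)

def cosine(a, b):
    """Cosine similarity over the products both customers rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[p] * b[p] for p in shared)
    norm_a = sqrt(sum(a[p] ** 2 for p in shared))
    norm_b = sqrt(sum(b[p] ** 2 for p in shared))
    return dot / (norm_a * norm_b)

def predict(by_user, customer, product):
    """Similarity-weighted average of other customers' ratings, or None."""
    mine = by_user.get(customer, {})
    weighted, total = 0.0, 0.0
    for other, theirs in by_user.items():
        if other == customer or product not in theirs:
            continue
        sim = cosine(mine, theirs)
        weighted += sim * theirs[product]
        total += sim
    return weighted / total if total > 0 else None

filtered = filter_min_ratings(ratings)      # drops every p4 row
by_user = build_user_matrix(filtered)
pred = predict(by_user, "c2", "p3")         # c2 has not rated p3
print(pred)
```

When a customer shares no rated products with anyone who rated the target product, `predict` returns `None`, which mirrors the "zero products" behavior that the minimum-ratings filter was meant to reduce.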
Collaborative filtering is used on almost every major platform, whether it is Spotify for music recommendations or Booking.com helping users discover interests they never knew existed or that they would like. In the same spirit, we hope to recommend similar products to users based on how other users rated them. Hopefully this will lead to an increase in both customer satisfaction and sales for Amazon. For our user interface, we used Qt Designer to build the skeleton of the UI and Python to implement the logic behind it. As previously mentioned, we wanted to incorporate SQL-like queries to examine searchable and non-searchable attributes throughout our data and find useful information about our users and products. From this we were able to find the average rating for each user, which lets us see whether a particular customer generally rates low or was dissatisfied with a specific product. We can also see that the most popular category reviewed is books, and the least popular category is DVDs. We can dive further into the search and see which products are the most popular in each category, which can help on the supply chain side when predicting how many units to order. We also created two visuals to show our findings: a PDF (probability density function) and a CDF (cumulative distribution function) of the count of ratings per customer. From these two distribution graphs we can see that predictions become more accurate as the count of ratings increases, which agrees with basic statistics: a larger sample size n represents the population more reliably.
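The kinds of queries and distributions described above can be illustrated with a small sketch. It uses Python's standard-library `sqlite3` module as a stand-in for the project's SQL-like engine, which is not described in this issue. The table layout, column names, and sample rows are hypothetical. Because rating counts are discrete, the "PDF" computed here is strictly a probability mass function.

```python
import sqlite3
from collections import Counter

# Hypothetical review rows: (customer, product, category, rating).
rows = [
    ("c1", "p1", "Book", 5), ("c1", "p2", "Book", 4), ("c1", "p3", "Music", 3),
    ("c2", "p1", "Book", 2), ("c2", "p4", "DVD", 1),
    ("c3", "p2", "Book", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (customer TEXT, product TEXT, category TEXT, rating INTEGER)"
)
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", rows)

# Average rating per customer, to spot customers who generally rate low.
avg_by_customer = dict(
    conn.execute("SELECT customer, AVG(rating) FROM reviews GROUP BY customer")
)

# Review counts per category, most reviewed first.
category_counts = conn.execute(
    "SELECT category, COUNT(*) AS n FROM reviews "
    "GROUP BY category ORDER BY n DESC, category"
).fetchall()

# First k rows that satisfy an inequality predicate on the rating.
k = 2
top_k = conn.execute(
    "SELECT product, rating FROM reviews WHERE rating >= ? "
    "ORDER BY rating DESC, product LIMIT ?",
    (4, k),
).fetchall()

# Empirical PDF and CDF of the number of ratings per customer.
ratings_per_customer = Counter(customer for customer, _, _, _ in rows)
histogram = Counter(ratings_per_customer.values())
n_customers = sum(histogram.values())
distribution = []  # (ratings_count, pdf, cdf) triples in increasing count order
cumulative = 0.0
for count in sorted(histogram):
    share = histogram[count] / n_customers
    cumulative += share
    distribution.append((count, share, cumulative))

print(avg_by_customer, category_counts, top_k, distribution, sep="\n")
```

The `LIMIT` clause only shows the "first k entries" idea at the query level. It does not attempt the minimized evaluation cost that the project set as a goal.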
Integrated into final report.
The results from these queries can be applied to many fields of study, and one major application is economics. Not only could Amazon use this information to plan which items and categories to prioritize, but at a deeper level an economist could interpret it as evidence of trends in the spending habits of the everyday consumer. For example, the lowest category in this dataset was video games. This can be largely attributed to the fact that most users purchase digital copies of games directly on their console or PC. Also, if they wish to purchase a physical copy of a game, Amazon does not often sell it directly from the distributor, unlike big box stores such as Target, Walmart, or even GameStop. This can cause prices to rise and drive buyers away. On the other hand, the main category is books, which intuitively makes sense because of Amazon's roots as a book distribution company. Another reason for this, however, is the Kindle service that Amazon owns and maintains. The Kindle device allows users to purchase, rate, and review books anywhere. The Kindle service is not limited to the devices alone, though: it has a mobile app and now a subscription service merged with Audible. Amazon prioritizes its book sales over almost every other sale it makes on its site.
Another field of study this information can apply to is psychology. Users often rated baby furniture and supplies poorly, which can speak volumes about the standard to which we hold goods targeted towards children. Often, parents care much less about their own goods as long as what their child has is of high quality. That being said, given the low ratings and the low number of ratings for children's goods, it would seem that customer service is a large influence on parents' purchasing habits. One could draw the conclusion that a parent has to build up trust in an item before giving it to their child. With an expanded search on children's goods as well as childcare equipment, one could learn about many different habits that parents and caregivers have when caring for children.
Background: We need a segment which describes our findings from analyzing this dataset.
Problem: What did our analyses reveal from this dataset?
Success Criteria: