chrismbryant / amazon-confidence-interval

A browser extension which adds Bayesian visualizations to Amazon ratings.

Higher granularity in scraped ratings #9

Open chrismbryant opened 4 years ago

chrismbryant commented 4 years ago

@aeciorc To compute the confidence score, we'll need the fraction of ratings that came from each star value, ideally as an array like [0.1, 0, 0, 0.4, 0.5], with the values corresponding to the percentage distribution you see when you hover over the star rating (i.e. [1 star, 2 stars, ..., 5 stars]). Can your code be modified to retrieve this info?
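
For illustration, here's a minimal sketch of how such an array could be collapsed into a single "satisfaction" estimate, assuming the 1–5 star values are scaled linearly onto [0, 1] (the same scaling proposed later in this thread):

```js
// Hypothetical sketch: collapse a star-rating distribution into one
// "satisfaction" number by scaling 1-5 stars linearly onto [0, 1].
// dist[i] holds the fraction of ratings with (i + 1) stars.
function satisfactionFromDistribution(dist) {
  const weights = [0.0, 0.25, 0.5, 0.75, 1.0]; // 1 star ... 5 stars
  return dist.reduce((sum, frac, i) => sum + frac * weights[i], 0);
}

console.log(satisfactionFromDistribution([0.1, 0, 0, 0.4, 0.5])); // 0.8
```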

chrismbryant commented 4 years ago

Note:

On hover over each element with class `a-icon a-icon-popover`, a new div is created (and persisted in the browser) with class `a-popover a-popover-no-header a-declarative a-arrow-bottom`. Buried deep within this div is a table (`id="histogramTable"`), which contains a series of rows like `<tr data-reftag data-reviews-state-param='{"filterByStar":"five_star", "pageNumber":"1"}' class="a-histogram-row">`. In each of these rows, there's a link (`class="a-link-normal"`) with a title like "5 stars represent 61% of rating" or "4 stars represent 12% of rating", etc.
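
A rough sketch of how those titles could be parsed once the popover exists in the DOM (the selectors come from the note above; the parsing logic itself is my assumption):

```js
// Sketch: pull the star distribution out of an already-created popover.
// Returns fractions indexed from 1 star (index 0) to 5 stars (index 4).
function scrapeDistribution(popover) {
  const dist = [0, 0, 0, 0, 0];
  const links = popover.querySelectorAll(
    '#histogramTable tr.a-histogram-row a.a-link-normal'
  );
  links.forEach((link) => {
    // Titles look like "5 stars represent 61% of rating".
    const match = /(\d) stars? represent (\d+)% of rating/.exec(link.title);
    if (match) dist[Number(match[1]) - 1] = Number(match[2]) / 100;
  });
  return dist;
}
```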

aeciorc commented 4 years ago

It can be done. However, since the star distributions are fetched individually, we could either: 1) fake the hover / fetch each of them when the page is loaded (at ~30 requests, it will be slow, and Amazon may not tolerate it), or 2) only extract the distribution when the user hovers over a rating, and append the CI to the popover.

What do you think?

chrismbryant commented 4 years ago

Good point.

  1. I think this may still be worth looking into, since I'm not sure how sound the statistics are if we don't have access to the full star distribution. However, I can get a proxy for "satisfaction probability" by scaling star ratings to [0.00, 0.25, 0.50, 0.75, 1.00], so that an average star rating of 4.7 becomes a satisfaction probability of (4.7 - 1)/4 = 0.925 = 92.5%, with a confidence interval determined by the total number of ratings (see the sketch after this list). This way, we'd only need the two pieces of information that are available without hovering: the average rating and the number of ratings. I just don't know whether it's statistically okay to treat that problem like a series of Bernoulli experiments. @michielkosters, any thoughts on that?
  2. Regardless of whether we do this for performance reasons, appending to the popover seems like a good idea if we want to include more interesting statistical visualizations at some point.
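
For concreteness, here is a minimal sketch of that Bernoulli-proxy idea using a Wilson score interval, which is one standard choice; the signature matches the evaluateAverageRating function mentioned later in the thread, but the body below is my assumption, not the repo's actual implementation:

```js
// Sketch: treat the scaled average rating as the mean of numRatings
// Bernoulli trials and compute a 95% Wilson score interval for it.
// Illustration only; not necessarily what the extension ships.
function evaluateAverageRating(avgRating, numRatings, z = 1.96) {
  const p = (avgRating - 1) / 4; // map [1, 5] stars onto [0, 1]
  const n = numRatings;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const margin =
    (z / denom) * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n));
  return { p, lower: center - margin, upper: center + margin };
}

// A 4.7 average over 200 ratings gives p = 0.925 with an interval
// of roughly [0.880, 0.954].
```
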
chrismbryant commented 4 years ago

@aeciorc I added a function `evaluateAverageRating(avgRating, numRatings)` to the `confidence_interval.js` file under the `shared` directory, but I'm not familiar enough with web development to understand how to make that function and stdlib available in the scope of `inject.js`.

aeciorc commented 4 years ago

> Good point.
>
> 1. I think this may still be worth looking into, since I'm not sure how sound the statistics are if we don't have access to the full star distribution. [...]

I gave this a shot; unfortunately, I was blocked for a few minutes because of too many requests. I then tried waiting 0.25 seconds between requests, and that worked, but it makes for a bad user experience. I think we'll have to show the stats only when the user hovers (rough sketch below). Maybe we could include an icon beside the ratings to prompt the user to hover over them?
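
A rough sketch of that hover-triggered approach, reusing the scrapeDistribution sketch from above (the polling and event wiring are assumptions):

```js
// Sketch: compute stats only when the user hovers a rating, then
// append the result to the popover Amazon creates for that rating.
document.querySelectorAll('.a-icon-popover').forEach((star) => {
  star.addEventListener('mouseenter', () => {
    // Amazon builds the popover asynchronously, so poll briefly for it.
    let tries = 0;
    const poll = setInterval(() => {
      const table = document.querySelector('.a-popover #histogramTable');
      if (!table && ++tries < 20) return;
      clearInterval(poll);
      if (!table) return;
      const dist = scrapeDistribution(table.closest('.a-popover'));
      // ...compute the interval from dist and append it to the popover.
    }, 100);
  }, { once: true });
});
```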

aeciorc commented 4 years ago

> @aeciorc I added a function `evaluateAverageRating(avgRating, numRatings)` to the `confidence_interval.js` file under the `shared` directory, but I'm not familiar enough with web development to understand how to make that function and stdlib available in the scope of `inject.js`.

See my last commit; you need to include it in manifest.json. I added confidence_interval.js, and I could do the same for stdlib.js, but we'd have to download it and keep it in the repo, which is kind of annoying. I'll see if there's a better way to do that.
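
For reference, the mechanism here is the content_scripts entry in manifest.json; the paths and match pattern below are placeholders, not copied from the repo:

```json
{
  "content_scripts": [
    {
      "matches": ["*://www.amazon.com/*"],
      "js": ["shared/confidence_interval.js", "inject.js"]
    }
  ]
}
```

Files in the `js` array are injected in order into the same isolated world, so anything defined in confidence_interval.js is visible to inject.js; a vendored stdlib.js would just be listed before both.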

chrismbryant commented 4 years ago

@aeciorc Got it. It's good to know that we're not going to be able to get all the popover histograms at once. But I think we should be able to produce some reasonable stats using just the average rating and the number of ratings, as long as we can figure out a way to get stdlib in there.

musicin3d commented 4 years ago

Instead of doing all items upfront, could we scrape/calculate/add as they are scrolled into view?
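
One way to do that (an illustration of the idea, not committed code) is an IntersectionObserver that handles each rating the first time it scrolls into view; annotateRating here is a hypothetical scrape-and-render step:

```js
// Sketch: lazily process ratings as they scroll into view instead of
// fetching every product's histogram up front.
const observer = new IntersectionObserver((entries) => {
  entries.forEach((entry) => {
    if (!entry.isIntersecting) return;
    observer.unobserve(entry.target); // handle each rating only once
    annotateRating(entry.target);     // hypothetical scrape-and-render step
  });
});

document
  .querySelectorAll('.a-icon-popover')
  .forEach((el) => observer.observe(el));
```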

aeciorc commented 4 years ago

> Instead of doing all items upfront, could we scrape/calculate/add as they are scrolled into view?

Oh, that's a good idea. Do you want to give that a try?

musicin3d commented 4 years ago

Yep! I started working on it last night and got sidetracked. I'll try to push something up tonight.


musicin3d commented 4 years ago

I worked on the "how do we include dependencies" issue before "when do we calculate things" because it's more fundamental.

aeciorc commented 4 years ago

👍 nice, I'll check it out tomorrow morning

musicin3d commented 4 years ago

I'm noticing that the two methods of calculation (distribution vs. average) produce nearly identical results, at least to my eye. I assume the distribution-based method should be more accurate, so how should I interpret the cases where the results differ for a given product?

Also, building on this observation: it may be a good idea to populate the page with the results of the average-based calculation right away, then replace them on scroll with the distribution-based results. For users on a slow connection, this gives immediately usable info that improves as the more accurate results become available. In most cases, they won't see anything happen; in some cases, there will be a small jump as the graph updates. We could use a slight fading effect to make that feel intuitive (sketch below).
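
A sketch of that progressive upgrade; renderInterval, fetchDistribution, and evaluateDistribution are hypothetical helpers, and evaluateAverageRating is the function discussed earlier in the thread:

```js
// Sketch: show the fast average-based estimate immediately, then fade
// in the distribution-based estimate once it has been scraped.
async function upgradeEstimate(container, avgRating, numRatings) {
  renderInterval(container, evaluateAverageRating(avgRating, numRatings));
  const dist = await fetchDistribution(container); // e.g. on scroll-into-view
  container.style.transition = 'opacity 0.2s';
  container.style.opacity = '0.3';
  setTimeout(() => {
    renderInterval(container, evaluateDistribution(dist));
    container.style.opacity = '1';
  }, 200);
}
```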