anyscale / academy

Ray tutorials from Anyscale
https://anyscale.com
Apache License 2.0

calculate the distance between user's ratings to cluster's centers #35

Open YeziPeter opened 3 years ago

YeziPeter commented 3 years ago

In ray-rllib/recsys/01-Ressys, I found what may be a problem in how the distance between a user's ratings and the cluster centers is calculated. It is in the step function of the env class JokeRec:

scaled_diff = abs(c[item] - rating) / 2.0

The shape of c (which is centers[i]) is 1 * 24983, representing the features of cluster i. However, item is chosen randomly from the cluster, and its range is [0, 99]. The remaining indices [100, 24983] in centers[i] can never be accessed. Is c[item] - rating a correct way to calculate that distance?

ceteri commented 3 years ago

Thanks for pointing out that code, @YeziPeter. It needs more comments describing what happens at that point. I had to dig through to find an answer for that one :)

In the JokeRec.load_data() method, where the data gets loaded, these ratings are scaled in advance. The raw data ranges over [-10, 10], but the scaled data ranges over [-1.0, 1.0].
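As a rough illustration of that scaling step, here is a minimal sketch that maps raw ratings from [-10, 10] onto [-1.0, 1.0] by dividing by 10. The function name scale_ratings is hypothetical, and the actual load_data() method may do this differently (for example, in how it treats unrated entries).

```python
import numpy as np

def scale_ratings(raw):
    """Linearly map raw ratings in [-10, 10] onto [-1.0, 1.0].

    Hypothetical helper for illustration; not taken from the tutorial code.
    """
    return np.asarray(raw, dtype=float) / 10.0

# A few made-up raw ratings covering the full range.
scaled = scale_ratings([-10, -5, 0, 7.5, 10])
print(scaled)
```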

Then these scaled rating values get used as the sample data for the clustering.

The c[item] value is the cluster center for ratings of a particular item (an individual joke), not a number of users who've rated an item. This is scaled the same way as the rating values.
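To make the indexing concrete, here is a small sketch of the scaled_diff expression from the question, assuming centers is laid out with one row per cluster and one column per item, so that c[item] is a single scaled value. The array sizes and random data are made up for illustration and are not taken from the tutorial. Because both values lie in [-1.0, 1.0], their absolute difference is at most 2.0, so dividing by 2.0 keeps the result in [0.0, 1.0].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 clusters, 100 items (jokes).
n_clusters, n_items = 3, 100

# Assumed layout: centers[i][item] is cluster i's center for that item,
# on the same [-1.0, 1.0] scale as the ratings.
centers = rng.uniform(-1.0, 1.0, size=(n_clusters, n_items))

def scaled_diff(c, item, rating):
    """Distance between one scaled rating and one cluster's center for
    the same item, normalized to [0.0, 1.0]."""
    return abs(c[item] - rating) / 2.0

item = 42      # index of the joke being rated
rating = 0.5   # the user's scaled rating for that joke

# One distance per cluster, each comparing like with like (same item).
diffs = [scaled_diff(c, item, rating) for c in centers]
print(diffs)
```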

Does that help?