eriq-augustine opened 7 years ago
@dhawaljoh These initial findings may be interesting to you.
A rough weight learning run with just the subset of the ground truth gives these weights:

```
STARS                  = 1.5
TOTAL_REVIEW_COUNT     = 0.0
AVAILABLE_REVIEW_COUNT = 1.0
MEAN_REVIEW_LEN        = 2.0
MEAN_WORD_LEN          = 2.0
NUM_WORDS              = 0.0
MEAN_WORD_COUNT        = 0.0
TOTAL_HOURS            = 0.5
ATTRIBUTES             = 2.0
CATEGORIES             = 2.0
TOP_WORDS              = 1.0
KEY_WORDS              = 1.0
OPEN_HOURS             = 0.0
```
Rand Index = 0.966339
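For reference, the Rand Index used here is just the fraction of item pairs on which two clusterings agree. A minimal sketch (the function name and toy labels are my own, not from this repo):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree:
    both put the pair in the same cluster, or both put it in
    different clusters."""
    agree = 0
    pairs = list(combinations(range(len(labels_a)), 2))
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)

# Toy example (hypothetical labels, not the actual ground truth):
truth = [0, 0, 1, 1]
predicted = [1, 1, 0, 0]
print(rand_index(truth, predicted))  # 1.0 -- same partition, just relabeled
```

Note that the index is invariant to cluster label names, which is why a fully relabeled but identical partition still scores 1.0.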
Recall that the weight is just multiplied by the normalized distance score, so a high weight does not necessarily mean the feature is important.
Maybe we need non-linear weights, or maybe we should just invert the weights.
It just seems a little strange, since a weight of 0 means "don't use the feature", but the closer the distance is to 0, the more similar two items are.
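To make the oddity concrete, here is a sketch of the current scheme as I understand it (function name and the assumption that per-feature distances are already normalized to [0, 1] are mine):

```python
def weighted_distance(features_a, features_b, weights):
    """Sum of per-feature distances, each scaled by its weight.
    A weight of 0 drops the feature entirely, while a LARGE weight
    inflates the distance on that feature -- pushing the items apart
    rather than marking the feature as 'important'."""
    total = 0.0
    for a, b, w in zip(features_a, features_b, weights):
        total += w * abs(a - b)  # assumes features already normalized to [0, 1]
    return total

# A zero weight ignores the first feature; the second dominates:
print(weighted_distance([0.5, 0.2], [0.1, 0.9], [0.0, 2.0]))
```

So a learned weight of 2.0 on a feature could mean "differences here matter a lot", or it could just be the learner stretching that dimension, which is why high does not cleanly map to important.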
@dhawaljoh
Oh shit, inverse weights do much better.
```
STARS                  = 1.0
TOTAL_REVIEW_COUNT     = 0.0
AVAILABLE_REVIEW_COUNT = 2.0
MEAN_REVIEW_LEN        = 0.0
MEAN_WORD_LEN          = 0.5
NUM_WORDS              = 0.0
MEAN_WORD_COUNT        = 0.0
TOTAL_HOURS            = 0.0
ATTRIBUTES             = 1.0
CATEGORIES             = 0.5
TOP_WORDS              = 1.0
KEY_WORDS              = 1.5
OPEN_HOURS             = 0.0
```
Rand Index = 0.995792
Keep in mind that these numbers are "weights" in the new sense: (1/w) * distance is what is now computed. Also keep in mind that this is just on a small, arbitrary (but not random) subset of the ground truth.
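The inverse scheme can be sketched the same way (again my own function name; I'm assuming a weight of 0 still means "ignore the feature", since 1/0 is undefined):

```python
def inverse_weighted_distance(features_a, features_b, weights):
    """Inverse scheme: each per-feature distance is scaled by 1/w,
    so a LARGER weight now means the feature contributes LESS to the
    total distance. Under this scheme, big weights make a feature more
    forgiving instead of pushing items apart."""
    total = 0.0
    for a, b, w in zip(features_a, features_b, weights):
        if w == 0.0:
            continue  # assumption: zero still means "drop this feature"
        total += abs(a - b) / w  # assumes distances normalized to [0, 1]
    return total

# Doubling a weight halves that feature's contribution:
print(inverse_weighted_distance([0.5], [0.1], [2.0]))
```

This at least resolves the earlier oddity: weights now move in the same direction as similarity, since a large weight shrinks the distance on that feature.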
Learn some feature weights.