Closed seo-95 closed 4 years ago
The filtering from 33 down to 7 attributes is only a design choice the organizers made for the baseline they published. The challenge itself is about predicting the whole set of 33 attribute values.
Adding the clarification from a private thread here (for completeness):
Picking 7 out of the 33 possible attributes is a modeling choice the baselines make due to the distribution of the data. The remaining attributes are mapped to "other" for ease of modeling. However, evaluation should not take this relaxation into account, as doing so would penalize models that can potentially identify all 33 attributes. Because of this choice, the baselines will take a hit in performance, trading accuracy for simplicity. Hence, I did not restrict the evaluation to these 7 choices.
The baselines are trained to predict attributes for each action as a multilabel prediction problem, where each attribute can take a value from a set of 7 possible outcomes: {"availableSizes", "price", "brand", "customerRating", "info", "color"} + "others".
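To make the multilabel setup concrete, here is a minimal sketch of how a target vector over the 7 outcomes could be built. The function and variable names are illustrative only and are not taken from the actual challenge codebase:

```python
# Illustrative sketch of the baselines' multilabel target construction.
# The 6 named attributes plus "other" give the 7 possible outcomes.

ATTRIBUTE_SET = ["availableSizes", "price", "brand", "customerRating", "info", "color"]
OUTCOMES = ATTRIBUTE_SET + ["other"]  # 7 possible outcomes in total

def to_multilabel(attributes):
    """Map a list of raw attribute names to a 7-dim binary target vector.

    Any attribute outside the chosen subset collapses into "other".
    """
    mapped = {a if a in ATTRIBUTE_SET else "other" for a in attributes}
    return [1 if o in mapped else 0 for o in OUTCOMES]

# "embellishment" is not in the subset, so it folds into "other":
print(to_multilabel(["price", "embellishment"]))  # [0, 1, 0, 0, 0, 0, 1]
```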
The fashion dataset, however, contains 33 distinct attribute values. During training, all attributes of the training set not included in the desired subset are replaced with the value "others". This mapping is not applied in the scorer, though, resulting in a comparison between 33 possible ground-truth values and only 7 values that the model can actually predict.
This pull request applies the attribute value filtering to the ground-truth labels as well.
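The fix can be sketched as follows: before scoring, collapse any ground-truth attribute outside the 7-value subset into "other", so both predictions and references live in the same label space. The names below are hypothetical, not the actual scorer's API:

```python
# Sketch of the filtering this PR describes: apply the same "other" mapping
# to the ground-truth labels that the baselines already apply at training time.

ATTRIBUTE_SET = {"availableSizes", "price", "brand", "customerRating", "info", "color"}

def filter_ground_truth(labels):
    """Collapse any of the 33 attribute values outside the subset to 'other'."""
    return [label if label in ATTRIBUTE_SET else "other" for label in labels]

gt = ["color", "embellishment", "pattern"]
print(filter_ground_truth(gt))  # ['color', 'other', 'other']
```

With this relaxation applied to both sides, the scorer no longer penalizes the baselines for outcomes they were never trained to produce.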