Remember that we have 1 million questions with 42k tags, and training on this much data would be slow and difficult. So I decided to consider a smaller subset of tags. Let C be the full set of 42k tags and C1 a subset of C. To find a small subset C1 that still covers most of the questions, we can use the tag counts that we plotted earlier. Since we know how many times each tag occurs, taking the most frequently occurring tags lets us cover the largest number of questions. After trying several values, I found that the top 5500 most frequently occurring tags cover almost 99% of the questions.
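To make the idea concrete, here is a minimal sketch of how such a coverage check could look. It assumes a question counts as "covered" when at least one of its tags is in the chosen subset, and the names `top_tags`, `coverage`, and `toy_questions` are hypothetical, made up for this illustration rather than taken from the actual notebook.

```python
from collections import Counter

def top_tags(tag_lists, n):
    """Return the set of the n most frequently occurring tags."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return {tag for tag, _ in counts.most_common(n)}

def coverage(tag_lists, n):
    """Fraction of questions that keep at least one tag
    when only the top-n tags are retained (assumed definition)."""
    if not tag_lists:
        return 0.0
    keep = top_tags(tag_lists, n)
    covered = sum(1 for tags in tag_lists if any(t in keep for t in tags))
    return covered / len(tag_lists)

# Toy data standing in for the real questions: one list of tags per question.
toy_questions = [
    ["python", "pandas"],
    ["python"],
    ["java"],
    ["python", "java"],
    ["haskell"],
]

# Try a few subset sizes and see how coverage grows.
for n in (1, 2, 4):
    print(n, coverage(toy_questions, n))
```

On the real data, the same loop could be run over candidate sizes (for example 500, 1000, 5500) to see where coverage levels off.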