Data Preparation

Technocolabs100 commented 3 years ago

Remember, we have 1 million questions with 42k tags, and training on this amount of data will be slow and difficult. So I decided to consider a small subset of tags. Let's say C = {42k tags} and C1 is a subset of C. To find the smallest subset C1 that still covers most questions, we can use the tag counts that we plotted earlier. We know how many times each tag occurs, so by keeping only the most frequently occurring tags we can cover the maximum number of questions with the fewest tags. After checking many values, I found that the top 5500 most frequently occurring tags cover almost 99% of the questions.
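The coverage check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`tag_counts`, `coverage`, `smallest_k`) and the toy data are hypothetical, and it assumes a question counts as "covered" when at least one of its tags is in the kept subset C1.

```python
from collections import Counter


def tag_counts(question_tags):
    """Count how many questions each tag appears in."""
    counts = Counter()
    for tags in question_tags:
        counts.update(set(tags))  # set() so a repeated tag counts once per question
    return counts


def coverage(question_tags, top_k):
    """Fraction of questions that keep at least one tag from the top_k tags.

    Assumption: "covered" means the question has at least one tag in C1.
    """
    if not question_tags:
        raise ValueError("question_tags must not be empty")
    counts = tag_counts(question_tags)
    kept = {tag for tag, _ in counts.most_common(top_k)}  # candidate subset C1
    covered = sum(1 for tags in question_tags if kept.intersection(tags))
    return covered / len(question_tags)


def smallest_k(question_tags, target):
    """Smallest number of top tags whose coverage reaches the target fraction."""
    n_tags = len(tag_counts(question_tags))
    for k in range(1, n_tags + 1):
        if coverage(question_tags, k) >= target:
            return k
    return n_tags


# Toy data standing in for the real question/tag lists.
questions = [
    ["python", "pandas"],
    ["python"],
    ["python", "java"],
    ["java"],
    ["rust"],
]
print(coverage(questions, 1))
print(smallest_k(questions, 0.8))
```

On the real dataset one would call something like `coverage(all_question_tags, 5500)` to check the figure quoted above. The linear scan in `smallest_k` recomputes coverage for every k, which is fine for a sketch but would be worth replacing with a binary search or a handful of hand-picked values when there are 42k tags.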

A-kriti commented 3 years ago

Hello @Technocolabs100, I would like to contribute to this issue as a GSSOC'21 participant. Could you please assign it to me?

Technocolabs100 commented 3 years ago

OK, I will do that.

Technocolabs100 / Stack-Overflow-Tag-Predictions

Data Preparation #17