Technocolabs100 / Stack-Overflow-Tag-Predictions

Tag Prediction from Stack Overflow Questions

Data Preprocessing #15

Open Technocolabs100 opened 3 years ago

Technocolabs100 commented 3 years ago

Follow the steps below to preprocess the data:

i. Sample 1M data points because of computing and memory limitations.
ii. Separate code snippets from the Body.
iii. Remove special characters from the question title and description (not from the code).
iv. Remove stop words (except ‘C’).
v. Remove HTML tags using regular expressions.
vi. Convert all characters to lowercase.
vii. Use SnowballStemmer to stem the words.

Example questions after preprocessing can be found below.
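The text-cleaning steps above (ii to vii) could be sketched roughly as follows. This is only an illustration, not the project's actual code. It makes several assumptions: code snippets are wrapped in `<code>` tags, NLTK provides the SnowballStemmer, `STOP_WORDS` is a tiny hypothetical list standing in for a full one, and `#` and `+` are kept so that tokens such as `c#` and `c++` survive. The function names `split_code` and `clean_text` are made up for the example. HTML tags are stripped before special characters here, since removing special characters first would also remove the angle brackets the tag pattern relies on. Sampling (step i) is left out.

```python
import re

try:
    # SnowballStemmer is named in the issue; NLTK as its source is an assumption.
    from nltk.stem.snowball import SnowballStemmer
    _default_stem = SnowballStemmer("english").stem
except ImportError:
    # Fallback so the sketch still runs without NLTK: no stemming at all.
    _default_stem = lambda word: word

# Tiny illustrative stop-word list (hypothetical); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "how", "do", "i",
              "of", "and", "but", "it"}

CODE_RE = re.compile(r"<code>(.*?)</code>", flags=re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def split_code(body):
    """Return (text_without_code, list_of_code_snippets)."""
    snippets = CODE_RE.findall(body)
    text = CODE_RE.sub(" ", body)
    return text, snippets


def clean_text(text, stem=_default_stem):
    """Strip tags, lowercase, drop special characters and stop words, then stem."""
    text = TAG_RE.sub(" ", text)                # remove HTML tags
    text = text.lower()                         # convert to lowercase
    text = re.sub(r"[^a-z0-9#+]+", " ", text)   # remove special characters
    # Drop stop words, but always keep the single letter 'c' (the language name).
    words = [w for w in text.split() if w == "c" or w not in STOP_WORDS]
    return " ".join(stem(w) for w in words)


if __name__ == "__main__":
    body = "<p>I tried <code>qsort(a, n)</code> but it fails.</p>"
    text, snippets = split_code(body)
    print(snippets)
    print(clean_text("How to sort a list in C?"))
    print(clean_text(text))
```

Passing the stemmer as a parameter keeps the cleaning logic easy to check on its own, for example by calling `clean_text(title, stem=lambda w: w)` to see the result before stemming.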

Then create a new database called ‘Processed.db’ and load the preprocessed data into it.
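Loading the preprocessed rows into the new database might look something like the sketch below, which uses Python's built-in `sqlite3` module. Only the file name `Processed.db` comes from the issue. The table name `QuestionsProcessed`, its three columns, and the helper `save_processed` are hypothetical and would need to match whatever schema the project settles on.

```python
import sqlite3


def save_processed(rows, db_path="Processed.db"):
    """Write (question, code, tags) tuples into a SQLite database file.

    The table name and columns are assumptions made for this example.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an error is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS QuestionsProcessed ("
                "question TEXT, code TEXT, tags TEXT)"
            )
            conn.executemany(
                "INSERT INTO QuestionsProcessed (question, code, tags) "
                "VALUES (?, ?, ?)",
                rows,
            )
    finally:
        conn.close()


if __name__ == "__main__":
    save_processed([("sort list c", "qsort(a, n)", "c sorting")])
```

Using `executemany` with `?` placeholders avoids building SQL strings by hand, and inserting in batches this way should be fast enough for a sample of around 1M rows.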

dethebera commented 3 years ago

Would like to work on this @Technocolabs100. Please assign it to me if possible. 😊👍🏻

Abhisheka394 commented 3 years ago

Can you assign this issue to me? I'm a GSSOC21 participant.

Technocolabs100 commented 3 years ago

I need to check your previous one first, then I'll go ahead with this new issue.

dethebera commented 3 years ago

Thanks for assigning this. I will try to get it done ASAP. I need to read through the data and figure out certain parts of it for preprocessing. 😄👍🏻