Hadrien-Cornier closed 4 years ago
When we deploy this, I will modify the denominator inside the log so that tokens that are equally common across the crypto and non-crypto classes get a score of 0 and the most discriminative terms get a higher score. That is, instead of the usual ratio (numerator = number of documents, denominator = number of documents in which a word appears), we will use a measure based on entropy or the Gini coefficient.
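One way such an entropy-based weight could look is sketched below. This is only an illustration of the idea, not the code in this PR: the function name `entropy_weights`, the class labels, and the choice of `1 - normalised entropy` as the score are all assumptions. A token that appears at the same rate in every class gets 0, and a token that appears in only one class gets 1.

```python
import math
from collections import Counter


def entropy_weights(docs, labels):
    """Hypothetical helper: score each token by how unevenly it is spread over classes.

    docs   -- list of token lists, one per document
    labels -- parallel list of class names (e.g. "crypto" / "noncrypto")
    """
    class_sizes = Counter(labels)
    n_classes = len(class_sizes)

    # doc_freq[token][label] = number of documents of that class containing the token
    doc_freq = {}
    for tokens, label in zip(docs, labels):
        for tok in set(tokens):
            doc_freq.setdefault(tok, Counter())[label] += 1

    weights = {}
    for tok, per_class in doc_freq.items():
        # Per-class document rate, so that unbalanced class sizes do not bias the score.
        rates = [per_class[c] / class_sizes[c] for c in class_sizes]
        total = sum(rates)
        probs = [r / total for r in rates if r > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        # Assumption: 1 - normalised entropy, so evenly spread tokens score 0.
        weights[tok] = 1.0 - entropy / math.log(n_classes) if n_classes > 1 else 0.0
    return weights


docs = [["aes", "for"], ["sha", "for"], ["for", "print"], ["for", "print"]]
labels = ["crypto", "crypto", "noncrypto", "noncrypto"]
weights = entropy_weights(docs, labels)
```

Here `for` occurs in every document of both classes and ends up with a weight of 0, while `aes` and `print` each occur in a single class and get the maximum weight. A Gini-based variant would only change the line that computes `entropy`.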
Tokenises the code and computes TF-IDF. Possible improvement: overweight function names.
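A minimal sketch of what tokenising plus TF-IDF with overweighted function names might look like is given below. It is not the implementation in this PR: the regexes, the names `term_counts` and `tfidf`, and the `function_boost` parameter are hypothetical, and "function name" is approximated as any identifier directly followed by `(`, which also catches keywords such as `if (` in C-like code.

```python
import math
import re
from collections import Counter

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
CALL = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def term_counts(source, function_boost=3.0):
    """Count identifier tokens, multiplying the count of call-like names."""
    counts = Counter(IDENT.findall(source))
    # Assumption: an identifier followed by "(" is treated as a function name.
    for name in set(CALL.findall(source)):
        counts[name] *= function_boost
    return counts


def tfidf(sources, function_boost=3.0):
    """Return one {token: tf-idf score} dict per source string."""
    per_doc = [term_counts(s, function_boost) for s in sources]
    n_docs = len(per_doc)
    # Number of documents in which each token appears.
    doc_freq = Counter(tok for counts in per_doc for tok in counts)
    vectors = []
    for counts in per_doc:
        total = sum(counts.values())
        vectors.append({
            tok: (c / total) * math.log(n_docs / doc_freq[tok])
            for tok, c in counts.items()
        })
    return vectors


sources = ["x = AES_encrypt(key, x)", "x = print(x)"]
boosted = tfidf(sources)
plain = tfidf(sources, function_boost=1.0)
```

With the boost, `AES_encrypt` outranks `key` in the first snippet even though both occur once and both are unique to that snippet; with `function_boost=1.0` they tie. `x` appears in both snippets, so its IDF, and therefore its score, is 0.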