simple clean dataframe of tfidf on all files within a directory

arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code


simple clean dataframe of tfidf on all files within a directory #12

Closed by Hadrien-Cornier 4 years ago

Hadrien-Cornier commented 4 years ago

Tokenises the code and computes TF-IDF. Possible improvement: overweight function names.
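For readers who want a concrete picture of the two steps named above, here is a minimal sketch using only the standard library. It is not the code from this PR: the function names `tokenize` and `tfidf_for_directory`, the identifier-only regex, and the non-recursive directory scan are all assumptions made for illustration. It uses the textbook weighting, term frequency times `log(N / df)`.

```python
import math
import re
from collections import Counter
from pathlib import Path

# Assumption: a "token" is a C-style identifier or keyword; numbers and
# punctuation are ignored. The real tokeniser in the PR may differ.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*")


def tokenize(source):
    """Split source text into identifier-like tokens."""
    return TOKEN_RE.findall(source)


def tfidf_for_directory(directory):
    """Return {file name: {token: tf-idf}} for each file directly in `directory`."""
    # Token counts per file (hypothetical choice: no recursion into subfolders).
    counts = {
        path.name: Counter(tokenize(path.read_text(errors="ignore")))
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }
    n_docs = len(counts)
    # Document frequency: number of files in which each token appears.
    doc_freq = Counter(token for c in counts.values() for token in c)

    table = {}
    for name, c in counts.items():
        total = sum(c.values())
        table[name] = {
            token: (n / total) * math.log(n_docs / doc_freq[token])
            for token, n in c.items()
        }
    return table
```

The nested dictionary has one entry per file and one inner entry per token, so it can be handed to `pandas.DataFrame.from_dict(table, orient="index")` to get a dataframe with one row per file, with tokens missing from a file showing up as `NaN`.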

Hadrien-Cornier commented 4 years ago

When we deploy this, I will modify the term inside the log so that tokens that are equally common across the crypto and non-crypto classes get a score of 0 and the most discriminative terms get a higher score. That is, instead of using numerator = number of documents and denominator = number of documents in which a word appears, we will use a measure such as entropy or the Gini coefficient.
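One possible reading of this idea, sketched under stated assumptions: replace the IDF factor with `1 - H(p)`, where `H` is the binary entropy and `p` is the share of a token's per-class document rate that falls in the crypto class. The comment does not fix a formula, so the choice of entropy over Gini, the normalisation by class size, and the names `binary_entropy` and `discriminative_weight` are all hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def discriminative_weight(df_crypto, n_crypto, df_noncrypto, n_noncrypto):
    """Weight in [0, 1]: 0 if the token is equally frequent in both classes,
    1 if it appears in only one class.

    df_* is the number of documents of that class containing the token,
    n_* is the total number of documents of that class (assumed > 0).
    """
    # Per-class rates, so that unequal class sizes do not bias the weight.
    rate_crypto = df_crypto / n_crypto
    rate_noncrypto = df_noncrypto / n_noncrypto
    if rate_crypto + rate_noncrypto == 0:
        return 0.0  # token never seen: nothing to discriminate on
    p = rate_crypto / (rate_crypto + rate_noncrypto)
    return 1.0 - binary_entropy(p)
```

With this weighting, a token found in half of the files of each class scores 0, while a token found only in crypto files scores 1, regardless of how many files contain it.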