Mrugankakarte / Toxic-Comments-Classifier

Kaggle Competition
GNU General Public License v3.0

Tokenizer object has no attribute 'oov_token' #1

Closed: dhauser18 closed this issue 5 years ago

dhauser18 commented 5 years ago

Hi. I like the simplicity of your code; it looks great. I have a question: when I ran your code, it gave me this error: 'Tokenizer' object has no attribute 'oov_token'. Any suggestions for what I need to change in the R code to fix this? Also, can you share the code you created that generated the original Keras models? I would like to learn from how you did it, since your descriptions look interesting. Thanks.

Mrugankakarte commented 5 years ago

Hi,

The problem is that the error appears when a word is not present in the vocabulary; otherwise it works. For example, with the word 'master': typing 'mast' or 'master' works, but 'maste' throws the error.

Thanks for bringing up this issue. The parameter 'oov_token' was added in Keras version 2.1.3 (source file), and setting it to NULL should solve the problem, but for some reason it is not working. I will train the model again and let you know if that resolves the error.
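
For reference, here is a minimal way to reproduce it, assuming you load the tokenizer shipped in Data/keras_text_tokenizer:

library(keras)
tokenizer <- load_text_tokenizer(filename = "Data/keras_text_tokenizer")
texts_to_sequences(tokenizer, list("master"))   # in the vocabulary: works
texts_to_sequences(tokenizer, list("maste"))    # out of vocabulary: throws the oov_token error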

Added the file which generates the keras model. Please let me know if you need any help.

Mrugankakarte commented 5 years ago

Hi,

Just add these two lines in Prediction.R after the tokenizer is loaded and it should get rid of the error. I have updated the files, so you can just download and run.

tokenizer <- load_text_tokenizer(filename = "Data/keras_text_tokenizer")
tokenizer$oov_token <- '<unk>'              # register an out-of-vocabulary token
fit_text_tokenizer(tokenizer, c('<unk>'))   # add it to the fitted vocabulary
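
As a quick sanity check (hypothetical call, using the same tokenizer as above), an out-of-vocabulary word should now map to the index of '<unk>' instead of erroring:

texts_to_sequences(tokenizer, list("maste"))   # returns the '<unk>' index instead of an error
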
dhauser18 commented 5 years ago

Thank you very much. The tokenizer$oov_token <- '<unk>' line was the key to getting this to run. I have another question for you. The R Shiny app now runs, but when I populate the inputText field on Shiny with toxic, severely toxic, or other flagged content, the resulting predicted probabilities are extremely low; I have not found any example of a comment that gets a high predicted value. Have I implemented this incorrectly, or has the model changed somehow? Thanks so much!! Your work is tremendous and insightful, and it helps me understand a lot better. Thank you very much!!

Mrugankakarte commented 5 years ago

That's strange! The model cannot change, since you are just loading the trained weights. I just downloaded the folder and ran the code, and it's working perfectly fine for me. Can you please tell me how exactly you are executing the code?

dhauser18 commented 5 years ago

Hi. Thanks again for helping me figure this out. Given the oov_token error, you suggested I add this code right after the tokenizer is loaded:

tokenizer$oov_token <- '<unk>'
fit_text_tokenizer(tokenizer, c('<unk>'))

This allows the system to work without generating the errors I had before. However, according to some of your notes on GitHub (in the Shiny example), for the text "I will kill you" the probability of toxic is 0.89151 and of threat is 0.9761. My numbers differ:

comment_preds("I will kill you") toxic severe_toxic obscene threat insult identity_hate [1,] 0.02817484 0.006310252 0.007882276 0.008860963 0.01063919 0.009279957

You show another example (the text "Why are you so stupid?") that gives high probabilities for toxic and insult, but the results I get are:

comment_preds("Why are you so stupid?") toxic severe_toxic obscene threat insult identity_hate [1,] 0.02124369 0.004242653 0.005663843 0.006379691 0.006861798 0.006401295

So I then tried to run your NN code to recreate the model and see if that is the issue. But I could not determine which lines should be included (many lines of code have # in front of them), and I do not have the various GloVe files it is calling, so I could not figure out how to rerun the models from the beginning. Can you suggest anything? (Below is the code as I have it for the prediction with the models.) Thank you very much.

# Load the model Mrugank Akarte made (posted on GitHub)

model <- load_model_hdf5("D:/Users/750001540/Desktop/Toxic Comments/Toxic Comments v1/Shiny-app/Data/keras_model.h5", compile = F)

load_model_weights_hdf5(model, "D:/Users/750001540/Desktop/Toxic Comments/Toxic Comments v1/Shiny-app/Data/keras_model_weights.h5")

tokenizer <- load_text_tokenizer(filename = "D:/Users/750001540/Desktop/Toxic Comments/Toxic Comments v1/Shiny-app/Data/keras_text_tokenizer")

tokenizer$oov_token <- '<unk>'
fit_text_tokenizer(tokenizer, c('<unk>'))

# Options for the prediction

maxlen = 150
label = c('toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate')

# Prediction function

comment_preds <- function(x){
  preds <- data.frame()
  if(x == "") {
    preds <- data.frame(toxic = 0, severe_toxic = 0, obscene = 0,
                        insult = 0, threat = 0, identity_hate = 0)
  } else {
    X_test <- texts_to_sequences(tokenizer, list(x))
    X_test <- pad_sequences(X_test, maxlen = maxlen, padding = "post",
                            value = 0, truncating = "post")
    preds <- model %>% predict(X_test)
    colnames(preds) <- label
    return(preds)
  }
}

comment_preds("I will kill you") comment_preds("Why are you so stupid?") comment_preds("Hello, how are you?") comment_preds("shit") comment_preds("You a freaking idiot!") comment_preds("f u")

Mrugankakarte commented 5 years ago

EDIT: I have added a new tokenizer named 'tuning_keras_text_tokenizer'. Please use this one and let me know the results.

The results may vary by a small amount, since my example on GitHub averaged two models. I cannot post the other model because the code is not clean and is spread across various scripts.
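
The averaging itself is simple; here is a minimal sketch, assuming a second trained model (model_1 and model_2 are illustrative names, and the second model is not in this repo):

preds_1 <- model_1 %>% predict(X_test)
preds_2 <- model_2 %>% predict(X_test)
preds   <- (preds_1 + preds_2) / 2    # element-wise average of the per-label probabilities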

[screenshot: test]

Strange, but this is what I am getting.

  1. Do you have the GPU version of Keras installed? I don't know if that's the issue.
  2. Can you compare your dictionary with mine? If the word indices differ, that's the reason, but I am pretty sure the dictionaries should be the same. (A diff sketch follows below.)
# Build a word/index lookup table from the tokenizer
word_index_df <- data.frame(
  word = tokenizer$word_index %>% names(),
  index = tokenizer$word_index %>% unlist(use.names = FALSE),
  stringsAsFactors = FALSE
)

Please compare it with this: [screenshot: test2]
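
If it helps, here is a minimal diff sketch, assuming your table is my_word_index_df and the reference one is ref_word_index_df (illustrative names):

mismatches <- merge(my_word_index_df, ref_word_index_df,
                    by = "word", suffixes = c("_mine", "_ref"))
mismatches <- mismatches[mismatches$index_mine != mismatches$index_ref, ]
head(mismatches)    # any rows here mean the word indices disagree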

Mrugankakarte commented 5 years ago

As for training it from scratch:

  1. Un-comment everything except the experimental part where the AUC_RoC function is defined. You can delete that part.
  2. You will need the embeddings file, which can be downloaded from here.
  3. I have used 'glove.twitter.27B.200d.txt'; you can choose any of them, just make sure you make the corresponding change in embd_size.
  4. You can delete this line: #lines <- readLines(file.path('Data/embeddings/wiki-news-300d-1M-subword.vec')). It's just an example of loading the file.
  5. Make sure you save the embeddings to your desktop once they are loaded, because loading takes a lot of time and you don't want to do it again and again. That's the main reason the code is commented out: it was run once, saved to desktop, and then reused as required (see the sketch below).
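
As a rough illustration of step 5, here is a minimal caching sketch, assuming the GloVe file sits at Data/embeddings/glove.twitter.27B.200d.txt (the paths are illustrative; adjust to wherever you put the file):

embd_size <- 200
lines <- readLines("Data/embeddings/glove.twitter.27B.200d.txt")
embeddings <- new.env()                                  # word -> numeric vector lookup
for (line in lines) {
  parts <- strsplit(line, " ", fixed = TRUE)[[1]]
  embeddings[[parts[1]]] <- as.numeric(parts[-1])
}
saveRDS(embeddings, "Data/embeddings/glove_200d.rds")    # cache the slow step once
# Later runs: embeddings <- readRDS("Data/embeddings/glove_200d.rds")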