Closed: dhauser18 closed this issue 5 years ago.
Hi,
The problem is that if a word is not present in the vocabulary, it shows this error; otherwise it works. For example, when typing the word 'master': 'mast' and 'master' work, but 'maste' throws the error.
Thanks for bringing up this issue. The parameter 'oov_token' was added in the Keras 2.1.3 source file, and setting it (even to NULL) on the loaded tokenizer should solve the problem, but for some reason it is not working. I will train the model again and let you know if that resolves the error.
Added the file which generates the keras model. Please let me know if you need any help.
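For anyone retraining from scratch: in the R interface to Keras (2.1.3 and later), the OOV token can also be set when the tokenizer is first created, so the saved tokenizer already contains it and no patch is needed at prediction time. A minimal sketch, assuming the keras R package is installed; the num_words value, the '<unk>' string, and the save path are illustrative choices, not the values used in this repo:

```r
library(keras)

# Hypothetical sketch: create the tokenizer with an OOV token up front,
# so out-of-vocabulary words map to '<unk>' instead of raising an error.
tokenizer <- text_tokenizer(num_words = 20000, oov_token = '<unk>')
fit_text_tokenizer(tokenizer, c("example training text"))

# A tokenizer saved this way never needs the manual oov_token patch later.
save_text_tokenizer(tokenizer, "Data/keras_text_tokenizer")
```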
Hi,
Just add these two lines in Prediciton.R after the tokenizer is loaded, and that should get rid of the error. I have updated the files, so you can just download and run them.
tokenizer <- load_text_tokenizer(filename = "Data/keras_text_tokenizer")
tokenizer$oov_token <- '<unk>'
fit_text_tokenizer(tokenizer, c('<unk>'))
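To see why these two lines fix the error, here is a plain base-R sketch of what an OOV token does (no keras required; the word index and function name are made up for illustration): any word missing from the word index falls back to the slot reserved for '<unk>' instead of failing.

```r
# Toy word index; in keras this role is played by tokenizer$word_index.
word_index <- list("<unk>" = 1, "mast" = 2, "master" = 3)

# Map each word to its index, falling back to the '<unk>' slot
# for out-of-vocabulary words such as 'maste'.
to_indices <- function(words, index, oov = "<unk>") {
  vapply(words, function(w) {
    if (!is.null(index[[w]])) index[[w]] else index[[oov]]
  }, numeric(1))
}

to_indices(c("master", "maste", "mast"), word_index)
# -> master = 3, maste = 1 (the '<unk>' slot), mast = 2
```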
Thank you very much. The tokenizer$oov_token <- '<unk>' fix got rid of the error, but the prediction results I am getting now look different from yours.
That's strange! The model cannot change, since you are just loading the trained weights. I just downloaded the folder and ran the code, and it's working perfectly fine for me. Can you please tell me how exactly you are executing the code?
Hi. Thanks again for helping me figure this out. Given the oov_token error, you suggested I add code right after the model is loaded:
tokenizer$oov_token <- '<unk>'
fit_text_tokenizer(tokenizer, c('<unk>'))
This allows the system to work without generating the errors I had before. However, according to some of your notes on GitHub (in the Shiny example), for the text "I will kill you" the probability of toxic is 0.89151 and of threat is 0.9761. My numbers, however, differ:
comment_preds("I will kill you")
          toxic severe_toxic     obscene      threat     insult identity_hate
[1,] 0.02817484  0.006310252 0.007882276 0.008860963 0.01063919   0.009279957
You show another example (the text "Why are you so stupid?") that gives high probabilities of toxic and insult, but the results I get are:
comment_preds("Why are you so stupid?")
          toxic severe_toxic     obscene      threat      insult identity_hate
[1,] 0.02124369  0.004242653 0.005663843 0.006379691 0.006861798   0.006401295
So I then tried to run your NN code to recreate the model and see if that is the issue, but I could not determine which lines should be included (many lines of code have # in front of them), and I do not have the various GloVe files it is calling, so I could not figure out how to rerun the models from the beginning. Can you suggest anything? Below is the code as I have it for the prediction with the models. Thank you very much.
model <- load_model_hdf5("D:/Users/750001540/Desktop/Toxic Comments/Toxic Comments v1/Shiny-app/Data/keras_model.h5", compile = F)
tokenizer <- load_text_tokenizer(filename = "D:/Users/750001540/Desktop/Toxic Comments/Toxic Comments v1/Shiny-app/Data/keras_text_tokenizer")
tokenizer$oov_token <- '<unk>'
fit_text_tokenizer(tokenizer, c('<unk>'))

maxlen = 150
label = c('toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate')

comment_preds <- function(x){
  preds <- data.frame()
  if(x == "") {
    preds <- data.frame(toxic = 0, severe_toxic = 0, obscene = 0,
                        insult = 0, threat = 0, identity_hate = 0)
  } else {
    X_test <- texts_to_sequences(tokenizer, list(x))
    X_test <- pad_sequences(X_test, maxlen = maxlen, padding = "post",
                            value = 0, truncating = "post")
    preds <- model %>% predict(X_test)
    colnames(preds) <- label
    return(preds)
  }
}

comment_preds("I will kill you")
comment_preds("Why are you so stupid?")
comment_preds("Hello, how are you?")
comment_preds("shit")
comment_preds("You a freaking idiot!")
comment_preds("f u")
EDIT: I have added a new tokenizer named 'tuning_keras_text_tokenizer'... please use this and let me know the results.
The results may vary by a small amount, since my example on GitHub was the average of two models. I cannot post the other model because it is not clean and is spread across various scripts.
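Averaging two models is just an element-wise mean of their probability matrices, one column per label. A base-R sketch with entirely made-up numbers (the actual second model and its outputs are not published):

```r
# Made-up single-row prediction matrices from two hypothetical models.
labels <- c('toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate')
preds_model_1 <- matrix(c(0.90, 0.10, 0.20, 0.95, 0.30, 0.05), nrow = 1)
preds_model_2 <- matrix(c(0.88, 0.12, 0.24, 0.99, 0.28, 0.07), nrow = 1)

# The ensemble score is the element-wise average of the two matrices.
ensemble <- (preds_model_1 + preds_model_2) / 2
colnames(ensemble) <- labels
ensemble
```

With only one of the two models available, individual probabilities can therefore differ noticeably from the published ensemble numbers.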
Strange but this is what I am getting.
word_index_df <- data.frame(
word = tokenizer$word_index %>% names(),
index = tokenizer$word_index %>% unlist(use.names = FALSE),
stringsAsFactors = FALSE
)
Please compare it with this.
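One way to do that comparison, once both sides have built a word_index_df as above, is to merge() the two data frames on the word and flag rows where the indices disagree; any mismatch means the two tokenizers encode text differently. A base-R sketch with invented data:

```r
# Hypothetical word -> index tables from two tokenizers.
index_mine  <- data.frame(word = c("the", "cat", "dog"),
                          index = c(1, 2, 3),
                          stringsAsFactors = FALSE)
index_yours <- data.frame(word = c("the", "cat", "dog"),
                          index = c(1, 3, 2),
                          stringsAsFactors = FALSE)

# Join on the word and keep only the rows whose indices differ.
both <- merge(index_mine, index_yours, by = "word",
              suffixes = c("_mine", "_yours"))
mismatches <- both[both$index_mine != both$index_yours, ]
mismatches
```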
As for training it from scratch:

embd_size
#lines <- readLines(file.path('Data/embeddings/wiki-news-300d-1M-subword.vec'))

...It's an example of how to load the embedding file.
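For anyone without the embedding files: each line of a GloVe/fastText .vec text file is a word followed by the components of its vector, so once downloaded it parses with plain base R. A sketch using invented sample lines (the real wiki-news-300d-1M-subword.vec has 300 numbers per line; 3 are used here for brevity):

```r
# Invented sample of what lines in a .vec embedding file look like:
# "word v1 v2 ... vN". Normally these come from readLines() on the file.
lines <- c("hello 0.10 0.20 0.30",
           "world 0.40 0.50 0.60")

# Split each line on spaces: first token is the word, the rest is the vector.
parts <- strsplit(lines, " ", fixed = TRUE)
words <- vapply(parts, `[`, character(1), 1)
embeddings <- t(vapply(parts, function(p) as.numeric(p[-1]), numeric(3)))
rownames(embeddings) <- words

embeddings["hello", ]
```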
Hi. I like the simplicity of your code; it looks great. I have a question: when I ran your code, it gave me this error: 'Tokenizer' object has no attribute 'oov_token'. Any suggestions on what I need to do in the R code to fix this? Also, can you share the code you created to generate the original Keras models? I would like to learn from how you did it, since your descriptions look interesting. Thanks.