HamoyeHQ / stage-f-06-wine-tasting

This is an open source project for stage F of the Hamoye Data Science Internship program, 2020 cohort, with real-life applications in health, engineering, demography, education, and technology.

Add files via upload #33

Closed olaidejoseph closed 3 years ago

olaidejoseph commented 3 years ago

Hi guys, kindly review my work on LSTM.

Jolomi-Tosanwumi commented 3 years ago

Excellent job @olaidejoseph. Only that you forgot to lemmatize spacy_stop_words as well. Since we are lemmatizing our vocabulary via the tokenizer function, all stop words need to be lemmatized too; 'become' and 'became' are two different words in spacy_stop_words. I would advise lemmatizing the stop words after compiling them.

Good job overall... I will lemmatize all the stop words in our blended model.
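
A minimal sketch of what this could look like, assuming spaCy's default English stop-word set; the notebook's own tokenizer function is not shown here:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# spaCy's raw stop-word set keeps inflected forms such as
# 'become' and 'became' as separate entries.
spacy_stop_words = nlp.Defaults.stop_words

# Lemmatize every stop word after compiling them, so the set
# matches the lemmatized vocabulary from the tokenizer function.
lemmatized_stop_words = {
    token.lemma_.lower()
    for word in spacy_stop_words
    for token in nlp(word)
}
```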

Reviewers can find the notebook here

olaidejoseph commented 3 years ago
  1. The description is not among the data provided; I got it online.

  2. You are right, I should have used the KerasClassifier model; I thought that was the model I passed, but it wasn't. The train_test_split result was good.

There is not much difference; Jolomi will change it when blending.

On Thu, 12 Nov 2020, 08:40 Sharon Ibejih, notifications@github.com wrote:

@sharonibejih commented on this pull request.

Hi Olaide, this was a huge one here... Thanks a lot.

I'm assuming this LSTM3_wine_review is your most recent notebook. There are a few clarifications I'd like to get.

  1. The three user_inputs you used: had the models already seen them, either during training or testing?
  2. It would also have been great if the second model, the one using KerasClassifier(nlp_model), had been included in your final top_5_variety testing on the user_inputs. That model looked really uniform across all CVs compared to the very first model, which seemed to be overfitting.

Due to time, I'm not sure point two above will be feasible to try. If the answer to point one is NO, then I guess the third model (which you also named model2), the one using train_test_split, is the best. I think we should proceed with it.


Jolomi-Tosanwumi commented 3 years ago

Check the f1_score in your three notebooks... ypred is probabilities instead of the one-hot encodings. Perhaps that is why it is unusually higher than the accuracy.
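
A minimal sketch of the fix being pointed out, assuming model, X_test, and y_test are the fitted softmax model and label-encoded test data from the notebooks (average="weighted" is only an illustrative choice):

```python
import numpy as np
from sklearn.metrics import f1_score

# A softmax output layer yields one probability per class,
# so predict returns shape (n_samples, n_classes).
y_prob = model.predict(X_test)

# Convert the probabilities to label-encoded class predictions
# before scoring, rather than passing the raw probabilities.
y_pred = np.argmax(y_prob, axis=1)

print(f1_score(y_test, y_pred, average="weighted"))
```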

Jolomi-Tosanwumi commented 3 years ago

Also, move the for_fun notebook to the model folder.

olaidejoseph commented 3 years ago

Okay, I will do that.

The first f1_score I calculated using the KerasClassifier produces a single result. I used sparse categorical crossentropy, so I only label-encoded my variety column (the labels).

For the KerasClassifier, model.predict produces a single output, not a one-hot one.

For the plain Keras model.fit, since I used softmax, I get the individual probabilities of the 20 varieties. With the help of np.argmax(), I was able to pick the position of the highest probability, which is ordered the same way as the label encoding.

This explanation is based on the for_fun notebook.
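
Roughly the contrast being described here; keras_clf and raw_model are hypothetical names for the fitted KerasClassifier wrapper and the plain Keras model from the notebook:

```python
import numpy as np

# The scikit-learn wrapper's predict already returns a single
# label-encoded class per sample.
y_pred_wrapped = keras_clf.predict(X_test)   # shape: (n_samples,)

# The plain Keras model with a softmax head instead returns the
# individual probabilities of the 20 varieties.
y_prob = raw_model.predict(X_test)           # shape: (n_samples, 20)

# np.argmax picks the position of the highest probability, which
# is ordered the same way as the label encoding of the varieties.
y_pred_raw = np.argmax(y_prob, axis=1)
```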


Jolomi-Tosanwumi commented 3 years ago

Yeah @olaidejoseph. But look at this line of code in your notebooks: y_pred_test = model.predict(X_test)

model was first fitted on the whole dataset with a validation split of 0.25, so splitting into xtrain and xtest after that means some of xtest had already been seen by the model during fitting. Specifically, model wasn't refitted on xtrain before the predict method was called; that is why the testing f1_score is unusually higher than the accuracy.

Check it out and make the correction.
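
A minimal sketch of the correction being suggested, assuming X and y are the full feature matrix and label-encoded target from the notebooks (epochs and random_state are only illustrative):

```python
from sklearn.model_selection import train_test_split

# Split first, so X_test is never seen during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Refit on the training split only; validation_split now carves
# its validation data out of X_train and leaves X_test untouched.
model.fit(X_train, y_train, validation_split=0.25, epochs=10)

# The test-set predictions now reflect genuinely unseen data.
y_pred_test = model.predict(X_test)
```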