---
title: "Entry for Kaggle - Quora Competition"
author: "Team Members: Germayne Ng, Alson Yap"
---
## Competition

Link to competition details and question: https://www.kaggle.com/c/quora-question-pairs
## Closing
Our final submission was as follows:

| Model | Public Score | Private Score |
|---|---|---|
| Xgboost 3000 + LSTM | 0.14291 | 0.14602 |
| Xgboost 2500 + LSTM | 0.14276 | 0.14598 |

Surprisingly, xgboost 3000 overfits on its own, but when combined with the LSTM it works better than xgboost 2500. This netted us the silver medal (161/3394).
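The final submissions blend the xgboost and LSTM predicted probabilities. The writeup does not state the actual ensemble weights, so the 0.7/0.3 split below is a placeholder; this is just a sketch of the weighted-average idea.

```python
import numpy as np

def blend(xgb_preds, lstm_preds, w=0.7):
    """Weighted average of two models' predicted duplicate probabilities.

    The 0.7/0.3 split is a placeholder; the writeup does not state the
    actual ensemble weights used.
    """
    xgb_preds = np.asarray(xgb_preds, dtype=float)
    lstm_preds = np.asarray(lstm_preds, dtype=float)
    return w * xgb_preds + (1.0 - w) * lstm_preds

# Example: blending per-row probabilities from the two models
print(blend([0.9, 0.2], [0.7, 0.4]))  # [0.84 0.26]
```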
## Update Logs
- Version Final:
- Added pagerank features, overlap and matching coefficients, and BOW features (due to time constraints)
- Tuned xgboost parameters. Final xgboost score: 0.15028
- Ran LSTM model. Score = 0.15848
- Ensemble model. Score = 0.14276
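The pagerank features treat questions as nodes in a graph with an edge per question pair; a question's centrality in that graph becomes a feature. The writeup does not show the implementation, so this is a minimal power-iteration sketch rather than the team's actual code.

```python
import numpy as np

def pagerank(edges, n, d=0.85, iters=50):
    """Power-iteration PageRank over a question graph.

    Nodes are question ids; an (undirected) edge joins the two
    questions of each pair. Returns one score per node, usable as a
    per-question feature.
    """
    M = np.zeros((n, n))
    deg = np.zeros(n)
    for a, b in edges:
        M[b, a] += 1.0  # edge contributes in both directions
        M[a, b] += 1.0
        deg[a] += 1
        deg[b] += 1
    deg[deg == 0] = 1.0
    M /= deg  # normalize each column by that node's degree
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * M.dot(r)
    return r

# Toy graph: question 0 appears in two pairs, questions 1 and 2 in one each
scores = pagerank([(0, 1), (0, 2)], n=3)
print(scores.argmax())  # the most-connected question scores highest
```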
- Version 2.1 - 31st May 2017:
- Added 4 jellyfish distance features
- Added 2 LSA Component 3 features
- Added 1 weighted intersection feature (q1_q2_wm_ratio)
- Added 2 k-core decomposition features
- Total features: 69
- Improved score by ~0.008
- Score - 0.15244
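The k-core decomposition features also come from the question graph: a question's core number measures how densely clustered its neighbourhood is. A dependency-free sketch of the standard min-degree peeling algorithm (the team may well have used a graph library instead):

```python
from collections import defaultdict

def core_numbers(edges):
    """Iteratively peel minimum-degree nodes to get each node's core number.

    In the question graph (questions = nodes, pairs = edges), a high
    core number means the question sits in a densely connected cluster.
    """
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    core = {}
    k = 0
    while deg:
        v = min(deg, key=deg.get)  # peel the current minimum-degree node
        k = max(k, deg[v])
        core[v] = k
        for u in adj[v]:
            if u in deg:
                deg[u] -= 1
        del deg[v]
    return core

# Triangle plus a pendant node: triangle nodes are 2-core, pendant is 1-core
print(core_numbers([(1, 2), (2, 3), (1, 3), (3, 4)]))
```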
- Version 2.0 - 21st May 2017:
- Replaced old TFIDF features
- Dropped 2 magic hash features (-2)
- Added 4 location features and 1 new magic feature part 2 (+5)
- Total features: 60
- Improved score by 0.1 :o
- Score - 0.16010
- Version 1.9 - 18th May 2017:
- Added Hamming distance and shared 2-gram features.
- Total features: 57
- Improved score by 0.002
- Score - 0.263XX
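The exact definitions of these two features are not given in the writeup, so below is one plausible variant of each: a normalized position-wise Hamming-style match over tokens, and a Jaccard-style overlap of word 2-grams.

```python
def hamming_share(q1_tokens, q2_tokens):
    """Position-wise token agreement, normalized by the longer question.

    A loose Hamming-style similarity for unequal-length token lists;
    the writeup's exact definition is not given, so this is one
    plausible variant.
    """
    matches = sum(a == b for a, b in zip(q1_tokens, q2_tokens))
    return matches / max(len(q1_tokens), len(q2_tokens))

def shared_2gram_share(q1_tokens, q2_tokens):
    """Jaccard-style overlap of word bigrams between the two questions."""
    g1 = {tuple(q1_tokens[i:i + 2]) for i in range(len(q1_tokens) - 1)}
    g2 = {tuple(q2_tokens[i:i + 2]) for i in range(len(q2_tokens) - 1)}
    if not g1 and not g2:
        return 0.0
    return len(g1 & g2) / len(g1 | g2)

q1 = "how do i learn python".split()
q2 = "how do i learn java".split()
print(hamming_share(q1, q2))       # 0.8 (4 of 5 positions match)
print(shared_2gram_share(q1, q2))  # 3 shared bigrams of 5 total -> 0.6
```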
- Version 1.8 - 16th May 2017:
- Cleaned the original train and test CSV files with the 'word replacement cleaning.py' script, then reran all of the generated features except Abhishek's features.
- Improved score slightly, by 0.005
- Score - 0.265XX
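The actual rule list lives in 'word replacement cleaning.py'; the snippet below only illustrates the kind of word-level replacements such a script typically applies (the specific patterns here are examples, not the team's list).

```python
import re

# Illustrative replacements only; the real list is in
# 'word replacement cleaning.py' and is not reproduced in this writeup.
REPLACEMENTS = {
    r"\bwhat's\b": "what is",
    r"\bcan't\b": "cannot",
    r"\bwon't\b": "will not",
    r"\be-mail\b": "email",
}

def clean(text):
    """Lower-case the question and apply word-level replacements."""
    text = text.lower()
    for pattern, repl in REPLACEMENTS.items():
        text = re.sub(pattern, repl, text)
    return text

print(clean("What's the best e-mail app?"))  # what is the best email app?
```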
- Version 1.7 - 13th May 2017:
- Revamped the LSA features, treating the combined train and test data as one 'document' corpus instead of fitting them separately.
- 6 new features: LSA Q1 components 1 and 2, Q2 components 1 and 2, and 2 distance features based on the components
- Total: 55 features (note that 2 features overwrite the old distance features, so there are 4 net additions)
- Major change in the xgboost.py script: section 5 is split into 5a, 5b and 5c. For cross-validation, run 5a then 5b. For modelling a submission, run 5a then 5c and skip 5b, mainly so the full training set is used for modelling.
- Score - 0.27XX
- Version 1.6 - 12th May 2017:
- Added 4 magic features - 0.30XX
- Added Abhishek's 13 features - 0.28XX
- Total 51 features
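The writeup does not say which four "magic" features were used. The best-known magic feature shared during this competition was question frequency (repeated questions leaked label information), so here is a sketch of that idea only, under the assumption it is what's meant:

```python
from collections import Counter

def question_frequency_features(pairs):
    """Frequency 'magic' features: how often each question text appears
    anywhere in the dataset (train + test pairs combined).

    Repeated questions were a strong leak in this competition. Which
    four variants the team actually used is not stated; this shows only
    the basic q1/q2 frequency pair.
    """
    counts = Counter()
    for q1, q2 in pairs:
        counts[q1] += 1
        counts[q2] += 1
    return [(counts[q1], counts[q2]) for q1, q2 in pairs]

pairs = [("a", "b"), ("a", "c"), ("a", "b")]
print(question_frequency_features(pairs))  # [(3, 2), (3, 1), (3, 2)]
```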
- Version 1.5 - 20th April 2017:
- Finally, our best score so far. Managed to make good use of the LSA component features.
- Distance features can be further distinguished: Alson computed distances on the raw single vectors; if you instead define LSA components and apply distances to those, you get a separate set of features.
- Based on Alson's distance functions, I created euclidean and manhattan functions for single vectors. Essentially, there are 2 features each based on euclidean and manhattan distances (4 in total).
LSA components: each question is now a vector of LSA-TFIDF components. (The values below are arbitrary, just for example's sake.)

| question | component 1 | component 2 | question | component 1 | component 2 |
|---|---|---|---|---|---|
| question 1 of pair 1 | 0.23 | 0.56 | question 2 of pair 1 | 0.4 | 0.7 |

So question 1, instead of words, is now [0.23, 0.56] and question 2 is [0.4, 0.7]. Now, as vectors, we can calculate their distances.
- Tuning nrounds to 1000 and lowering the learning rate to 0.1 gave better results.
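The LSA pipeline described above can be sketched with scikit-learn: TF-IDF over all questions (train + test as one corpus, per Version 1.7), truncated SVD down to a few components, then euclidean and manhattan distances per pair. Function and variable names here are illustrative, not the team's.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_distance_features(q1s, q2s, n_components=2):
    """Project questions into LSA space (TF-IDF + truncated SVD), fitting
    on all questions as one corpus, then take per-pair distances.

    Returns (q1_vectors, q2_vectors, euclidean, manhattan).
    """
    corpus = list(q1s) + list(q2s)
    tfidf = TfidfVectorizer().fit_transform(corpus)
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    vecs = svd.fit_transform(tfidf)
    v1, v2 = vecs[: len(q1s)], vecs[len(q1s):]
    euclid = np.linalg.norm(v1 - v2, axis=1)   # per-pair euclidean distance
    manhat = np.abs(v1 - v2).sum(axis=1)       # per-pair manhattan distance
    return v1, v2, euclid, manhat

q1s = ["how do i learn python", "what is machine learning"]
q2s = ["best way to learn python", "who won the world cup"]
v1, v2, e, m = lsa_distance_features(q1s, q2s)
print(e.shape, m.shape)  # one distance of each kind per pair
```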
- Version 1.4 - 4th April 2017:
- Expanded the TFIDF functions (3), added character count without spaces (3) and character count per word (3)
- Total 30 features. Score: 0.32624 (Rank 162: Top 14%)
- Updated features dataset in dropbox
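The character-count features are simple per-question statistics; a sketch of the kind of counts described above (exact definitions are not spelled out in the writeup):

```python
def char_count_features(question):
    """Character-count features for one question: character count without
    spaces, word count, and average characters per word."""
    words = question.split()
    n_chars = sum(len(w) for w in words)  # characters excluding spaces
    n_words = len(words)
    chars_per_word = n_chars / n_words if n_words else 0.0
    return n_chars, n_words, chars_per_word

print(char_count_features("how do i learn python"))  # (17, 5, 3.4)
```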
- Version 1.3 - 1st April 2017:
- Added Jaccard distance and Cosine distance features. Total 21 features. (Rank 133: Top 15%)
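Both of these distances operate on the word content of a pair. A dependency-free sketch, assuming Jaccard is taken over word sets and cosine over word-count vectors (the writeup does not pin down the exact variants):

```python
import math

def jaccard_distance(q1_tokens, q2_tokens):
    """1 - |intersection| / |union| over the two questions' word sets."""
    s1, s2 = set(q1_tokens), set(q2_tokens)
    if not s1 and not s2:
        return 0.0
    return 1.0 - len(s1 & s2) / len(s1 | s2)

def cosine_distance(q1_tokens, q2_tokens):
    """1 - cosine similarity over word-count vectors."""
    vocab = set(q1_tokens) | set(q2_tokens)
    v1 = [q1_tokens.count(w) for w in vocab]
    v2 = [q2_tokens.count(w) for w in vocab]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0 or n2 == 0:
        return 1.0
    return 1.0 - dot / (n1 * n2)

q1 = "how do i learn python".split()
q2 = "how do i learn java".split()
print(round(jaccard_distance(q1, q2), 3))  # 0.333 (4 shared of 6 total words)
```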
- Version 1.2 - 1st April 2017:
- Added 7 FuzzyWuzzy features. Total 19 features. (Rank 151: Top 15%)
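The FuzzyWuzzy library exposes several 0-100 string-similarity ratios (e.g. `fuzz.ratio`, `fuzz.token_sort_ratio`). To show the idea without the dependency, here is a stdlib sketch of two of them; FuzzyWuzzy's plain ratio is likewise SequenceMatcher-based when python-Levenshtein is not installed.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """0-100 similarity, the same idea as fuzzywuzzy's fuzz.ratio."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def token_sort_ratio(a, b):
    """Sort tokens first so word order does not matter, then compare
    (mirrors fuzzywuzzy's fuzz.token_sort_ratio)."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return simple_ratio(a_sorted, b_sorted)

print(token_sort_ratio("learn python fast", "fast python learn"))  # 100
```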
- Version 1.1 - 31st March 2017:
- Added additional features. Total 12 features. (Rank: 215 Top 25%)
- Version 1.0 - 30th March 2017:
- Implemented Xgboost with 6 features.
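The writeup does not list the initial six features, so the sketch below shows six representative starter features of the kind usually fed to a first xgboost model on this problem (lengths, word counts, shared words); the names are hypothetical.

```python
def basic_features(q1, q2):
    """Six starter features of the kind typically used for a first
    xgboost model; the writeup does not list the team's exact six,
    so these are representative examples."""
    w1, w2 = q1.lower().split(), q2.lower().split()
    shared = set(w1) & set(w2)
    total_words = len(w1) + len(w2)
    return {
        "q1_len": len(q1),            # character length of question 1
        "q2_len": len(q2),            # character length of question 2
        "q1_words": len(w1),          # word count of question 1
        "q2_words": len(w2),          # word count of question 2
        "shared_words": len(shared),  # words appearing in both questions
        "word_match_ratio": 2 * len(shared) / total_words if total_words else 0.0,
    }

feats = basic_features("How do I learn Python?",
                       "What is the best way to learn Python?")
print(feats["shared_words"], round(feats["word_match_ratio"], 2))
```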
## References