Closed cdawei closed 7 years ago
Compare with sequence prediction in NLP, where each word is described by its embedding, which can be viewed as a feature vector. These embeddings are generated by tools such as word2vec, trained on a large corpus; this approach ensures that a given word has the same feature vector for both training and test.
I suggest we fix the features of a given POI or transition for both training and test. However, if we compute POI or transition features on the whole dataset, does that mean we leak information?
POI-query features are computed using the full training+validation set, and kept fixed during Monte Carlo cross-validation. Implemented in ba79260
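A minimal sketch of the approach described in the commit above: features are computed once on the full training+validation set, and the Monte Carlo repetitions only re-draw the split while the feature table stays fixed. All function names and data here are hypothetical, not taken from the repository.

```python
import random
from collections import defaultdict

def compute_poi_features(visits):
    """Compute simple POI features (popularity, average duration)
    from a list of (user, poi, duration_minutes) records."""
    counts = defaultdict(int)
    durations = defaultdict(list)
    for user, poi, duration in visits:
        counts[poi] += 1
        durations[poi].append(duration)
    return {poi: {"popularity": counts[poi],
                  "avg_duration": sum(durations[poi]) / len(durations[poi])}
            for poi in counts}

# Hypothetical check-in records: (user, poi, duration_minutes).
visits = [("u1", "p1", 30), ("u2", "p1", 50), ("u1", "p2", 20),
          ("u3", "p2", 40), ("u2", "p3", 60), ("u3", "p1", 40)]

# Features computed ONCE on the whole training+validation set ...
features = compute_poi_features(visits)

# ... and kept fixed across Monte Carlo CV repetitions: only the
# train/validation split changes between repetitions, not the features.
rng = random.Random(0)
for rep in range(3):
    shuffled = visits[:]
    rng.shuffle(shuffled)
    train, val = shuffled[:4], shuffled[4:]
    # hyper-parameter tuning would fit on `train` and score on `val`,
    # both using the same fixed `features` table
```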
Currently, POI features (e.g. popularity, average duration) and transition features are computed based on the training set. When doing cross-validation to tune hyper-parameters, the features of a given POI or transition therefore differ from those used for testing (which are computed based on training + validation set).
In this approach, the hyper-parameters are tuned on features that differ from those used at test time, which seems problematic in my opinion.
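A tiny illustration of the mismatch (data is made up): the same POI gets a different popularity value depending on whether the feature is computed from the training split alone or from the full training + validation set.

```python
from collections import Counter

# Hypothetical check-in records: which POI each visit hit.
all_visits = ["p1", "p1", "p2", "p1", "p2", "p3"]  # training + validation
train_visits = all_visits[:4]                      # training split only

# POI popularity computed on the training split vs. the full set:
pop_train = Counter(train_visits)
pop_full = Counter(all_visits)

# "p2" has popularity 1 during hyper-parameter tuning but 2 at test
# time, so the model is tuned on a differently scaled feature.
print(pop_train["p2"], pop_full["p2"])
```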