chengsoonong / digbeta

Active learning for Big Data
GNU General Public License v3.0

Feature varies during learning #107

Closed cdawei closed 7 years ago

cdawei commented 7 years ago

Currently, POI features (e.g. popularity, average visit duration) and transition features are computed from the training set. When doing cross-validation to tune hyper-parameters, the features of a given POI or transition therefore differ from those used at test time, which are computed from the training + validation set.

With this approach, the hyper-parameters are tuned on features that differ from the ones used at test time, which seems abnormal in my opinion.
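A minimal sketch of the issue (names and data are illustrative, not from the repo): the same POI gets a different "popularity" feature value depending on which subset the feature is computed from.

```python
# Toy visit records as (user, poi) pairs; purely hypothetical data.
from collections import Counter

train = [("u1", "p1"), ("u2", "p1"), ("u3", "p2")]
validation = [("u4", "p1"), ("u5", "p2")]

def popularity(visits):
    """POI popularity = number of visits in the given set."""
    return Counter(poi for _, poi in visits)

pop_train = popularity(train)                    # seen while tuning
pop_train_val = popularity(train + validation)   # seen at test time

print(pop_train["p1"])      # 2
print(pop_train_val["p1"])  # 3 -- same POI, different feature value
```

So the hyper-parameters are selected against feature values that no longer hold once the validation data is folded back in.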

cdawei commented 7 years ago

Compare this with sequence prediction in NLP, where each word is described by its embedding, which can be viewed as a feature vector. These embeddings are generated by tools such as "word2vec" trained on a large corpus; this approach ensures that a given word has the same feature vector at both training and test time.
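The analogy in miniature (the vectors below are made-up placeholders, not real word2vec output): the embedding table is built once and then frozen, so the same lookup is used in both phases.

```python
# Hypothetical frozen embedding table; in practice this would come
# from a tool like word2vec trained on a large external corpus.
embeddings = {
    "museum": [0.2, -0.1, 0.7],
    "park":   [0.5,  0.3, -0.2],
}

def feature_vector(word):
    # Identical lookup during training and testing -- the feature
    # for a given word never changes between the two phases.
    return embeddings[word]

print(feature_vector("park"))  # [0.5, 0.3, -0.2]
```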

I suggest we fix the features of a given POI or transition across both training and test. However, if we compute POI or transition features on the whole dataset, does that mean we leak information?

cdawei commented 7 years ago

POI-query features are now computed using the full training + validation set and kept fixed during Monte-Carlo cross-validation. Implemented in ba79260.
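A hedged sketch of this fix (function and variable names are hypothetical, not from ba79260): features are computed once on training + validation, then every Monte-Carlo split reuses the same fixed feature table.

```python
# Sketch: fixed features across Monte-Carlo cross-validation splits.
import random
from collections import Counter

def poi_features(visits):
    """Popularity per POI, computed once on training + validation."""
    return Counter(poi for _, poi in visits)

# Toy data: 12 visits spread evenly over 3 POIs (illustrative only).
data = [("u%d" % i, "p%d" % (i % 3)) for i in range(12)]
features = poi_features(data)  # computed once, never recomputed

rng = random.Random(0)
for split in range(5):
    shuffled = data[:]
    rng.shuffle(shuffled)
    train, val = shuffled[:8], shuffled[8:]
    # A model would be trained on `train` and scored on `val` here,
    # but both phases look up the same fixed `features` table.
    assert poi_features(data) == features
```

The leakage concern raised above is narrowed rather than eliminated: the features see the validation data, but the held-out test set can still be kept out of the feature computation.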