TeamHG-Memex / sklearn-crfsuite

scikit-learn inspired API for CRFsuite

Test data format #44

Open Janai2019 opened 4 years ago

Janai2019 commented 4 years ago

The example builds the test features with `X_test = [sent2features(s) for s in test_sents]`. Looking at the format of the test data, feature extraction seems to require tagged test data, in particular the current tag. In practice, the goal is to tag new data where no such information is available, only word features. How do we tag new data?

mani2106 commented 4 years ago

Normally you would tag a set of sentences and split them into training and test/evaluation sets, so you can check that the model does not overfit (memorize) the training data. You then predict on the test data, calculate the scores/metrics, and decide whether the model is suitable for real-world data.
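A minimal sketch of that split, assuming each labeled sentence is a list of (word, label) pairs. The toy data and the `split_sents` helper are hypothetical and only illustrate the idea; they are not taken from the documentation.

```python
import random

# Toy labeled sentences (illustrative only): each token is a (word, label) pair.
tagged_sents = [
    [("Alice", "B-PER"), ("lives", "O"), ("in", "O"), ("Paris", "B-LOC")],
    [("Bob", "B-PER"), ("works", "O"), ("at", "O"), ("Acme", "B-ORG")],
    [("Carol", "B-PER"), ("visited", "O"), ("Rome", "B-LOC")],
    [("Dave", "B-PER"), ("left", "O"), ("Oslo", "B-LOC")],
]


def split_sents(sents, test_fraction=0.25, seed=0):
    """Shuffle a copy of the sentences and hold out a fraction for evaluation."""
    shuffled = list(sents)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # Held-out sentences come first in the shuffled list; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


train_sents, test_sents = split_sents(tagged_sents)

# The labels of the held-out sentences are used only for scoring predictions.
y_test = [[label for _, label in sent] for sent in test_sents]
```

The held-out labels in `y_test` are compared against the model's predictions to compute metrics; they are not inputs to the model.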

This is what the example in the documentation does.
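On tagging new data: as far as I can tell from the tutorial, the feature function reads only the word and its POS tag, while the label is read separately when building the training targets. So untagged text can be turned into features as long as the feature function does not touch the label. Below is a rough sketch under that assumption. The simplified `word2features` here uses word-only features (no POS tags) and is not the documentation's version; `crf` stands for an already trained `sklearn_crfsuite.CRF` model that is not created in this snippet.

```python
def word2features(sent, i):
    """Features for token i, computed from the words alone (no gold label needed)."""
    word = sent[i]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
    }
    if i > 0:
        features["-1:word.lower()"] = sent[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        features["+1:word.lower()"] = sent[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sent):
    """One feature dict per token in the sentence."""
    return [word2features(sent, i) for i in range(len(sent))]


# New, untagged input: plain lists of tokens.
new_sents = [["Bob", "flew", "to", "Berlin"]]
X_new = [sent2features(s) for s in new_sents]

# With a trained model (assumed to exist, trained on the same feature function):
# y_pred = crf.predict(X_new)
```

If the model was trained with POS-tag features, as in the tutorial, new text would first need to be POS-tagged so the same features can be computed at prediction time.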