HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
MIT License

Is this the right way to test the saved emmental models? #511

Open saikalyan9981 opened 4 years ago

saikalyan9981 commented 4 years ago

I have gone through the code for packaging with MLflow. Thank you, it was very useful for me. While testing, I noticed that the code here (hardware_fonduer_model) classifies one document at a time. However, I would like to test on multiple documents at once.

So is this code snippet a correct way to test the model?

import os
import pickle

# Load the pickled model artifacts
with open(os.path.join(model_path, f"{model_name}.pkl"), "rb") as f:
    model_dict = pickle.load(f)
key_names = model_dict["feature_keys"]
word2id = model_dict["word2id"]
emmental_model = _load_emmental_model(model_dict["emmental_model"])

## Extracting Test Candidates and getting features
candidateExtractor = CandidateExtractor(session, [candidate])
candidateExtractor.apply(context, split=0, parallelism=PARALLEL)
test_cands = candidateExtractor.get_candidates(split = 0)
featurizer = Featurizer(session, [candidate])
# featurizer.apply(split=0, train=False, parallelism=PARALLEL)

# featurizer.drop_keys(key_names)

F_test = featurizer.get_feature_matrices(test_cands)
test_dataloader = EmmentalDataLoader(
    task_to_label_dict={candidate_name: "labels"},
    dataset=FonduerDataset(candidate_name, test_cands[0], F_test[0], word2id, 2),
)

# Setup config
config = {
    "meta_config": {"verbose": True},
    "model_config": {"model_path": None, "device": 0, "dataparallel": False},
    "learner_config": {
        "n_epochs": 50,
        "optimizer_config": {"lr": 0.001, "l2": 0.0},
        "task_scheduler": "round_robin",
    },
    "logging_config": {
        "evaluation_freq": 1,
        "counter_unit": "epoch",
        "checkpointing": False,
        "checkpointer_config": {
            "checkpoint_metric": {f"{candidate_name}/train/loss": "min"},
            "checkpoint_freq": 1,
            "checkpoint_runway": 2,
            "clear_intermediate_checkpoints": True,
            "clear_all_checkpoints": True,
        },
    },
}


## Get Test Predictions
test_preds = emmental_model.predict(test_dataloader, return_preds=True)

Is this the right way to do it? I'm not sure how to use upsert_keys and drop_keys, or whether I'm extracting features correctly. Also, should I add torch.no_grad() while predicting?

HiromuHota commented 4 years ago


# featurizer.apply(split=0, train=False, parallelism=PARALLEL)

This cannot be commented out. Please execute this so that features are created and stored in Postgres.

how to use upsert_keys, drop_keys and if I'm extracting features correctly?

"Keys" in this case means the names of the features that are used by Emmental. Suppose N keys are selected (by featurizer.apply(train=True)); here is how Featurizer works:

featurizer.upsert_keys(key_names) is the way to tell a newly initialized Featurizer which keys were selected in the training phase.

You can check that len(test_cands[0]) x len(key_names) is equal to the shape of F_test[0].
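The shape check above could be written as a small helper. This is only a sketch: the name check_feature_shape is hypothetical, and it assumes, as in the snippet from the question, that test_cands and F_test are lists with one entry per candidate class, so the first class sits at index 0.

```python
def check_feature_shape(candidates, key_names, feature_matrix):
    """Return the expected (rows, columns) shape, or raise if it differs.

    Hypothetical helper, not part of Fonduer. `feature_matrix` only needs
    a `.shape` attribute (for example a SciPy sparse matrix).
    """
    # One row per candidate, one column per feature key.
    expected = (len(candidates), len(key_names))
    actual = tuple(feature_matrix.shape)
    if actual != expected:
        raise ValueError(f"feature matrix shape {actual} != expected {expected}")
    return expected
```

For the code in the question, this would be called as check_feature_shape(test_cands[0], key_names, F_test[0]).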

Regarding torch.no_grad(), I think you can safely do so in the inference phase. (And I have to do the same in fonduer_model.py.) According to https://pytorch.org/docs/stable/generated/torch.no_grad.html:

Disabling gradient calculation is useful for inference, when you are sure that you will not call Tensor.backward(). It will reduce memory consumption for computations that would otherwise have requires_grad=True.
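As a rough illustration of the quoted advice, prediction can be wrapped in the torch.no_grad() context manager. The wrapper name predict_without_grad is hypothetical, and the model is only assumed to expose predict(dataloader, return_preds=True) as used in the question. The fallback to contextlib.nullcontext is there only so the sketch can run without PyTorch installed.

```python
import contextlib

try:
    import torch

    no_grad = torch.no_grad
except ImportError:
    # Fallback so the sketch still runs without PyTorch; it disables nothing.
    no_grad = contextlib.nullcontext


def predict_without_grad(model, dataloader):
    """Call model.predict with gradient tracking disabled.

    Hypothetical wrapper, not part of Emmental or Fonduer.
    """
    with no_grad():
        return model.predict(dataloader, return_preds=True)
```

With the names from the question, this would be test_preds = predict_without_grad(emmental_model, test_dataloader). If the library already disables gradients inside predict, the extra context manager is redundant but harmless.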

saikalyan9981 commented 4 years ago

Thank you for clarifying my doubts, @HiromuHota. To make sure I understood correctly: first run featurizer.apply(split=0, train=True, parallelism=PARALLEL), then drop all the keys present, then upsert the stored keys, and then call get_feature_matrices.

candidateExtractor = CandidateExtractor(session, [candidate])
candidateExtractor.apply(fonduerPipeline.contexts[context], split=0, parallelism=PARALLEL)
test_cands = candidateExtractor.get_candidates(split = 0)
featurizer = Featurizer(session, [candidate])
featurizer.apply(split=0, train=True, parallelism=PARALLEL)
key_names_drop = [key.name for key in featurizer.get_keys()]
F_test = featurizer.get_feature_matrices(test_cands)

@HiromuHota Could you please comment on whether this is correct?

HiromuHota commented 4 years ago

I'd suggest two changes:

  1. Use split=2 for test data. (That's the convention. 0/1/2 - train/dev/test).
  2. No need to set train=True for test_cands if you drop the newly created keys right afterward.

So your code should look like below.

candidateExtractor = CandidateExtractor(session, [candidate])
candidateExtractor.apply(fonduerPipeline.contexts[context], split=2, parallelism=PARALLEL)
test_cands = candidateExtractor.get_candidates(split = 2)
featurizer = Featurizer(session, [candidate])
featurizer.apply(split=2, train=False, parallelism=PARALLEL)
F_test = featurizer.get_feature_matrices(test_cands)

This code assumes that the backend Postgres has no keys for Featurizer. If it does, please drop the existing keys before inserting another set of them.
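Putting the steps discussed so far together, the test-time flow might look like the sketch below. The helper name featurize_test_split is hypothetical, the extractor and featurizer are assumed to be an already constructed CandidateExtractor and Featurizer, and the ordering (drop existing keys, upsert the trained keys, then apply with train=False) is one plausible arrangement rather than a confirmed recipe.

```python
def featurize_test_split(extractor, featurizer, docs, key_names, parallelism=1):
    """Extract test candidates and build feature matrices with trained keys.

    Hypothetical helper. `key_names` are the feature keys saved at training time.
    """
    # Split 2 is the test split by convention (0/1/2 = train/dev/test).
    extractor.apply(docs, split=2, parallelism=parallelism)
    test_cands = extractor.get_candidates(split=2)

    # Drop the keys currently visible, then restore the trained key set.
    existing = [key.name for key in featurizer.get_keys()]
    if existing:
        featurizer.drop_keys(existing)
    featurizer.upsert_keys(key_names)

    # train=False: create features for the test split without updating the keys.
    featurizer.apply(split=2, train=False, parallelism=parallelism)
    f_test = featurizer.get_feature_matrices(test_cands)
    return test_cands, f_test
```

The returned f_test can then be passed to the data loader in the same way as F_test in the question.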

saikalyan9981 commented 4 years ago

@HiromuHota Thanks for the correction. In my use case, the backend Postgres already has some keys for Featurizer when I use it a second time for a different candidate. Although I try to drop the keys, some of them aren't dropped. But when I drop them after running with train=True, all keys get dropped.

key_names_drop = [key.name for key in featurizer.get_keys()]
print(len(featurizer.get_keys())) ## prints 2531

featurizer.apply(split=2, train=True, parallelism=PARALLEL)
key_names_drop = [key.name for key in featurizer.get_keys()]
print(len(featurizer.get_keys())) ## prints 0

This is the reason I'm using train=True and then dropping the keys.

HiromuHota commented 4 years ago

@saikalyan9981 Thank you for letting us know the reason behind this. This behavior looked strange to me at first, but I can see it more clearly now after reading the documentation carefully. Nonetheless, this behavior is still very confusing and needs improvement.

Here is why this happens: Featurizer takes a list of candidate classes when it is initialized. featurizer.drop_keys drops only the keys that are associated with this list of candidate classes. Meanwhile, featurizer.get_keys() returns a list of all keys stored in the database, no matter which candidate class they are associated with.
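The mismatch between the two methods can be reproduced with a toy in-memory model. This is not Fonduer code: ToyKeyStore, the class names PartTemp and PartVolt, and the rule that a key disappears once no candidate class refers to it are all assumptions made for illustration.

```python
class ToyKeyStore:
    """In-memory stand-in for the feature-key table (not Fonduer code)."""

    def __init__(self):
        # key name -> set of candidate class names the key is associated with
        self._keys = {}

    def upsert(self, names, candidate_classes):
        for name in names:
            self._keys.setdefault(name, set()).update(candidate_classes)

    def get_keys(self):
        # Returns every key, regardless of candidate class.
        return sorted(self._keys)

    def drop_keys(self, names, candidate_classes):
        # Only removes the association with the given classes; a key is
        # deleted once no candidate class refers to it any more.
        for name in names:
            if name in self._keys:
                self._keys[name] -= set(candidate_classes)
                if not self._keys[name]:
                    del self._keys[name]


store = ToyKeyStore()
store.upsert(["f1", "f2"], ["PartTemp"])  # keys left over from an earlier class
store.upsert(["f2", "f3"], ["PartVolt"])  # keys for the current class

# A "featurizer" bound to PartVolt tries to drop every key it can see.
store.drop_keys(store.get_keys(), ["PartVolt"])
remaining = store.get_keys()  # keys still tied to PartTemp survive
```

In this toy run, the keys associated with the earlier class are still listed after the drop, which mirrors the non-zero count reported above.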

Your code should work as expected, but session.query(FeatureKey).delete(synchronize_session="fetch") should also work to clear the keys. It is a little hacky, though.
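The one-liner above could be wrapped as follows. This is a sketch only: clear_all_keys is a hypothetical name, session is assumed to be the SQLAlchemy session already used with Fonduer, and the explicit commit is an assumption about when the deletion should be persisted.

```python
def clear_all_keys(session, key_model):
    """Delete every row of `key_model` and return the number of rows deleted.

    Hypothetical helper. `key_model` is the mapped key class
    (FeatureKey in this thread).
    """
    deleted = session.query(key_model).delete(synchronize_session="fetch")
    session.commit()  # assumption: persist the deletion immediately
    return deleted
```

It would be called as clear_all_keys(session, FeatureKey). Note that this removes the keys for every candidate class, not just the ones the current Featurizer was initialized with.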