saikalyan9981 opened this issue 4 years ago
`# featurizer.apply(split=0, train=False, parallelism=PARALLEL)`

This line should not be commented out. Please execute it so that the features are created and stored in Postgres.
How should I use `upsert_keys` and `drop_keys`, and am I extracting features correctly?
"keys" in this case means the names of features that are used by Emmental.
Suppose N
keys are selected (by featurizer.apply(train=True
), here is how Featurizer
works:
featurizer.apply
creates features from candidates with no filter.featurizer.get_feature_matrices
filters features using the "keys" and returns a feature matrix (M x N
) for M candidates. featurizer.upsert_keys(key_names)
is the way to tell a newly initialized Featurizer
which keys were selected in the training phase.
You can check len(test_cands)
X len(key_names)
is equal to the shape of F_test
.
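As an abstract sanity check (using dummy stand-ins rather than real Fonduer objects, since only the list lengths and the matrix shape are being compared here):

```python
import numpy as np

# Hypothetical stand-ins: 5 test candidates and 3 selected feature keys.
test_cands = ["cand_%d" % i for i in range(5)]
key_names = ["key_a", "key_b", "key_c"]

# F_test plays the role of the matrix returned by get_feature_matrices:
# one row per candidate (M), one column per selected key (N).
F_test = np.zeros((len(test_cands), len(key_names)))

# The check mentioned above: M x N must match the matrix shape.
assert F_test.shape == (len(test_cands), len(key_names))
print(F_test.shape)  # (5, 3)
```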
Regarding `torch.no_grad()`, I think you can safely do so in the inference phase. (And I have to do the same in fonduer_model.py.)
According to https://pytorch.org/docs/stable/generated/torch.no_grad.html:

> Disabling gradient calculation is useful for inference, when you are sure that you will not call `Tensor.backward()`. It will reduce memory consumption for computations that would otherwise have `requires_grad=True`.
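A minimal sketch of wrapping inference in `torch.no_grad()` (a generic toy model here, not the actual Fonduer/Emmental classifier):

```python
import torch

# A toy model standing in for the trained classifier.
model = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)

# Inside no_grad, no computation graph is built, so memory use drops
# and .backward() cannot be called on the result.
with torch.no_grad():
    out = model(x)

print(out.requires_grad)  # False
```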
Thank you for clarifying my doubts @HiromuHota.
To make sure I understood it right:

1. First `featurizer.apply(split=0, train=True, parallelism=PARALLEL)`
2. Then drop all the keys present
3. Then upsert the stored keys
4. Then `get_feature_matrices`
```python
candidateExtractor = CandidateExtractor(session, [candidate])
candidateExtractor.apply(fonduerPipeline.contexts[context], split=0, parallelism=PARALLEL)
test_cands = candidateExtractor.get_candidates(split=0)

featurizer = Featurizer(session, [candidate])
featurizer.apply(split=0, train=True, parallelism=PARALLEL)

key_names_drop = [key.name for key in featurizer.get_keys()]
featurizer.drop_keys(key_names_drop)
featurizer.upsert_keys(key_names)

F_test = featurizer.get_feature_matrices(test_cands)
```
@HiromuHota Can you please comment on whether this is good?
I'd suggest two changes: use `split=2` for the test candidates, and use `train=False` instead of `train=True` on test_cands; then there is no need to drop the trained keys right after training. So your code should look like below.
```python
candidateExtractor = CandidateExtractor(session, [candidate])
candidateExtractor.apply(fonduerPipeline.contexts[context], split=2, parallelism=PARALLEL)
test_cands = candidateExtractor.get_candidates(split=2)

featurizer = Featurizer(session, [candidate])
featurizer.apply(split=2, train=False, parallelism=PARALLEL)
featurizer.upsert_keys(key_names)

F_test = featurizer.get_feature_matrices(test_cands)
```
This code assumes that the backend Postgres has no keys for the `Featurizer`. If it does, please drop the existing keys before upserting another set of them.
@HiromuHota Thanks for the correction. In my use case, the backend Postgres does have some keys for the `Featurizer`, because I'm using it a second time for a different candidate. Although I try to drop the keys, some of them aren't getting dropped. But when I drop them after `train=True`, all keys get dropped:
```python
key_names_drop = [key.name for key in featurizer.get_keys()]
featurizer.drop_keys(key_names_drop)
print(len(featurizer.get_keys()))  # prints 2531

featurizer.apply(split=2, train=True, parallelism=PARALLEL)
key_names_drop = [key.name for key in featurizer.get_keys()]
featurizer.drop_keys(key_names_drop)
print(len(featurizer.get_keys()))  # prints 0
```
This is the reason I'm using `train=True` and then dropping the keys.
@saikalyan9981 Thank you for letting us know the reason behind this. This behavior looked strange to me at first, but I can see it more clearly now after reading the documentation carefully. Nonetheless, this behavior is still very confusing and would require improvement.
Here is why this happens:

- `Featurizer` takes a list of candidate classes when it gets initialized.
- `featurizer.drop_keys` drops only keys that are associated with this list of candidate classes.
- Meanwhile, `featurizer.get_keys()` returns a list of all keys that are stored in the database, no matter which candidate class they are associated with.
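This asymmetry can be illustrated with a toy in-memory model (this is not Fonduer's actual implementation, just a sketch of the semantics described above):

```python
class ToyFeaturizer:
    """Toy sketch: keys live in shared storage tagged by candidate class."""

    def __init__(self, all_keys, candidate_classes):
        # all_keys is shared storage: {key_name: candidate_class}
        self.all_keys = all_keys
        self.candidate_classes = set(candidate_classes)

    def get_keys(self):
        # Returns every key in storage, regardless of candidate class.
        return list(self.all_keys)

    def drop_keys(self, names):
        # Drops only keys tied to THIS featurizer's candidate classes.
        for name in names:
            if self.all_keys.get(name) in self.candidate_classes:
                del self.all_keys[name]


store = {"f1": "CandA", "f2": "CandA", "f3": "CandB"}
fz = ToyFeaturizer(store, ["CandA"])

# Attempt to drop everything get_keys reports...
fz.drop_keys(fz.get_keys())

# ...but the key belonging to the other candidate class survives.
print(fz.get_keys())  # ['f3']
```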
Your code should work as expected, but `session.query(FeatureKey).delete(synchronize_session="fetch")` should also work to clear the keys. It is a little hacky, though.
I have gone through the packaging code in MLflow. Thank you, it was very useful for me. While testing, I think the code in hardware_fonduer_model classifies one document at a time; however, I would like to test multiple documents at once. Is this code snippet correct for testing the model? Is this the right way to do it? I'm not sure how to use `upsert_keys` and `drop_keys`, or whether I'm extracting features correctly. And should I add `torch.no_grad()` while predicting?