Doing my HW a bit early :)
I like the work; it seems they did the obvious thing to try. My only criticisms:
Data leak: the models are trained and evaluated on the same Wikipedia sentences (T-REx, SQuAD, and Google-RE are all built from Wikipedia). Maybe we are not evaluating the models' comprehension of sentences but just their capacity to memorize and retrieve sequences; a rough overlap check is sketched below.
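A minimal sketch of such an overlap check, assuming a one-sentence-per-line pre-training dump and a LAMA-style JSONL evaluation file (the file names and field names here are assumptions):

```python
# Rough train/eval leakage check (file paths and field names are assumptions).
# Counts how many evaluation sentences, with the gold object filled back in,
# occur verbatim in the pre-training text.
import json

def normalize(s: str) -> str:
    return " ".join(s.lower().split())

with open("pretraining_sentences.txt") as f:  # one sentence per line
    train_sentences = {normalize(line) for line in f}

leaked = total = 0
with open("trex_eval.jsonl") as f:  # one JSON fact per line
    for line in f:
        fact = json.loads(line)
        sentence = fact["masked_sentence"].replace("[MASK]", fact["obj_label"])
        total += 1
        leaked += normalize(sentence) in train_sentences

print(f"{leaked}/{total} evaluation sentences appear verbatim in the training text")
```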
This work doesn't evaluate the LMs' capacity for inference. One easy way to do that would be to test whether the model propagates the same information to all entities of the same kind, given that it is stated for at least one of them in the training text and not for the rest; a sketch of such a probe follows.
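A minimal sketch of that probe using the HuggingFace fill-mask pipeline (the model choice and the frequent/rare entity split are assumptions, not measured from any corpus):

```python
# Propagation probe sketch: does the model assert a class-level property
# ("birds can fly") for entities it has rarely seen it stated for?
# Model choice and the frequent/rare split below are assumptions.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

frequent_birds = ["robin", "eagle", "sparrow"]  # presumably common in the training text
rare_birds = ["hoatzin", "potoo", "frogmouth"]  # presumably rare in the training text

for bird in frequent_birds + rare_birds:
    top = fill(f"A {bird} can [MASK].")[0]
    print(f"{bird}: {top['token_str']} (p={top['score']:.2f})")

# If "fly" only tops the list for the frequent birds, the model is retrieving
# memorized co-occurrences rather than inferring over the class of birds.
```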
Follow-up work:
- Negated LAMA: Birds cannot fly (https://arxiv.org/abs/1911.03343)
- BERT is Not a Knowledge Base (Yet): Factual Knowledge vs. Name-Based Reasoning in Unsupervised QA
The paper compares publicly available models as-is, which is in my opinion a bit unfair or misleading: the architectures differ, but so do their training corpora, which might give one architecture an advantage over another simply because it was trained on more data. So we can't really tell how much better one architecture is than the other.
The second issue is the potential overlap/data leak between the training and evaluation data: BERT, for example, is trained on Wikipedia, and the evaluation datasets are also built on top of Wikipedia, so the model may just be good at memorizing training samples rather than generalizing to unseen data (few-shot/zero-shot learning).
The third issue is the very limiting constraint of single-token facts, which I guess gigantic language models might be overkill for, since such facts can be handled by a simple indexing mechanism (e.g., Elasticsearch). It would have been a really interesting insight to see how BERT (and others) performs on multi-token facts.
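A minimal illustration of the constraint, assuming bert-base-uncased: the masked LM head scores exactly one vocabulary token per [MASK], so a multi-token object needs several masks plus some decoding scheme on top.

```python
# Single-token vs. multi-token facts (model choice is an assumption).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Single-token object: the cloze query works directly.
print(fill("The capital of France is [MASK].")[0]["token_str"])  # expect "paris"

# Multi-token object: each [MASK] is scored independently, so recovering
# "new york city" needs three masks plus a decoding scheme (e.g. beam search).
# Recent transformers versions return one candidate list per mask.
for per_mask in fill("The largest city in the USA is [MASK] [MASK] [MASK]."):
    print(per_mask[0]["token_str"])  # top candidate for each mask, independently
```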
It would be interesting to see how REALM (and other knowledge-augmented LMs) behave on this evaluation setup; my hypothesis is that they would outperform BERT by a large margin.
Figure 3 (Pearson correlation between different factors) is really intuitive and interesting. It makes sense that the factors most correlated with precision are the similarity between the subject and object vectors, and BERT's log-prob score for its first prediction (BERT is really confident about its correct predictions).
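A minimal sketch of that kind of correlation analysis on synthetic placeholder data (none of the arrays below are the paper's actual features):

```python
# Pearson correlation between per-fact features and P@1, in the spirit of
# Figure 3. All arrays here are synthetic placeholders, not the paper's data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 1000
p_at_1 = rng.integers(0, 2, n)       # 1 if the top prediction was correct
log_prob = rng.normal(size=n)        # model log-prob of its top prediction
subj_obj_sim = rng.random(n)         # cosine sim of subject/object vectors
subj_mentions = rng.poisson(5, n)    # subject mention count in the corpus

for name, feature in [("log_prob", log_prob),
                      ("subj_obj_sim", subj_obj_sim),
                      ("subj_mentions", subj_mentions)]:
    r, p = pearsonr(feature, p_at_1)
    print(f"{name}: r={r:+.3f} (p={p:.3g})")
```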
BERT (base and large) doesn't perform well on numeric facts, such as the birth-date relation from Google-RE.
I like this paper's analysis; I think it's straightforward. What they're trying to do is this:
The authors do not mention this, but this task can be thought of from two directions:
I think the paper's analysis is more aligned with point 1. The authors state a few constraints they face in their analysis, like the models being trained with different vocabularies and how parameters like beam size can affect model predictions, which is why they try to relax as many of these constraints as possible.
They show that there's a correlation between subject/entity mention counts and Precision@1 for BERT, which can be interpreted as BERT having already seen the given examples before; that makes this more of a retrieval assessment than an inference task.
I'm interested in reading more work that builds on this analysis, and I'm sticking to the assumption that the analysis assesses the retrieval capabilities of pre-trained LMs over their text corpora.
Join us for our discussion of "Language Models as Knowledge Bases?" on Sunday, the 12th of April. Paper link: https://arxiv.org/abs/1909.01066
Hangout: https://hangouts.google.com/group/kUxBAunjGittAkBUA