I found this part of the implementation in amrlib/evaluate/smatch_enhanced.py:
def compute_subscores(pred, gold):
    ...
    # Loop through all entries
    for amr_pred, amr_gold in zip(pred, gold):
        ...
        # Wikification scores
        list_pred = wikification(triples_pred)
        list_gold = wikification(triples_gold)
        inters["Wikification"] += len(list(set(list_pred) & set(list_gold)))
        preds["Wikification"] += len(set(list_pred))
        golds["Wikification"] += len(set(list_gold))
        ...
You can see that inters["Wikification"] is computed as the size of the intersection of the pred set and the gold set.
However, converting the lists to sets cannot handle an AMR in which multiple nodes refer to the same wiki, since those nodes are collapsed and counted as one.
Consider this example. Here is the example file of gold.txt:
( m / multi-sentence
    :snt1 ( p / person
        :wiki "Barack_Obama"
        :name ( n / name
            :op1 "Obama" ) )
    :snt2 ( p2 / person
        :wiki "Barack_Obama"
        :name ( n2 / name
            :op1 "Barack"
            :op2 "Obama" ) ) )
Example sentence: Obama. Barack Obama.
Here is the example file of pred.txt:
( p / person
    :wiki "Barack_Obama"
    :name ( n / name
        :op1 "Obama" ) )
You can see that there are two wikis in gold.txt, but only one in pred.txt. All of them refer to the same person. I scored these files with the compute_scores function, and the result is 1.000 for Wikification recall. I'm not sure whether this is expected, but I think the result should be 0.500, since the prediction contains only one of the two gold wikis.
If this result is unexpected, I think those set operations should be changed to multisets (or frequency tables) so that this case is supported.
Edit: I only showed the Wikification case as an example. The issue applies to all compute_subscores metrics, from Non_sense_frames to Frames.
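To make the suggestion concrete, here is a minimal sketch of multiset counting with collections.Counter. It is not amrlib's code: overlap_counts is a hypothetical helper, and I am assuming wikification() returns a plain list of wiki strings, with one entry per node.

```python
from collections import Counter


def overlap_counts(list_pred, list_gold):
    """Hypothetical helper: (intersection, pred total, gold total) as multisets."""
    pred_counts = Counter(list_pred)
    gold_counts = Counter(list_gold)
    # Counter & Counter keeps the minimum count of each shared item,
    # so duplicates are matched one-to-one instead of being collapsed.
    inter = sum((pred_counts & gold_counts).values())
    return inter, sum(pred_counts.values()), sum(gold_counts.values())


# Assumed wiki lists for the pred.txt / gold.txt example above.
list_pred = ["Barack_Obama"]
list_gold = ["Barack_Obama", "Barack_Obama"]

inter, n_pred, n_gold = overlap_counts(list_pred, list_gold)
multiset_recall = inter / n_gold  # 1 / 2 = 0.5

# Current set-based behaviour, for comparison.
set_recall = len(set(list_pred) & set(list_gold)) / len(set(list_gold))  # 1 / 1 = 1.0

print(f"multiset recall: {multiset_recall:.3f}, set recall: {set_recall:.3f}")
```

With these inputs, the multiset version gives a recall of 0.500, while the set-based version gives 1.000, which matches what I observed.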
I take it the problem is specific to multi-sentence, where there are different nodes for the same entity in different sentences. Within a sentence, a wikified entity should only have one variable.