cu-clear / semlink

Official repository for Semlink resources

Coverage of PB-VN mapping is not a strict subset of VerbNet-derived mapping #7

Open aaronstevenwhite opened 2 years ago

aaronstevenwhite commented 2 years ago

When comparing pb-vn2.json to a roleset-class mapping derived from VerbNet3.4 itself, I find that the PropBank rolesets in the domain of each mapping are not in a subset relation with each other, as might be expected.

To derive the mapping from VerbNet3.4, I use:

from collections import defaultdict
from verbnet import VerbNetParser

verbnet = VerbNetParser(version="3.4")

pb_vn34_map = defaultdict(set)

for cid, clsinfo in verbnet.verb_classes_numerical_dict.items():
    for m in clsinfo.members:
        for pbroleset in m.grouping:
            pb_vn34_map[pbroleset] |= {cid}

pb_vn34_map = dict(pb_vn34_map)

When compared to pb-vn2.json...

import json

with open('semlink/instances/pb-vn2.json') as f:
    semlink_map = json.load(f)

pbset_from_verbnet = set(pb_vn34_map)
pbset_from_semlink = set(semlink_map)

print('In both SemLink and VerbNet:\t', len(pbset_from_semlink & pbset_from_verbnet))
print('In VerbNet but not SemLink:\t', len(pbset_from_verbnet - pbset_from_semlink))
print('In SemLink but not VerbNet:\t', len(pbset_from_semlink - pbset_from_verbnet))

I observe the following counts:

In both SemLink and VerbNet:     1854
In VerbNet but not SemLink:  1360
In SemLink but not VerbNet:  2323

kevincstowe commented 2 years ago

I see, yes, there are a couple of issues here. First, for semlink, we are also using an external file of pb-vn mappings that was generated for a separate project. It doesn't come directly from either resource, which accounts for some of the disjoint rolesets.

The other, more pressing issue is that PB and VN each have their own idea of what they map to. For SemLink, we trust PB: the mappings come from what PB says, and from the external file. It's likely that VN has more valid mappings that we could include. Unfortunately, it's probably also the case that they conflict in some places. We'll have to do a little study to find where VN's mappings to PB conflict with PB's to VN, where the two sets are disjoint, and how we can expand coverage. @ghamzak is this something CU could look into?
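A minimal sketch of the kind of study described here, assuming both sides can be loaded as plain dicts from PB roleset to a collection of VN class IDs. The function name `compare_mappings` and the toy rolesets and class IDs are invented for illustration; they are not taken from SemLink, PropBank, or VerbNet.

```python
def compare_mappings(pb_side, vn_side):
    """Bucket rolesets by how two roleset -> VN-class mappings relate."""
    report = {"only_pb": set(), "only_vn": set(), "agree": set(), "conflict": set()}
    for roleset in set(pb_side) | set(vn_side):
        pb_classes = set(pb_side.get(roleset, ()))
        vn_classes = set(vn_side.get(roleset, ()))
        if not vn_classes:
            # No classes on the VN side (missing key or empty value).
            report["only_pb"].add(roleset)
        elif not pb_classes:
            report["only_vn"].add(roleset)
        elif pb_classes == vn_classes:
            report["agree"].add(roleset)
        else:
            # Both sides map the roleset, but to different class sets.
            report["conflict"].add(roleset)
    return report


# Toy data, invented for illustration only.
pb_side = {"alpha.01": ["1.1"], "beta.01": ["2.1"], "gamma.01": ["3.1"]}
vn_side = {"alpha.01": {"1.1"}, "beta.01": {"2.2"}, "delta.01": {"4.1"}}

report = compare_mappings(pb_side, vn_side)
for bucket, rolesets in report.items():
    print(bucket, sorted(rolesets))
```

The "conflict" bucket is the one that would need manual review; "only_vn" is the candidate pool for expanding coverage.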

For now, I think all I can say is that we trust SemLink (and thus PB) wrt. mappings - anything that looked suspicious was removed in the automated process, and the PB mapping should then be valid.

aaronstevenwhite commented 2 years ago

Thanks for the quick reply. Is there any information on how the PB-VN mapping linked above was generated? We are using these mappings in an analysis for a paper, and while it's straightforward to just point to VN3.4 for the mappings that can be derived from it, I'm a bit worried about using the above without being able to cite their provenance. I'm assuming they're not strictly from PB, since PB only contains PB-VN3.2 mappings and some of the classes have been renamed or split in VN3.4 (the original reason I contacted @ghamzak back in March: I had extracted PB-VN3.2 from PB, and was looking for a mapping from VN3.2 to VN3.4 to compose with it).
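For illustration, composing a PB-VN3.2 mapping with a VN3.2-to-VN3.4 mapping could look like the sketch below. It assumes the class-level mapping is one-to-many (to allow for split classes) and records any VN3.2 class with no known VN3.4 target instead of guessing. The function name `compose` and all rolesets and class IDs are hypothetical; no such VN3.2-to-VN3.4 table is provided in this thread.

```python
def compose(pb_to_vn32, vn32_to_vn34):
    """Compose roleset -> VN3.2 classes with VN3.2 class -> VN3.4 classes."""
    composed, unmapped = {}, {}
    for roleset, old_classes in pb_to_vn32.items():
        new_classes = set()
        for old in old_classes:
            targets = vn32_to_vn34.get(old)
            if targets is None:
                # No known VN3.4 counterpart: keep a record for manual review.
                unmapped.setdefault(roleset, set()).add(old)
            else:
                new_classes |= set(targets)
        composed[roleset] = new_classes
    return composed, unmapped


# Toy data, invented for illustration only.
pb_to_vn32 = {"alpha.01": ["10.1"], "beta.01": ["20.1", "30.1"]}
vn32_to_vn34 = {"10.1": ["10.1"], "20.1": ["20.1-1", "20.2"]}

composed, unmapped = compose(pb_to_vn32, vn32_to_vn34)
print(composed)
print(unmapped)
```

Keeping `unmapped` separate makes it explicit which predicates still depend on hand correction.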

MarthaSPalmer commented 2 years ago

We finished a manual update of all the PB-VN3.4 mappings about 2 years ago and are in the process of incorporating it into a planned new release of PB which is taking quite a bit pinger than we had anticipated. I believe that is where those mappings came from.

Martha


aaronstevenwhite commented 2 years ago

Thanks for the quick reply, @MarthaSPalmer.

MarthaSPalmer commented 2 years ago

pinger -> longer!

Whew - maybe too fast!

Martha


aaronstevenwhite commented 2 years ago

Even with the updated mapping, I am running into mismatches. I believe it should be the case that if pb-vn2.json maps a PB roleset to a VN class, that VN class is guaranteed to be in VN3.4, but this does not appear to be the case. For instance, if I compute the set of VN classes from VerbNet and the set of VN classes that at least one PB roleset maps to in semlink as follows...

import json

from verbnet import VerbNetParser

verbnet = VerbNetParser(version="3.4")

with open('semlink/instances/pb-vn2.json') as f:
    semlink = json.load(f)

verbnet_classes = set(verbnet.verb_classes_numerical_dict)
semlink_classes = {vncls for pbroleset, vnclasses in semlink.items() for vncls in vnclasses}

...and then calculate len(semlink_classes - verbnet_classes), I get 52 classes that are mapped to in semlink but are not found in VN3.4. I've included a list below.

31.3-3
39.1-2
72
9.3-2
28
31.3-1
90-1
47.1-1-1
29.5-1-2
45.6
31.3-7
13.2-1-1-1
13.3-1
92.1
100
36.1
107
27
51.1-2
23.1-2
31.3-2
31.3-8
49
31.3-9
9.3-2-1
39.1-3
40.3.1-1
105
26.7-1-1
10.6
26.2
13.6
37.7-2
26.7-2-1
61
31.4-3
59
39.4-1
64
39.3-2
47.5.1-2-1
37.4
39.2-1
51.1-3
39.2-2
31.3-6
39.4-2
22.4-1
33
95
31.3-5
9.2-1

At least some of these (e.g. 10.6, 72, 105), and maybe all of them, are classes and subclasses that are only found in VN3.2. Indeed, in the analysis I originally wanted this for, these were exactly the mismatching classes that triggered my initial request for a VN3.2-to-VN3.4 mapping. At that point, I just went through and hand-corrected the mappings on a by-predicate basis as best I could, but it would be really nice to have a canonical mapping that targets only VN3.4, since my hand-corrected mapping could be wrong in places.

aaronstevenwhite commented 2 years ago

It's my understanding that external_vn2pb.json should have as keys only VN3.4 classes. This does not appear to be the case. When I load that mapping and compare it to the list of classes extracted directly from VN3.4 as in my previous post...

with open('semlink/other_resources/external_vn2pb.json') as f:
    external_vn2pb = json.load(f)

# get the numeric identifier for each class
external_vn2pb_classes = {'-'.join(c.split('-')[1:]) for c in external_vn2pb} - 

external_vn2pb_classes - verbnet_classes

I get 12 classes found in external_vn2pb.json but not found in VN3.4, all of which are subclasses.

39.1-3
39.1-2
30.1-1-1-1
13.3-1
39.2-1
31.4-3
23.1-2
39.2-2
22.4-1
39.3-2
9.2-1
47.5.1-2-1

These all appear to be instances where the base class exists in VN3.4 but the subclass doesn't. Maybe these were cases for which an early version of VN3.4 subclassed an existing class but where that subclass was deleted or promoted to its own class?
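One way to check the "base class exists but the subclass doesn't" observation is to strip trailing subclass components until a known class is found. This is a sketch that assumes subclass IDs extend their parent's ID with a "-N" suffix; `nearest_existing_ancestor` is a hypothetical helper, and `known_classes` below is a toy stand-in for the `verbnet_classes` set computed earlier.

```python
def nearest_existing_ancestor(class_id, known_classes):
    """Return the closest ancestor of class_id found in known_classes, or None."""
    parts = class_id.split("-")
    # Drop one trailing "-N" component at a time: "47.5.1-2-1" -> "47.5.1-2" -> "47.5.1".
    for end in range(len(parts) - 1, 0, -1):
        candidate = "-".join(parts[:end])
        if candidate in known_classes:
            return candidate
    return None


# Toy stand-in for the set of VN3.4 class IDs (illustrative only).
known_classes = {"39.1", "47.5.1"}

for missing in ["39.1-3", "47.5.1-2-1", "72"]:
    print(missing, "->", nearest_existing_ancestor(missing, known_classes))
```

A `None` result marks a class with no surviving ancestor, which would need a different explanation than a deleted or promoted subclass.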

The above would explain some of the mismatches mentioned in the above post, but there still remain 40 classes in pb-vn2.json that are not explained by this mismatch.

chaitanyamalaviya commented 2 years ago

Hi, I'm running into the same issues as highlighted above, i.e., classes linked to in pb-vn2.json don't exist in VerbNet. Would appreciate a response. Thanks!

kevincstowe commented 2 years ago

Sorry for the delay, but I'm looking into it now. One thing is that the current version is based on VN3.3 rather than 3.4. I don't know if that accounts for all of the mismatches, though. I can say that external_vn2pb.json was built separately, and isn't even linked to 3.3, but the incorrect classes should be filtered out when semlink is generated. I'll update when I've found out more.

UPDATE: It's pulling 3.4, so that shouldn't be an issue. But it looks like there was a bug where it wasn't correctly filtering/updating incorrect PB mappings. Fixing and rerunning now.

UPDATE: It appears that was, in fact, the issue. I implemented your test, @aaronstevenwhite, and it now returns 0. The test is now included so we can check whether these errors pop up in the future. Note that external_vn2pb.json is still NOT current: it's an older, manually created file, and only supplements these resources. The pb2vn system takes this file as additional information, corrects it where possible, uses it where correct, and removes it where incorrect (as it does with the PropBank frame file mappings).
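The correct/use/remove behaviour and the regression check described above can be sketched roughly as follows. This is not the actual pb2vn code: `clean_mapping`, `stale_classes`, the `corrections` table, and the toy data are all assumptions made for illustration.

```python
def clean_mapping(mapping, valid_classes, corrections=None):
    """Correct class IDs where a correction is known, keep valid ones, drop the rest."""
    corrections = corrections or {}
    cleaned = {}
    for roleset, classes in mapping.items():
        kept = set()
        for cls in classes:
            cls = corrections.get(cls, cls)  # correct where possible
            if cls in valid_classes:         # use where correct
                kept.add(cls)                # anything else is removed
        if kept:
            cleaned[roleset] = sorted(kept)
    return cleaned


def stale_classes(mapping, valid_classes):
    """Classes that are mapped to but absent from valid_classes (should be empty)."""
    mapped = {cls for classes in mapping.values() for cls in classes}
    return mapped - set(valid_classes)


# Toy data, invented for illustration only.
valid = {"1.1", "2.1"}
raw = {"alpha.01": ["1.1"], "beta.01": ["2.0"], "gamma.01": ["9.9"]}
corrections = {"2.0": "2.1"}

cleaned = clean_mapping(raw, valid, corrections)
print(cleaned)
print(len(stale_classes(cleaned, valid)))
```

Run against the real files, `stale_classes` is the same set difference as `semlink_classes - verbnet_classes` earlier in the thread, so a non-empty result would flag a regression.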