IAHLT / UD_Hebrew

Hebrew Universal Dependencies Treebank

Cross-sentence dependencies #30

Open Hilla-Merhav opened 3 years ago

Hilla-Merhav commented 3 years ago

@amir-zeldes

I wonder what we should do when dependencies cross sentence boundaries. [image]

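In my batch, the last two sentences appear separately:

ניסיון לקבלת דבר במירמה – עבירה לפי סעיף 415 לחוק העונשין, יחד עם סעיף 25 לחוק – בשל ניסיון שעשה להירשם, במירמה, כמייצג מורשה לצורך קבלת יתרונות משלטונות המס.
("Attempt to obtain something by deceit, an offense under section 415 of the Penal Law, together with section 25 of the Law, due to an attempt he made to register, fraudulently, as an authorized representative in order to obtain benefits from the tax authorities.")

And:

ייצוג שלא כדין – עבירה לפי סעיף 216ג + 236(3) לפקודה - משום שייצג נישום בדיון עם מפקח של מס הכנסה.
("Unlawful representation, an offense under sections 216C + 236(3) of the Ordinance, because he represented a taxpayer in a hearing with an income tax inspector.")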

As I see it, 2 (ניסיון) should be an obl governed by 1 (מצא), and 3 (ייצג) should be an advcl governed by 1 (מצא). But since these sentences are separated from the main clause, I can't represent the real dependencies. I wonder whether I should mark these sentences as corrupt, because IMO no internal dependency within these sentences seems "correct". What do you think?

amir-zeldes commented 3 years ago

Many treebanks force a sentence split due to block elements like paragraphs or bullet points. If they do, you have two options:

Or you can make it all into one giant sentence... But it probably won't be the last case where you are 'missing' some elliptical verb which appears to dominate the overt arguments.

Hilla-Merhav commented 3 years ago

@amir-zeldes

Which of these options do you recommend? Also, I sometimes run into a very similar problem when the first part of a sentence is cut off at an inconvenient point. An example from my current batch:

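כנקודת מוצא, עם כניסתו של החוק לתוקף, נקבע שסל הבריאות הבסיסי יכלול את:
("As a starting point, when the law came into force, it was determined that the basic health basket would include [accusative marker 'et']:")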

(Below this there is a bulleted list. I also received the bullets and analyzed them, but each one comes as a separate sentence.) Since 'et' is supposed to attach to the object as case, and the object is not in this sentence, I wonder how I should deal with it.

amir-zeldes commented 3 years ago

Probably option 1 is less avant-garde, so maybe go with that. In that case the "et" example would undergo normal promotion if the sentence is split (so "et" itself would become obj).
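To make the promotion concrete, here is a minimal sketch of how the end of the split sentence might look in CoNLL-U, with "et" promoted to obj because its nominal is in a different sentence. The fragment is hypothetical and simplified: it contains only the last three tokens, the IDs are illustrative, lemmas and features are left as `_`, and the verb is shown as root only so the fragment stands alone (in the full sentence it would attach higher up). The helper names are made up for this example.

```python
# Hypothetical, simplified CoNLL-U fragment (last three tokens only).
# The verb is root here only to keep the fragment self-contained.
FRAGMENT = (
    "1\tיכלול\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tאת\t_\tADP\t_\t_\t1\tobj\t_\t_\n"
    "3\t:\t_\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)


def parse_tokens(conllu):
    """Parse CoNLL-U token lines into dicts with the columns we need."""
    rows = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        rows.append({
            "id": cols[0],
            "form": cols[1],
            "upos": cols[3],
            "head": cols[6],
            "deprel": cols[7],
        })
    return rows


def promoted_to_obj(rows):
    """Return ADP tokens that carry obj, i.e. case markers promoted
    because the nominal they would mark is missing from the sentence."""
    return [r for r in rows if r["upos"] == "ADP" and r["deprel"] == "obj"]


rows = parse_tokens(FRAGMENT)
for r in promoted_to_obj(rows):
    print(r["id"], r["form"], "-> head", r["head"], r["deprel"])
```

A filter like this could also be used to list promoted case markers across a batch for later review, should the split sentences ever be merged.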

Hilla-Merhav commented 3 years ago

@amir-zeldes If the parser mistakenly took the PUNCT from the end of the previous sentence, so that it now opens the current sentence, is this something we want to report, to prevent such segmentation errors from recurring? I assume that if the segmentation is incorrect we want to report it, but I wonder how (as corrupt?). How can we improve sentence segmentation performance with our data? Or can we just mark it as a reparandum and analyze the sentence as usual?

amir-zeldes commented 3 years ago

No, sentence segmentation errors are not reparandum, they should be fixed properly. The current sentence segmentation is based on training from HTB, where almost all sentences end in an easily identifiable period. To improve segmentation on web data or other domains, we would need to retrain the sentence splitter.

If @ivrit or @yifatbm have a clean subset of data for this purpose, I can retrain the sentencer, or other components as well (keeping in mind that HTB is still in the training set, so I can only retrain, for example, the parser if any new labels in the new data have already been retrofitted into the IAHLT HTB).
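For finding the leading-PUNCT cases mentioned above so they can be fixed or collected, a rough sketch like the following could scan a CoNLL-U file and list sentences whose first token is sentence-final punctuation. This is not part of any existing IAHLT tooling. The function names, the `CLOSING_PUNCT` set, and the sample data (placeholder forms `A` and `B`) are all assumptions made for illustration.

```python
# Assumption: these forms at the start of a sentence suggest that the
# final punctuation of the previous sentence was attached to this one.
CLOSING_PUNCT = {".", "!", "?", ":", ";"}


def sentences(conllu_text):
    """Yield (sent_id, tokens) per sentence; tokens are (form, upos) pairs."""
    sent_id, tokens = None, []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                yield sent_id, tokens
            sent_id, tokens = None, []
        elif line.startswith("# sent_id ="):
            sent_id = line.split("=", 1)[1].strip()
        elif line.startswith("#"):
            continue
        else:
            cols = line.split("\t")
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            if "-" in cols[0] or "." in cols[0]:
                continue
            tokens.append((cols[1], cols[3]))
    if tokens:
        yield sent_id, tokens


def leading_punct_sentences(conllu_text):
    """Return sent_ids of sentences that open with closing punctuation."""
    return [
        sid
        for sid, toks in sentences(conllu_text)
        if toks[0][1] == "PUNCT" and toks[0][0] in CLOSING_PUNCT
    ]


# Made-up sample: the second sentence opens with a stray period.
SAMPLE = (
    "# sent_id = s1\n"
    "1\tA\t_\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = s2\n"
    "1\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "2\tB\t_\tNOUN\t_\t_\t0\troot\t_\t_\n"
)

print(leading_punct_sentences(SAMPLE))  # ['s2']
```

The flagged sentence IDs could then be corrected by hand, and the corrected sentences might serve as part of a clean subset for retraining the splitter.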