Open Hilla-Merhav opened 3 years ago
Many TBs force a sentence split due to block elements like paragraphs or bullet points. If they do, you have two options: `orphan`, basically saying that there is an elliptical copy of the previous sentence's main verb in each of the bullet point sentences. Or you can make it all into one giant sentence... But it probably won't be the last case where you are 'missing' some elliptical verb which appears to dominate the overt arguments.
@amir-zeldes
Which of these options do you recommend? Also, I sometimes meet a very similar challenge when the first part of a sentence is cut off inconveniently; an example from my current batch:
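כנקודת מוצא, עם כניסתו של החוק לתוקף, נקבע שסל הבריאות הבסיסי יכלול את:
(English: "As a starting point, when the law came into force, it was determined that the basic health basket would include [et]:". Below this there is a list of bullets. I also got the bullets and analyzed them, but each one comes as a separate sentence.)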
Since 'et' is supposed to be governed by the `obj` through `case`, I wonder how I should deal with it.
Option 1 is probably less avant-garde, so maybe go with that. In that case, the "et" example would undergo normal promotion if the sentence is split (so "et" itself would become `obj`).
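To make the promotion concrete, here is a minimal Python sketch, not part of any existing tool. It assumes tokens are plain dicts with hypothetical keys `id`, `form`, `head` and `deprel`, and that the missing list item is represented by a placeholder token `<ITEM>` in the unsplit analysis. The priority list follows the order the UD ellipsis guidelines give for nominals; the three-token fragment is a simplification of the real sentence.

```python
# Promotion priority inside a nominal, following the UD ellipsis guidelines.
NOMINAL_PROMOTION_ORDER = ["amod", "nummod", "det", "nmod", "case"]


def promote(tokens, elided_id):
    """Drop the elided token and promote one of its dependents in its place."""
    elided = next(t for t in tokens if t["id"] == elided_id)
    # Work on copies so the caller's analysis is left untouched.
    remaining = [dict(t) for t in tokens if t["id"] != elided_id]
    dependents = [t for t in remaining if t["head"] == elided_id]
    if not dependents:
        return remaining

    def rank(tok):
        rel = tok["deprel"]
        if rel in NOMINAL_PROMOTION_ORDER:
            return NOMINAL_PROMOTION_ORDER.index(rel)
        return len(NOMINAL_PROMOTION_ORDER)

    # The highest-priority dependent inherits the elided token's head and relation.
    promoted = min(dependents, key=rank)
    promoted["head"], promoted["deprel"] = elided["head"], elided["deprel"]
    # Any other dependents now attach to the promoted token.
    for tok in dependents:
        if tok is not promoted:
            tok["head"] = promoted["id"]
    return remaining


# Simplified fragment: the verb, "et", and a placeholder for the list item
# that ends up in a separate sentence after the split.
unsplit = [
    {"id": 1, "form": "יכלול", "head": 0, "deprel": "root"},
    {"id": 2, "form": "את", "head": 3, "deprel": "case"},
    {"id": 3, "form": "<ITEM>", "head": 1, "deprel": "obj"},
]

for tok in promote(unsplit, 3):
    print(tok["id"], tok["form"], tok["head"], tok["deprel"])
```

With the placeholder removed, "et" is the only dependent left, so it takes over the `obj` relation to the verb, which is the analysis suggested above for the split sentence.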
@amir-zeldes If the parser mistakenly took the PUNCT from the end of the previous sentence, so that it opens the current sentence, is this something we want to report to prevent such segmentation errors from recurring? I assume that if the segmentation is incorrect we want to report it, but I wonder how to report it (corrupt?). How can we improve cross-sentence segmentation performance with our data? Or can we just mark it as a reparandum and analyze the sentence as usual?
No, sentence segmentation errors are not reparandum, they should be fixed properly. The current sentence segmentation is based on training from HTB, where almost all sentences end in an easily identifiable period. To improve segmentation on web data or other domains, we would need to retrain the sentence splitter.
If @ivrit or @yifatbm have a clean subset of data for this purpose, I can retrain the sentencer, or other components as well (keeping in mind that HTB is still in the train set, so I can only retrain, for example, the parser if any new labels in the new data have already been retrofitted into the IAHLT HTB).
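Until the splitter is retrained, sentences that open with a stray sentence-final mark could be collected for manual correction with something like the following sketch. It assumes the batch is available as standard CoNLL-U text (tab-separated columns, `# sent_id = ...` comments); the function name and the set of "suspicious" forms are illustrative choices, not part of any existing pipeline.

```python
# Forms that normally close a sentence; seeing one as the first word of a
# sentence suggests it was split off from the previous sentence by mistake.
SUSPICIOUS_OPENERS = {".", "!", "?", ":", ";", ","}


def find_punct_initial_sentences(conllu_text):
    """Return sent_ids of sentences whose first word is a sentence-final PUNCT."""
    flagged = []
    sent_id = None
    first_word_seen = False
    for raw in conllu_text.splitlines():
        line = raw.rstrip()
        if not line:
            # A blank line ends the current sentence.
            sent_id, first_word_seen = None, False
            continue
        if line.startswith("#"):
            if line.startswith("# sent_id"):
                sent_id = line.split("=", 1)[1].strip()
            continue
        if first_word_seen:
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        first_word_seen = True
        # Column 2 is FORM and column 4 is UPOS in CoNLL-U.
        if cols[3] == "PUNCT" and cols[1] in SUSPICIOUS_OPENERS:
            flagged.append(sent_id)
    return flagged
```

The result is only a list of candidates: each flagged sentence still needs to be checked and re-segmented by hand, and segmentation errors that do not leave a punctuation mark at the start of a sentence will not be caught.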
@amir-zeldes
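I wonder what we do in cases where the dependencies cross sentence boundaries.
The last two sentences in my batch come as separate sentences: ניסיון לקבלת דבר במירמה – עבירה לפי סעיף 415 לחוק העונשין, יחד עם סעיף 25 לחוק – בשל ניסיון שעשה להירשם, במירמה, כמייצג מורשה לצורך קבלת יתרונות משלטונות המס. And: ייצוג שלא כדין – עבירה לפי סעיף 216ג + 236(3) לפקודה - משום שייצג נישום בדיון עם מפקח של מס הכנסה.
(English: "Attempt to obtain something by fraud, an offense under section 415 of the Penal Law together with section 25 of the law, due to an attempt he made to register, fraudulently, as an authorized representative in order to obtain benefits from the tax authorities." And: "Unlawful representation, an offense under sections 216C + 236(3) of the Ordinance, because he represented a taxpayer in a hearing with an income tax inspector.")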
As I see it, 2 (ניסיון) should be `obl` governed by 1 (מצא), and 3 (ייצג) should be `advcl` governed by 1 (מצא). But since these sentences are separated from the main clause, I can't represent the real dependencies. I wonder if I should mark these sentences as corrupt, because IMO there is no internal dependency within these sentences that seems "correct". What do you think?