IAHLT / UD_Hebrew

Hebrew Universal Dependencies Treebank

The analysis of הַכֹּל #22

Open: Hilla-Merhav opened this issue 3 years ago

Hilla-Merhav commented 3 years ago

@amir-zeldes

I'm working on a list of things we intend to fix in the HTB, so that later this year, when we reach the stage of amending the HTB, we'll only have to work through this list. I think the inconsistent analysis of the definite "כֹּל" is one of the things we should fix. I mean the cases of nominal כֹּל (not כל in det position), for example: הכֹּל בסדר, ידו בַּכֹּל, אוכלי כֹּל. In the HTB, when "כל" (or "כול") is definite but does not follow an ADP, it is analyzed as a single unsegmented token, "הכל".

With this analysis, not only do we lose the definiteness information (which feels especially necessary when כל is an obj, as in "אתן את הכל"), but we also create an inconsistency with occurrences in which the definite article is fused with an ADP. In the HTB we have: התקשורת אשמה בַּ+כֹּל, מעל לַ+כֹּל. The same can recur in expressions such as ידו בַּ+כֹּל. In these cases the definiteness, which is reported through the ADP features, is detached from the noun, unlike the cases where there is no ADP.

This analysis also creates an inconsistency with occurrences in which there is no definite article at all. In the HTB we have: קודם כֹּל, יותר מִ+כֹּל. The same can recur in expressions such as: שועלים הם אוכלי כֹּל, מְסַפֵּר יודע כֹּל.

Would you agree to my adding this to the "amending the HTB" list I mentioned, so that we (the team) can correct and segment the occurrences of unsegmented "הכל" in the HTB?

amir-zeldes commented 3 years ago

Hm, this is a pretty radical change, but I can't fault your logic at all, and as long as we fix HTB I guess that should be fine. You have my support! BTW this kind of deterministic splitting can probably be done fully automatically (I mean targeting the single token "ha-kol", splitting it automatically, making part 2 the head with whatever deprel it already had, then adding the det and definiteness info). There is no human decision needed here IMO.
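
For illustration, the deterministic split described in this comment could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the command that was eventually used on the treebank: the function name `split_hakol`, the set of target forms, the lemma "כל" for the second part, and the placement of `Definite=Def|PronType=Art` on the article are all assumptions made for the example. It works on one CoNLL-U sentence, assumes no empty nodes (IDs like `3.1`), and does not remap the DEPS column or move MISC values such as `SpaceAfter=No`.

```python
# Hypothetical sketch: split an unsegmented "ha-kol" token in a CoNLL-U sentence.
TARGET_FORMS = {"הכל", "הכול"}  # assumed surface forms of the unsegmented token


def split_hakol(sentence: str) -> str:
    """Return the sentence with every target token split into article + noun."""
    lines = sentence.strip().split("\n")
    comments = [l for l in lines if l.startswith("#")]
    rows = [l.split("\t") for l in lines if l and not l.startswith("#")]

    # Word IDs already covered by an existing multiword-token range (e.g. "1-2").
    covered = set()
    for r in rows:
        if "-" in r[0]:
            a, b = map(int, r[0].split("-"))
            covered.update(range(a, b + 1))

    split_ids = {int(r[0]) for r in rows
                 if r[0].isdigit() and r[1] in TARGET_FORMS}

    def new_id(old: int) -> int:
        # Every split at or before `old` shifts it one position to the right,
        # so a split token's old ID maps to the ID of its second part.
        return old + sum(1 for s in split_ids if s <= old)

    out = list(comments)
    for r in rows:
        if "-" in r[0]:
            # Existing range line: widen or shift it to match the new IDs.
            a, b = map(int, r[0].split("-"))
            start = new_id(a) - (1 if a in split_ids else 0)
            out.append("\t".join([f"{start}-{new_id(b)}"] + r[1:]))
            continue
        old = int(r[0])
        if r[6].isdigit() and r[6] != "0":
            r[6] = str(new_id(int(r[6])))  # remap the HEAD column
        if old in split_ids:
            noun_id = new_id(old)
            art_id = noun_id - 1
            if old not in covered:
                # New multiword-token line keeps the original surface form.
                out.append("\t".join([f"{art_id}-{noun_id}", r[1]] + ["_"] * 8))
            # Part 1: the article, attached to part 2 as det.
            out.append("\t".join([str(art_id), "ה", "ה", "DET", "_",
                                  "Definite=Def|PronType=Art",
                                  str(noun_id), "det", "_", "_"]))
            # Part 2: keeps the original POS, features, head and deprel.
            out.append("\t".join([str(noun_id), r[1][1:], "כל"] + r[3:]))
        else:
            out.append("\t".join([str(new_id(old))] + r[1:]))
    return "\n".join(out)


# Toy two-token sentence (made up for the example, not taken from the HTB).
example = ("1\tהכל\tהכל\tNOUN\tNOUN\t_\t2\tnsubj\t_\t_\n"
           "2\tטוב\tטוב\tADJ\tADJ\t_\t0\troot\t_\t_")
print(split_hakol(example))
```

Because part 2 inherits the original head and deprel and every later ID and HEAD value is shifted by a fixed offset, no per-case human decision is involved, which is the point made in the comment above.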

Hilla-Merhav commented 2 years ago

this kind of deterministic splitting can probably be done fully automatically

Nick wrote a command to auto-correct this, as you suggested, and it works! :)

amir-zeldes commented 2 years ago

Wonderful! Can he apply it to the current IAHLT HTB dev branch as well and either push or PR to dev if he doesn't have push access?

Hilla-Merhav commented 2 years ago

Nick said he can do this, but he doesn't know which repository you are referring to; he is asking what he should clone.

amir-zeldes commented 2 years ago

This very repository ^

:)