apertium / apertium-uzb

Apertium linguistic data for Uzbek
GNU General Public License v3.0

Lexc improvements #11

Closed elmurod1202 closed 3 years ago

jonorthwash commented 4 years ago

The new names seem to have extra spaces. Did you check that it compiles before committing?

elmurod1202 commented 4 years ago

> The new names seem to have extra spaces. Did you check that it compiles before committing?

Oh, my bad, just fixed it. I was quick to push changes as soon as it compiled without an error, not noticing the spaces.

jonorthwash commented 4 years ago

I'm curious about these forms; what are they?

{+ad:ad NP-AL ; !+}
{+al:al NP-AL ; !+}
{+am:am NP-AL ; !+}
{+bei:bei NP-AL ; !+}
{+ibn:ibn NP-AL ; !+}
{+la:la NP-AL ; !+}
{+le:le NP-AL ; !+}
{+les:les NP-AL ; !+}
{+lès:lès NP-AL ; !+}

jonorthwash commented 4 years ago

> I'm curious about these forms; what are they?
>
> {+ad:ad NP-AL ; !+}
> {+al:al NP-AL ; !+}
> {+am:am NP-AL ; !+}
> {+bei:bei NP-AL ; !+}
> {+ibn:ibn NP-AL ; !+}
> {+la:la NP-AL ; !+}
> {+le:le NP-AL ; !+}
> {+les:les NP-AL ; !+}
> {+lès:lès NP-AL ; !+}

Oh, never mind, these are forms you deleted, right?

jonorthwash commented 4 years ago

Why did you move to U+02BC? Uzbek Wikipedia uses U+02BB, which I thought was the standard? (I agree that the former looks better and is more appropriate, but it appears not to be what's used.)

jonorthwash commented 4 years ago

The roman numerals should probably have their own lexicon (pointed to directly from Root) and not be mixed into the main lexicon.

jonorthwash commented 4 years ago

Forms like this should be able to be generated automatically:

Chernishov:Chernishov  NP-COG-MF ; ! "" ! El++
Chernishova:Chernishova  NP-COG-MF ; ! "" ! El++

Forms like this appear to be patronymics, which can also be generated for <m> and <f> forms automatically:

Cholakovich:Cholakovich  NP-COG-MF ; ! "" ! El++

elmurod1202 commented 4 years ago

> I'm curious about these forms; what are they?
>
> {+ad:ad NP-AL ; !+}
> {+al:al NP-AL ; !+}
> {+am:am NP-AL ; !+}
> {+bei:bei NP-AL ; !+}
> {+ibn:ibn NP-AL ; !+}
> {+la:la NP-AL ; !+}
> {+le:le NP-AL ; !+}
> {+les:les NP-AL ; !+}
> {+lès:lès NP-AL ; !+}

These would be parts of NP-ORGS, no? I added them because, first of all, they exist in turkish.lexc (I always treat tur.lexc as an ideal package and try to replicate what's there), and second, they help improve coverage. Was that wrong? (If so, I'll fix them in both Turkish and Uzbek.)

elmurod1202 commented 4 years ago

> Why did you move to U+02BC? Uzbek Wikipedia uses U+02BB, which I thought was the standard? (I agree that the former looks better and is more appropriate, but it appears not to be what's used.)

The shortest answer is: I didn't. Explanation: the apostrophe in <oʻ> and <gʻ> is U+02BB, and I converted all of those to that. There is another apostrophe in the alphabet, the "tutuq belgisi" (a phonetic glottal stop, <ъ> in Cyrillic script), and it's U+02BC (e.g. aʼlo, maʼno, eʼzoz). I fixed those as well.

elmurod1202 commented 4 years ago

> The roman numerals should probably have their own lexicon (pointed to directly from Root) and not be mixed into the main lexicon.

This way of declaring roman numerals was also copied from apertium-tur. Now I see what you (that was you, wasn't it?) did in apertium-kaz/tat to deal with it (<(M | D | C | L | X | V | I)+> NUM-ROMAN ;). That looks more appropriate; I'll fix both tur and uzb if you confirm.

jonorthwash commented 4 years ago

> > I'm curious about these forms; what are they?
> >
> > {+ad:ad NP-AL ; !+}
> > {+al:al NP-AL ; !+}
> > {+am:am NP-AL ; !+}
> > {+bei:bei NP-AL ; !+}
> > {+ibn:ibn NP-AL ; !+}
> > {+la:la NP-AL ; !+}
> > {+le:le NP-AL ; !+}
> > {+les:les NP-AL ; !+}
> > {+lès:lès NP-AL ; !+}
>
> These would be parts of NP-ORGS, no? I added them because, first of all, they exist in turkish.lexc (I always treat tur.lexc as an ideal package and try to replicate what's there), and second, they help improve coverage. Was that wrong? (If so, I'll fix them in both Turkish and Uzbek.)

  1. I'm not sure it's a good idea to always copy other Apertium pairs. tur.lexc is generally in better shape than uzb.lexc, but it's far from perfect.

  2. I'm not sure if we need these forms, but it might not hurt. I don't know. Let's not worry about it now, maybe...

jonorthwash commented 4 years ago

> > Why did you move to U+02BC? Uzbek Wikipedia uses U+02BB, which I thought was the standard? (I agree that the former looks better and is more appropriate, but it appears not to be what's used.)
>
> The shortest answer is: I didn't. Explanation: the apostrophe in <oʻ> and <gʻ> is U+02BB, and I converted all of those to that. There is another apostrophe in the alphabet, the "tutuq belgisi" (a phonetic glottal stop, <ъ> in Cyrillic script), and it's U+02BC (e.g. aʼlo, maʼno, eʼzoz). I fixed those as well.

Oh! I didn't know these two were meant to be encoded differently! So, how do we know what the right encoding of them is?

jonorthwash commented 4 years ago

> > The roman numerals should probably have their own lexicon (pointed to directly from Root) and not be mixed into the main lexicon.
>
> This way of declaring roman numerals was also copied from apertium-tur. Now I see what you (that was you, wasn't it?) did in apertium-kaz/tat to deal with it (<(M | D | C | L | X | V | I)+> NUM-ROMAN ;). That looks more appropriate; I'll fix both tur and uzb if you confirm.

That was probably @IlnarSelimcan, and I assume that's right and should be okay to copy.

elmurod1202 commented 4 years ago

> > > Why did you move to U+02BC? Uzbek Wikipedia uses U+02BB, which I thought was the standard? (I agree that the former looks better and is more appropriate, but it appears not to be what's used.)
> >
> > The shortest answer is: I didn't. Explanation: the apostrophe in <oʻ> and <gʻ> is U+02BB, and I converted all of those to that. There is another apostrophe in the alphabet, the "tutuq belgisi" (a phonetic glottal stop, <ъ> in Cyrillic script), and it's U+02BC (e.g. aʼlo, maʼno, eʼzoz). I fixed those as well.
>
> Oh! I didn't know these two were meant to be encoded differently! So, how do we know what the right encoding of them is?

The wiki page on "Uzbek alphabet" says that U+02BB is the apostrophe used for <oʻ> and <gʻ>, while U+02BC is used for the "tutuq belgisi" (aka glottal stop?). I fixed all apostrophes in uzb.lexc in my branch to what they are supposed to be. One problem remains: because there is no proper keyboard layout for the Uzbek alphabet, these apostrophes appear in a variety of forms in real text. That still has to be solved (Issue #2).
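
To make the two-apostrophe convention concrete, here is a minimal Python sketch of a normalizer. It is not code from this repository: the function name, the list of look-alike characters, and the rule itself (any apostrophe-like character directly after o or g becomes U+02BB, every other one becomes U+02BC) are assumptions for illustration. The rule is only a heuristic; it will mislabel ordinary quotation marks and any word where a glottal stop happens to follow o or g.

```python
import re

MODIFIER_TURNED_COMMA = "\u02bb"  # used in the letters o+U+02BB and g+U+02BB
MODIFIER_APOSTROPHE = "\u02bc"    # tutuq belgisi (glottal stop)

# Assumed set of look-alike characters found in typed text:
# ASCII apostrophe, backtick, U+2018, U+2019, plus the two target characters.
_VARIANT = re.compile("['`\u2018\u2019\u02bb\u02bc]")


def normalize_apostrophes(text: str) -> str:
    """Map every apostrophe variant to U+02BB or U+02BC (heuristic)."""

    def fix(match: re.Match) -> str:
        # Look at the character immediately before the apostrophe variant.
        prev = text[match.start() - 1] if match.start() > 0 else ""
        if prev in "oOgG":
            return MODIFIER_TURNED_COMMA
        return MODIFIER_APOSTROPHE

    return _VARIANT.sub(fix, text)


if __name__ == "__main__":
    for word in ["o'zbek", "g`isht", "ma\u2019no"]:
        print(word, "->", normalize_apostrophes(word))
```

A pass like this would presumably run over input text before analysis rather than over uzb.lexc itself, which is the kind of fix Issue #2 asks for.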

ftyers commented 4 years ago

> > > The roman numerals should probably have their own lexicon (pointed to directly from Root) and not be mixed into the main lexicon.
> >
> > This way of declaring roman numerals was also copied from apertium-tur. Now I see what you (that was you, wasn't it?) did in apertium-kaz/tat to deal with it (<(M | D | C | L | X | V | I)+> NUM-ROMAN ;). That looks more appropriate; I'll fix both tur and uzb if you confirm.
>
> That was probably @IlnarSelimcan, and I assume that's right and should be okay to copy.

This might be problematic for the "entry beginning with whitespace" bug. You might want to use [ ... ] instead of ( ... ).
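
The difference between the two bracket styles can be sketched in Python, on the assumption that the regular-expression syntax inside < ... > follows the Xerox convention, where ( ... ) marks an optional expression and [ ... ] only groups. Under that reading, an optional group repeated with + also accepts the empty string, while the bracketed version requires at least one letter. The names GROUPED, OPTIONAL and accepts are made up for this illustration, and the thread does not say whether this is the exact cause of the whitespace bug.

```python
import re

# Rough analogue of  [ M | D | C | L | X | V | I ]+
# (grouping only: one or more numeral letters).
GROUPED = re.compile(r"(?:M|D|C|L|X|V|I)+")

# Rough analogue of  ( M | D | C | L | X | V | I )+  when parentheses
# mean "optional": repeating an optional item is the same as zero or more.
OPTIONAL = re.compile(r"(?:M|D|C|L|X|V|I)*")


def accepts(pattern: re.Pattern, s: str) -> bool:
    """True if the whole string s is matched by pattern."""
    return pattern.fullmatch(s) is not None


if __name__ == "__main__":
    for s in ["XIV", "MMXX", "", "XA"]:
        print(repr(s), accepts(GROUPED, s), accepts(OPTIONAL, s))
```

Either pattern accepts any run of these seven letters, including sequences such as IIII or VX that are not well-formed numerals, so it only approximates roman numerals.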