CambridgeSemiticsLab / nena_corpus

The NENA corpus in plain-text markup
Creative Commons Attribution 4.0 International

Rare characters alpha, dotless i, small y #8

Closed: codykingham closed this issue 4 years ago

codykingham commented 4 years ago

In the corpus we get two characters that occur pretty rarely: Latin alpha (ɑ, 96x) and dotless i (ı, 9x).

My understanding is that these represent characters in a foreign word. We already have a feature, foreign, that indicates a word is in a foreign language. 103/109 of the words containing an alpha or dotless i are already marked as foreign (based on the lack of italics in the original MS source).

The remaining 6/109 words containing an alpha are not marked as foreign for some reason (the numbers are TF nodes):

[(774877, '⁺tɑ̄̀n '),
 (774924, '⁺Hayə̀stɑn,ˈ '),
 (774929, '⁺tɑn='),
 (775422, 'Téhrɑn '),
 (813725, 'Šɑ̄h '),
 (813726, 'Abbɑ̄̀sˈ ')]

@GeoffreyKhan Can we go ahead and normalize alpha and dotless i to "a" and "i" (respectively)? Is there any reason to keep these characters? I'm developing a list of acceptable characters for our database, and it would be nice to normalize these. I can adjust all cases automatically. Then I can add the foreign feature to the 6 words that are lacking it.
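For concreteness, a minimal sketch of what such a normalization pass could look like (illustrative only; the table and function name here are assumptions, not the actual nena_corpus pipeline code):

```python
# Illustrative sketch, not the actual pipeline code.
# str.translate maps code points one-to-one, so combining marks
# (e.g. the macron and grave in 'ɑ̄̀') carry over onto the new base letter.
NORMALIZE = str.maketrans({
    'ɑ': 'a',  # U+0251 LATIN SMALL LETTER ALPHA, 96x in the corpus
    'ı': 'i',  # U+0131 LATIN SMALL LETTER DOTLESS I, 9x
})

def normalize(text):
    return text.translate(NORMALIZE)

assert normalize('Téhrɑn') == 'Téhran'
assert normalize('⁺tɑ̄̀n') == '⁺tā̀n'
```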

codykingham commented 4 years ago

Same question about modifier letter small y (ʸ, 15x). Can we normalize this to "y"?
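If so, the same (hypothetical) translation table from the sketch above would just gain one more entry:

```python
NORMALIZE = str.maketrans({
    'ɑ': 'a',  # U+0251 LATIN SMALL LETTER ALPHA
    'ı': 'i',  # U+0131 LATIN SMALL LETTER DOTLESS I
    'ʸ': 'y',  # U+02B8 MODIFIER LETTER SMALL Y, 15x
})
```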

GeoffreyKhan commented 4 years ago

Dear Cody,

Yes, for the purpose of the database it is perfectly fine to normalize these characters as you suggest.

Best wishes

Geoffrey


codykingham commented 4 years ago

Now done in a137d37403e9b650ef24bdda0d5715fe45b2e39b.