Closed codykingham closed 4 years ago
Same question about small letter y (ʸ, 15x). Can we normalize this to "y"?
Dear Cody,
Yes, for the purpose of the database it is perfectly fine to normalize these characters as you suggest.
Best wishes
Geoffrey
On 30/03/2020 11:39, Cody Kingham wrote:
In the corpus we get these two characters that occur pretty rarely: alpha "a" (ɑ, 96x) and dotless "i" (ı, 9x).
My understanding is that these represent characters in a foreign word. We already have a feature |foreign| that indicates a word is in a foreign language. 103/109 words containing an alpha or dotless "i" are already marked as |foreign| (based on lack of italics in original MS document source).
6/109 words containing an alpha are not marked as foreign for some reason (numbers are TF nodes):
[(774877, '⁺tɑ̄̀n '), (774924, '⁺Hayə̀stɑn,ˈ '), (774929, '⁺tɑn='), (775422, 'Téhrɑn '), (813725, 'Šɑ̄h '), (813726, 'Abbɑ̄̀sˈ ')]
@GeoffreyKhan Can we go ahead and normalize alpha and dotless "i" to "a" and "i" (respectively)? Is there any reason to keep these characters? I'm developing a list of acceptable characters in our database, and it would be nice to normalize these. I can automatically adjust all cases. Then I can add the |foreign| feature to the 6 words that are lacking it.
-- Geoffrey Khan Regius Professor of Hebrew University of Cambridge
Faculty of Asian and Middle Eastern Studies Sidgwick Avenue Cambridge CB3 9DA UK
Now done in a137d37403e9b650ef24bdda0d5715fe45b2e39b.
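For readers who want to see roughly what such a normalization involves, here is a minimal Python sketch. It is not the code from the commit above; the names `NORMALIZE_MAP` and `normalize_chars` are hypothetical, and the code points are assumed to be U+0251 (alpha), U+0131 (dotless i) and U+02B8 (small letter y).

```python
# Illustrative sketch only: NOT the code from the commit referenced above.
# Assumed code points for the three characters discussed in this thread.
NORMALIZE_MAP = str.maketrans({
    "\u0251": "a",  # ɑ  LATIN SMALL LETTER ALPHA
    "\u0131": "i",  # ı  LATIN SMALL LETTER DOTLESS I
    "\u02b8": "y",  # ʸ  MODIFIER LETTER SMALL Y
})


def normalize_chars(text: str) -> str:
    """Replace the three rare characters with plain a, i, y.

    Only the base characters are swapped; combining diacritics that
    follow them (macrons, grave accents, ...) are left in place.
    """
    return text.translate(NORMALIZE_MAP)


def needs_normalizing(text: str) -> bool:
    """Return True if the text still contains any of the rare characters."""
    return any(ch in text for ch in "\u0251\u0131\u02b8")
```

With this sketch, `normalize_chars('Téhrɑn ')` gives `'Téhran '`, and `needs_normalizing` can be run over all word texts afterwards to confirm nothing was missed.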