Annotation of Classifiers in the Egyptian-UJaen Treebank

UD-Egyptian commented 5 months ago

Dear colleagues,

Prof. Marco Passarotti and I were discussing the annotation of classifiers in UD today, and we think that this topic should be discussed here as well. UD has the DEPREL clf for classifiers. It is defined as "a word which accompanies a noun in certain grammatical contexts". In Egyptian, classifiers are not words, but signs that provide general or specific information about the word they accompany (see example 1 in the attached file). As this information is not phonetic, but semantic, I did not annotate them in the first release of the Egyptian treebank. But they have the same function as classifiers in other languages or writing systems, such as Chinese and Akkadian. The question now is how to annotate them in UD. It seems to me that there are two possibilities:

1) The use of the Gardiner list, a classification of hieroglyphs that gives each hieroglyph an ID consisting of a letter + a number; for example, the ID of the sky sign is N1. The word p.t "sky" could be transcribed as p.t(N1). The classifier could be annotated as follows:

1 p.t p.t NOUN Gender=Fem|Number=Sing 0 root SpaceAfter=No
2 (N1) (N1) SYM Animacy=Inan 1 clf _

Problems with this annotation:

a) Technically, Egyptian classifiers are not symbols. However, they have some functions in common with symbols. This is just a terminological problem.

b) There are many signs used as classifiers which are not registered in the Gardiner list. When this happens, I use (&) for the sign without an ID. To solve this problem, I think that I could publish a list of the new classifiers in the repository of the Egyptian treebank. The ID of each classifier would be the abbreviation of the source (e.g. PT for Pyramid Texts) and an ordinal number (see example 2 in the attached file). But I am not sure about this measure.

c) Classifiers are sometimes written between the stem of a verb and its ending (see example 3 in the attached file). Ideally, one would annotate the classifier in UD between the verb stem and its ending. But I do not know how to do this, so I have annotated classifiers at the end of the verb form, for example i҆bꜣ.tn(Y6) ⸗f instead of i҆bꜣ(Y6).tn ⸗f.

2) The second way is to describe what the classifier depicts, for example: p.t[SKY] or i҆bꜣ.tn[DRAUGHTSMAN] _⸗f. This has recently been suggested by other Egyptologists, see Harel et al. 2023, Mapping the Ancient Mind: iClassifier, a New Platform for Systematic Analysis of Classifiers in Egyptian and Beyond, in: Lucarelli/Roberson, Ancient Egypt, New Technology, 130-158. However, this annotation also has problems, because there are classifier signs for which we do not know what they represent.
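As an illustration of option 1, here is a minimal Python sketch that pulls parenthesised sign IDs out of a transliterated form and records the offset at which each one was written, so that a word-internal classifier could in principle be stored without moving it to the end of the word. The function name `split_classifiers`, the regular expression for Gardiner codes, and the offset convention are assumptions made for this example; they are not part of UD or of the treebank.

```python
import re

# Parenthesised sign ID: a Gardiner-style code such as N1 or Aa1,
# or "&" for a sign without an ID (the convention described above).
CLF = re.compile(r"\(([A-Z][a-z]?\d+[A-Za-z]?|&)\)")

def split_classifiers(form):
    """Return (bare_form, [(sign_id, offset), ...]).

    offset is the position in bare_form at which the classifier was written.
    """
    bare = []
    found = []
    pos = 0        # read position in the annotated form
    out_len = 0    # length of the bare form built so far
    for m in CLF.finditer(form):
        chunk = form[pos:m.start()]
        bare.append(chunk)
        out_len += len(chunk)
        found.append((m.group(1), out_len))
        pos = m.end()
    bare.append(form[pos:])
    return "".join(bare), found

print(split_classifiers("p.t(N1)"))
```

Offsets are counted in Unicode code points, so combining marks in the transliteration count as separate positions.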

What do you think about this question?

Best, Roberto Egyptian_Classifiers.pdf

amir-zeldes commented 5 months ago

That's an interesting question! I think fundamentally you'd need to make a decision about whether you are trying to capture Egyptian as it was (presumably) spoken, or to encode the written system. Personally, I would prefer the latter, since 1. we don't really know exactly how Ancient Egyptian was spoken (for example we don't have most of the vowels, and some of our interpretations of the consonants are also not certain), and 2., throwing out the classifiers would be a loss of information.

If you accept the premise that the classifiers should be represented in the treebank, I think you have two or three options:

1. Make them into tokens, as you suggest
2. Fold them into feature annotations on the lexemes they categorize
3. (this is a variant of 1, you could use MWTs and ignore the classifier in the MWT token, but realize it in the analyzed subtokens)

The first option brings written Egyptian in line with languages that have phonologically verbalized classifiers, like Chinese or Japanese. The second is more of a compromise, saying something like "I want to preserve this information so I'll annotate it, but these aren't exactly words in the language". Both have merits, but maybe I like 2. a little more, since it allows you to 'have your cake and eat it'. Additionally, 2. introduces ways of encoding word-internal classifiers without disrupting the syntax tree.

Option 3. is sort of a sub-version of 1., but I think it's maybe the most confusing thing to do. It allows you to say "on a plain words level, Egyptian has no classifiers, but underlyingly in some re-analyzed form, they are there". This is maybe similar to saying "French really only has an overt word 'au', but underlyingly we can think the words 'à' and 'le' are in there". The difference is that French really does utter words like 'à' and 'le' in other environments, and the Egyptian classifiers in question are presumed to be totally unpronounced.

Finally, regarding what notation to use for the classifier, I would prefer something graphemic over semantic ([SKY]) or pseudo-phonological, since semantics are debatable (and not always known, as you say), and phonology is not really relevant here - plus some hieroglyphs have multiple pronunciations. So in sum, I would say Gardiner codes make the most sense, since hieroglyphs are guaranteed to have those and it just represents the data, with minimal interpretation. Just my 2c of course!

dan-zeman commented 5 months ago

My first impression is that this is a different usage of the term classifier from the one used in UD. Here it is mostly about the writing system, so it is not part of the language proper (because it would not be pronounced). This should be documented as yet another case where established language-specific terminology clashes with the crosslinguistic terminology in UD, and the Egyptian classifiers should then either get a different term, or be always qualified by an adjective to avoid confusion.

Then I think the Egyptian "classifier" should not have a separate line, it should be part of a word together with the phonetic material. And maybe one could simply provide the text in the Egyptian characters (Unicode) so that we do not have to search for an ID that would represent them.

Stormur commented 5 months ago

Could there be some hybrid annotation where we allow a multiword token to be decomposed in VERB/NOUN/... + SYM, even in an "infixing" case like the last one? I do not know how much this coincides with option 3 by Amir above.

Else I agree with Dan that these are not classifiers in the "Asiatic" sense and that there should be no lexical annotation about them. Maybe a parallel one in MISC?

amir-zeldes commented 5 months ago

Else I agree with Dan that these are not classifiers in the "Asiatic" sense and that there should be no lexical annotation about them. Maybe a parallel one in MISC

Yes, MISC is always an option, but maybe even in FEATS, since this is really a word-level class attribute, a bit like a large gender system in Bantu languages etc. (except that as far as we can tell, it only applies in writing)

Stormur commented 5 months ago

Else I agree with Dan that these are not classifiers in the "Asiatic" sense and that there should be no lexical annotation about them. Maybe a parallel one in MISC

Yes, MISC is always an option, but maybe even in FEATS, since this is really a word-level class attribute, a bit like a large gender system in Bantu languages etc. (except that as far as we can tell, it only applies in writing)

This is the big divide, I think: we should not put extra-morphosyntactic information in FEATS. And also, as discussed, a very specific classification system is needed here.

Bantu classifiers are truly part of the morphonology of the language (whether we might find a more general and harmonised way to annotate them than the current ultra-specific one is another story).

sylvainkahane commented 5 months ago

I agree with @dan-zeman: we must avoid introducing tokens for elements which are not true linguistic units. But the internal structure of the written form can be made explicit in special features similar to MSeg and MGloss. It cannot be MSeg and MGloss themselves (which concern the morphological decomposition), so we must propose new features specific to the written form. What should we call them? WSeg and WGloss?

amir-zeldes commented 5 months ago

Yes, agreed with everyone that introducing tokens is the least UD-like option. If MISC were used for this, then any key could be used, for example HieroClf=A1 etc. But I'm not sure we shouldn't just treat this as FEATS. There is already a precedent of using NounClass with lots of values for Bantu, and this is largely a property of lexemes (but also of verbs in Egyptian, so NounClass wouldn't be quite right).

Another question is whether or not to include the classifier in the textual representation, so is the noun:

1   p.t.N1  p.t NOUN    _   Gender=Fem|Number=Sing|WordClass=N1 0   root    _   SpaceAfter=No

or just:

1   p.t p.t NOUN    _   Gender=Fem|Number=Sing|WordClass=N1 0   root    _   SpaceAfter=No

If we remove it from the word's FORM field, then the original text is no longer reconstructible from the tokens (though admittedly if we're using phonological transcriptions like "p.t", the original hieroglyphs can't be reconstructed either way)
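To make the FEATS-versus-MISC choice concrete, here is a small Python sketch that reads one token line and looks for the classifier under either key. The ten-column layout is standard CoNLL-U; `WordClass` and `HieroClf` are the keys floated in this thread, not established UD features, and the helper names are invented for the example.

```python
NAMES = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
         "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def parse_token(line):
    """Split one tab-separated CoNLL-U token line into a dict of its ten columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated columns, got %d" % len(cols))
    return dict(zip(NAMES, cols))

def key_values(field):
    """Turn 'A=1|B=2' into {'A': '1', 'B': '2'}; '_' means the field is empty."""
    if field == "_":
        return {}
    return {k: v for k, _, v in (item.partition("=") for item in field.split("|"))}

# Token line modelled on the second example above (classifier kept out of FORM).
line = "1\tp.t\tp.t\tNOUN\t_\tGender=Fem|Number=Sing|WordClass=N1\t0\troot\t_\tSpaceAfter=No"
tok = parse_token(line)
print(key_values(tok["FEATS"]).get("WordClass"))  # classifier stored in FEATS
print(key_values(tok["MISC"]).get("HieroClf"))    # None here: not stored in MISC
```

Either placement is equally easy to query, so the decision rests on what FEATS is meant to contain rather than on tooling.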

UD-Egyptian commented 5 months ago

Thank you for your interesting comments. It is true that information will be lost if Egyptian classifiers are not annotated. However, this information is dispensable in a morphosyntactic analysis because Egyptian classifiers mainly provide semantic information. Thus, the first conclusion is that the annotation of Egyptian classifiers is not needed in UD.

However, the treebank could be useful to researchers of the Egyptian script if classifiers were annotated with a key in the features of the words they accompany in the text. What is still unclear is where the annotation should be placed, in MISC or in FEATS? As suggested by Amir, the key should be for example HieroClf=A1, and for those signs without an ID in the Gardiner list HieroClf=(x). This would help future researchers to identify the classifiers for their analysis. Would that be an acceptable solution?

dan-zeman commented 5 months ago

However, the treebank could be useful to researchers of the Egyptian script if classifiers were annotated with a key in the features of the words they accompany in the text. What is still unclear is where the annotation should be placed, in MISC or in FEATS? As suggested by Amir, the key should be for example HieroClf=A1, and for those signs without an ID in the Gardiner list HieroClf=(x). This would help future researchers to identify the classifiers for their analysis. Would that be an acceptable solution?

FEATS would be acceptable for me, although this is about orthography rather than the language proper. We already have at least one feature that pertains exclusively to orthography (Typo); and another example where the morphological annotation depends on orthography is PROPN in some languages (not in English).

Nevertheless, my preferred solution would be to actually provide the hieroglyphic text in the corpus directly. I would probably make it the main text (the FORM column) and move the current Romanization to the Translit attribute in MISC. But it is also conceivable to reverse it, i.e., keep the transcription as the main text and put the hieroglyphs in MISC (either as Translit or as some new attribute such as Hiero).
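The two layouts described here are mechanically interchangeable. The following sketch, under the assumption that the hieroglyphs are stored as `Hiero=...` in MISC, moves them into FORM and keeps the old FORM as `Translit=...`. The function name `promote_hiero` is hypothetical, and the sign used in the demo is an arbitrary one chosen only to have something to print, not the actual spelling of p.t.

```python
def promote_hiero(cols):
    """Return a copy of a 10-column CoNLL-U token with Hiero moved into FORM.

    The old FORM is kept as Translit=... in MISC.
    Tokens without a Hiero attribute are returned unchanged.
    """
    cols = list(cols)
    misc = [] if cols[9] == "_" else cols[9].split("|")
    hiero = [m for m in misc if m.startswith("Hiero=")]
    if not hiero:
        return cols
    rest = [m for m in misc if not m.startswith("Hiero=")]
    rest.append("Translit=" + cols[1])
    cols[1] = hiero[0][len("Hiero="):]
    cols[9] = "|".join(sorted(rest))
    return cols

# Demo token; U+13000 is an arbitrary sign used for illustration only.
demo = "1\tp.t\tp.t\tNOUN\t_\tGender=Fem|Number=Sing\t0\troot\t_\tHiero=\U00013000|SpaceAfter=No".split("\t")
print("\t".join(promote_hiero(demo)))
```

The reverse direction would work the same way, so the choice of main text does not lock the treebank into one representation.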

UD-Egyptian commented 5 months ago

Nevertheless, my preferred solution would be to actually provide the hieroglyphic text in the corpus directly. I would probably make it the main text (the FORM column) and move the current Romanization to the Translit attribute in MISC. But it is also conceivable to reverse it, i.e., keep the transcription as the main text and put the hieroglyphs in MISC (either as Translit or as some new attribute such as Hiero).

There is a Unicode block for Egyptian Hieroglyphs based on the Gardiner list:

https://en.wikipedia.org/wiki/Egyptian_Hieroglyphs_(Unicode_block)

I tested them on a sentence from the Egyptian treebank:

It looks good to me. I have annotated the hieroglyphs in the MISC column because hieroglyphic texts usually omit important information such as the suffix pronoun i҆ (𓀀) used as a possessive pronoun or as a subject. However, there are still some problems: 1) Unicode hieroglyphs cannot be used on top of each other as in the original, cf.:

2) Uncommon signs are not in the Unicode list. In this case, the key Hiero=(x) can be used for the uncommon sign. Or I could contact the authors of the Unicode list and ask them to add new hieroglyphs to their list. Do you perhaps know the authors of the Unicode list?

dan-zeman commented 5 months ago

I have annotated the hieroglyphs in the MISC column because hieroglyphic texts usually omit important information such as the suffix pronoun i҆ (𓀀) used as a possessive pronoun or as a subject.

OK, good. I did not know that.

Unicode hieroglyphs cannot be used on top of each other as in the original

Is it correct to assume that for every sequence of hieroglyphs we can deterministically say what is the preferred rendering? For example, wide low characters want to be on top of each other, tall narrow characters do not? Then I would say that it is just an imperfection of the rendering software we are using, but the file encoding is fine. It would be a bigger problem if the top-down stacking actually conveyed extra information.

Uncommon signs are not in the Unicode list. In this case, the key Hiero=(x) can be used for the uncommon sign. Or I could contact the authors of the Unicode list and ask them to add new hieroglyphs to their list. Do you perhaps know the authors of the Unicode list?

This is a more severe problem but if you can represent such characters exceptionally by an ID, it could help. Unfortunately I do not know the authors of the Unicode block (but I suppose there should be some kind of contact/feedback at unicode.org).

yosiasz commented 5 months ago

https://unicode.org/reporting.html

UD-Egyptian commented 5 months ago

Unicode hieroglyphs cannot be used on top of each other as in the original

I have found a guide to place hieroglyphs (see file 1). 21248-egyptian-controls.pdf

But it is difficult to understand. The first step is to type Unicode hieroglyphs on the computer, because until now I have only copied and pasted them. Although I have Unicode Hex Input ("Unicode Hex-Eingabe") on my Mac, I cannot type Unicode hieroglyphs. For example, if I try to enter the code U+13000, I can only enter U+1300 and then I get this sign ጀ (I think it is Ethiopic). According to this page, I need a Unicode font and the UTF-16 code:

https://discussions.apple.com/thread/7940197?answerId=31697038022&sortBy=best#31697038022

Any help would be appreciated.
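The UTF-16 code mentioned above can be computed rather than looked up. Code points above U+FFFF are stored in UTF-16 as two 16-bit units (a surrogate pair). My understanding is that the macOS Unicode Hex Input layout expects those two units rather than the five-digit code point, but that is an assumption to verify on the machine in question. The helper name `utf16_units` is invented for this sketch.

```python
def utf16_units(code_point):
    """Return the UTF-16 code units for a code point as uppercase hex strings."""
    data = chr(code_point).encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]

print(utf16_units(0x1300))   # a single unit: the Ethiopic syllable
print(utf16_units(0x13000))  # two units: the surrogate pair for the first hieroglyph
```

This would also explain the symptom described: after four digits the input method already has a complete 16-bit unit and emits U+1300.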

Uncommon signs are not in the Unicode list. In this case, the key Hiero=(x) can be used for the uncommon sign. Or I could contact the authors of the Unicode list and ask them to add new hieroglyphs to their list. Do you perhaps know the authors of the Unicode list?

I found this contact for new hieroglyphs:

Thot Sign List (thotsignlist@gmail.com).

yosiasz commented 5 months ago

what are you trying to write it into? i can try to help, this is so fascinating

UD-Egyptian commented 5 months ago

I am trying to write Unicode hieroglyphs for the Egyptian Treebank. I can only copy and paste them, but I want to produce them using a code, for example U+13000. When I try it, I can only enter U+1300 and I get this sign ጀ. I cannot enter U+13000. I don't know why. If I cannot produce hieroglyphs, I cannot place them as they occur in the original. Here are the charts of the Unicode hieroglyphs:

Unicode_Basic.pdf

Stormur commented 5 months ago

Copying the hieroglyphs or inputting them through the code points should give the same result. The latter method might be more practical if you know all the codes already, else I do not see a difference! :-)

You are trying to enter codes that are longer than 4 hexadecimal digits. To do this, you have to pad them to 8 digits, e.g. U+00013000. This is because we have, as it were, two halves here, 0001 and 3000. For the vast majority of scripts the first is always 0000, which is why 1300 suffices for the Amharic character ጀ.

But I honestly do not know how to do this in a general text editor: copy-pasting still seems the easiest option to me. In Python, you can use e.g. '\U0001316c' for 𓅬.
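A short sketch of the Python route, showing that the eight-digit escape and `chr()` give the same single character, and that the standard `unicodedata` module can report the official name of a sign. Nothing here is specific to the treebank.

```python
import unicodedata

a = "\U0001316c"        # eight hex digits after an uppercase \U
b = chr(0x1316C)        # the same character built from its integer code point
print(a == b, len(a))   # the two are identical, and it is one character
print("U+%04X" % ord(a))

# The character name combines the script with a zero-padded Gardiner-style code.
print(unicodedata.name("\U00013000"))
```

Note that the lowercase escape `\u` takes exactly four digits, so `"\u1300"` is the Ethiopic character, not a hieroglyph.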

As for character combinations, this seems more complex. I tried combining 𓅬 (1316C) and 𓇳 (131F3) by means of 13434, but I did not succeed.

The Egyptian classifiers should go to MISC, but since they are written signs I would say that they naturally have to stay in the FORM: it is just their semantic annotation that is extra-morphosyntax. It would be interesting to see how much universalised a similar annotation can get (are cuneiform scripts also not using a similar logic?).

As for representing non-linguistic units, I can point to the fact that we already have Arabic numerals, which are purely symbolic, punctuation marks and other symbols, and that PUNCT and SYM are in fact non-lexical parts of speech pertaining only to the written medium. So this is not so different here: as long as these classifiers are factually part of the written expression and have some measure of independence (unlike, say, an apostrophe in Italian quant' vs. quanto), we could easily envision a segmentation which also takes into account SYMs.

UD-Egyptian commented 5 months ago

Copying the hieroglyphs or inputting them through the code points should give the same result. The latter method might be more practical if you know all the codes already, else I do not see a difference! :-)

If I cannot input hieroglyphs by using the code (first step), I cannot place them as in the original (second step), for example:

If I copy and paste them, I can only write them in a sequence, for example: 𓄡𓏏𓏤 The best for the treebank would be to write them as they are in the original.

You are trying to enter codes that are longer than 4 hexadecimal digits. To do this, you have to pad them to 8 digits, e.g. U+00013000. This is because we have, as it were, two halves here, 0001 and 3000. For the vast majority of scripts the first is always 0000, which is why 1300 suffices for the Amharic character ጀ.

Unfortunately, the hieroglyph does not appear when I enter U00013000. It just shows a blank. Do I need a font or something similar?

Stormur commented 5 months ago

Unfortunately, the hieroglyph does not appear when I enter U00013000. It just shows a blank. Do I need a font or something similar?

I do not think so if you can visualise them when you copy them.

If I cannot input hieroglyphs by using the code (first step), I cannot place them as in the original (second step), for example: If I copy and paste them, I can only write them in a sequence, for example: 𓄡𓏏𓏤 The best for the treebank would be to write them as they are in the original.

I am with Dan here that you probably have to give up on this representation for the time being. But:

this is a minor problem if, as Dan pointed out, the disposition of hieroglyphs is predictable;
you can still devise a way to represent their configuration by means of operators (+, :, parentheses...), if there is not yet one

amir-zeldes commented 5 months ago

the disposition of hieroglyphs is predictable

I don't think this is 100% correct, IIRC there are alternative ways of arranging the same hieroglyphs.

Another option is to use 'math-like' notation; I've seen people do this with both Gardiner codes and Unicode. You can use a string like:

(N35 / (X1 + Z4))

To mean:

The nice part of that is that the linear sequence of hieroglyphs is trivial to extract from such strings (N35 X1 Z4), but you can convey spatial layouts using mathematical operators which can't be confused with characters. I think this notation comes from an old Windows hieroglyph tool called WinGlyph (or maybe it's even older).
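The claim that the linear sequence is trivial to extract can be shown in a few lines of Python. The regular expression below is an assumed approximation of Gardiner-style codes (one capital letter or "Aa", a number, and an optional variant letter); it is not an official grammar of the notation.

```python
import re

# Letter category, number, optional variant letter not followed by a digit.
GARDINER = re.compile(r"[A-Z][a-z]?\d+(?:[A-Za-z](?!\d))?")

def linear_sequence(layout):
    """List the sign codes in a layout string in reading order, ignoring operators."""
    return GARDINER.findall(layout)

print(linear_sequence("(N35 / (X1 + Z4))"))
```

Because the operators and parentheses never occur inside a sign code, no real parsing is needed to recover the plain sequence.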

amir-zeldes commented 5 months ago

you can still devise a way to represent their configuration by means of operators (+, :, parentheses...), if there is not yet one

Whoops, just saw your comment @Stormur , that's exactly what I meant!

Stormur commented 5 months ago

the disposition of hieroglyphs is predictable

I don't think this is 100% correct, IIRC there are alternative ways of arranging the same hieroglyphs.

Or at least these "block combinations" have the same meaning overall... else I would not know where to bang my head!!! :exploding_head:

dan-zeman commented 5 months ago

I am trying to write Unicode hieroglyphs for the Egyptian Treebank. I can only copy and paste them, but I want to produce them using a code

You can try my tool here, then copy-paste the result from the page. The tool itself is ancient, I just added some Egyptian support now. Anything between -egy1- and -egy0- will be interpreted as hieroglyphs if possible. A period followed by a hexadecimal Unicode code point (e.g., .13000) will be replaced by the character corresponding to that code point. The range U+13000 to U+1342F is covered. I could use a different character than period if it is more convenient. Optionally, you can omit the initial "13" and you should get the same result. Furthermore, the Latin(-like) characters from the conversion table in your README should yield the corresponding phonetic Egyptian character.
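For readers who want to reproduce the behaviour described above offline, here is a rough Python approximation. It is not the code of the tool itself: the function name `expand_codes` and the exact matching rules (five hex digits, or three when the leading "13" is omitted) are assumptions based only on the description, and the conversion of Latin transliteration characters is left out.

```python
import re

FIRST, LAST = 0x13000, 0x1342F  # range stated in the description above

def expand_codes(text):
    """Replace .HEX codes inside -egy1- ... -egy0- spans with hieroglyphs."""
    def to_char(m):
        digits = m.group(1)
        if len(digits) == 3:            # short form: the leading "13" was omitted
            digits = "13" + digits
        cp = int(digits, 16)
        # Leave the code untouched if it falls outside the covered range.
        return chr(cp) if FIRST <= cp <= LAST else m.group(0)

    def in_span(m):
        # Five or three hex digits, not followed by a further hex digit.
        return re.sub(r"\.([0-9A-Fa-f]{5}|[0-9A-Fa-f]{3})(?![0-9A-Fa-f])",
                      to_char, m.group(1))

    # The span markers themselves are dropped from the output.
    return re.sub(r"-egy1-(.*?)-egy0-", in_span, text, flags=re.S)

print(expand_codes("-egy1-.13000.001-egy0-"))
```

Text outside the -egy1- ... -egy0- markers is passed through unchanged.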

Right now it does not do anything about the 2-dimensional placement of the characters but I can look into it later.

UD-Egyptian commented 5 months ago

Thank you Dan! This is great! and it is easy to use :D. When you have time, you can add the extended library of Unicode hieroglyphs. See attached file. Unicode_Extended.pdf.

As Amir and Stormur said, we can use a notation to place the hieroglyphs. The notation used in the JSesh editor is this:

Colon (:) to place hieroglyphs on top of each other, for example 13121:133CF corresponds to:

[image: Dok2]

Asterisk (*) to place hieroglyphs beside each other, for example 13121:133CF*133E4 corresponds to:

[image: Dok2]
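A minimal sketch of how such a group string could be read programmatically, assuming that the asterisk binds more tightly than the colon, so that a group is a stack of rows and each row is a sequence of signs placed side by side. The function names are invented for this example, and parentheses or deeper nesting are not handled.

```python
def parse_group(group):
    """Parse 'A:B*C' into rows of signs: [['A'], ['B', 'C']].

    ':' stacks rows vertically and '*' places signs side by side within a row.
    """
    return [row.split("*") for row in group.split(":")]

def to_chars(rows):
    """Convert the hexadecimal code points in parsed rows to characters."""
    return [[chr(int(code, 16)) for code in row] for row in rows]

rows = parse_group("13121:133CF*133E4")
print(rows)
print(to_chars(rows))
```

Flattening the rows again gives back the plain linear sequence of signs, so the layout information is purely additive.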

dan-zeman commented 5 months ago

When you have time, you can add the extended library of Unicode hieroglyphs.

All right, extending the coverage to U+143FF is easy (just wasting a bit more memory :-)) but it is unclear to me whether this is only a proposal at the moment, or has it already been approved; anyway, my system does not seem to support the new characters, so I get just the default blank boxes.

And it seems to be the case also with the formatting control characters, unfortunately, although they have been part of the standard for some time already. (Kind of reminds me of the early 1990s when there was a lot of excitement about Windows NT "supporting" the early versions of Unicode, but I could hardly use it in any of the programs I worked with... And it took at least a decade to improve.)

yosiasz commented 5 months ago

might some scripting be of help here? Python for example to do the heavy lifting?

UD-Egyptian commented 5 months ago

All right, extending the coverage to U+143FF is easy (just wasting a bit more memory :-)) but it is unclear to me whether this is only a proposal at the moment, or has it already been approved; anyway, my system does not seem to support the new characters, so I get just the default blank boxes.

According to this page, the proposal was approved in January 2024:

https://www.unicode.org/alloc/Pipeline.html

yosiasz commented 5 months ago

not sure if this might help

# Print the first nine signs of the Egyptian Hieroglyphs block (U+13000 to U+13008).
base = 0x13000
for n in range(0, 9):
    rah = chr(base + n)
    print(rah)

dan-zeman commented 5 months ago

According to this page, the proposal was approved in January 2024:

https://www.unicode.org/alloc/Pipeline.html

Good to know. It is nice but unfortunately it does not mean that all systems will immediately support it. One would have to find and install a font that supports the extended Egyptian block. And when I searched specifically for "extended", I found Aegyptus, which seems to have the glyphs, but not at the positions that were ultimately assigned to them. We'll have to wait but eventually a font should be available.

dan-zeman commented 5 months ago

not sure if this might help

base = 0x13000
for n in range(0, 9):
    rah = chr(base + n)
    print(rah)

Yes, that's essentially what I have in the tool mentioned above. The main problem is not that we could not generate the characters but that we will not see the correct glyphs (of some of the characters) because current fonts do not support them. Or at least my fonts don't. Try

base = 0x14000

instead. If you have a font with the extended hieroglyphic block, you should see hieroglyphs.

UD-Egyptian commented 5 months ago

Good to know. It is nice but unfortunately it does not mean that all systems will immediately support it. One would have to find and install a font that supports the extended Egyptian block. And when I searched specifically for "extended", I found Aegyptus, which seems to have the glyphs, but not at the positions that were ultimately assigned to them. We'll have to wait but eventually a font should be available.

Actually, the use of the extended library can wait because many hieroglyphs can be annotated using the basic Gardiner list. For now, it would be useful to find out how to arrange and combine hieroglyphs in your tool. This would allow a reliable annotation of hieroglyphs in the Egyptian treebank.

dan-zeman commented 5 months ago

Actually, the use of the extended library can wait because many hieroglyphs can be annotated using the basic Gardiner list. For now, it would be useful to find out how to arrange and combine hieroglyphs in your tool. This would allow a reliable annotation of hieroglyphs in the Egyptian treebank.

I am afraid that it depends on support within the font as well. This standard was approved earlier, so one would hope that it is already supported, but unfortunately it seems to be quite difficult to implement and there are no profit-related incentives to speed it up (sadly, not too many companies communicate in hieroglyphs these days :-)).

For the time being, I would propose that you use something along the lines of @Stormur's and @amir-zeldes' suggestions. ASCII colon will be more readable than U+13430 because editors and browsers know how to display it. And when we identify a font that supports the 2D character arrangement, we should be able to replace the ASCII characters with the Unicode control characters using a simple script.
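The "simple script" could look roughly like the following. It assumes that ASCII ":" stands for the vertical joiner (U+13430) and ASCII "*" for the horizontal joiner (U+13431), and it only substitutes an operator when it sits directly between two hieroglyphs, so that ordinary colons elsewhere in the file are left alone. The function names are hypothetical.

```python
VERTICAL_JOINER = "\U00013430"    # EGYPTIAN HIEROGLYPH VERTICAL JOINER
HORIZONTAL_JOINER = "\U00013431"  # EGYPTIAN HIEROGLYPH HORIZONTAL JOINER

def is_hieroglyph(ch):
    """True for characters in the basic sign range, U+13000 to U+1342F."""
    return 0x13000 <= ord(ch) <= 0x1342F

def ascii_to_controls(text):
    """Replace ':' and '*' by joiners, but only between two hieroglyphs."""
    out = []
    for i, ch in enumerate(text):
        between = (0 < i < len(text) - 1
                   and is_hieroglyph(text[i - 1])
                   and is_hieroglyph(text[i + 1]))
        if ch == ":" and between:
            out.append(VERTICAL_JOINER)
        elif ch == "*" and between:
            out.append(HORIZONTAL_JOINER)
        else:
            out.append(ch)
    return "".join(out)

print(ascii_to_controls("\U00013121:\U000133CF*\U000133E4"))
```

This is a character-for-character substitution. It does not check whether the operator precedence of the ASCII notation matches that of the Unicode controls, so explicit grouping with the segment controls may still be needed for more complex groups.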

UD-Egyptian commented 5 months ago

For the time being, I would propose that you use something along the lines of @Stormur's and @amir-zeldes' suggestions. ASCII colon will be more readable than U+13430 because editors and browsers know how to display it. And when we identify a font that supports the 2D character arrangement, we should be able to replace the ASCII characters with the Unicode control characters using a simple script.

Sorry, but I don't understand what ASCII means. Do you mean the annotation of hieroglyphs instead of codes? According to the Unicode charts for Egyptian hieroglyphs (see below), colon (:) is used for vertical groups and asterisk (*) for horizontal groups, for example 𓄡:𓏏 (vertical group) and 𓄡:𓏏*𓏤 (horizontal group). Would this annotation be valid to replace them when a font is found?

dan-zeman commented 5 months ago

Sorry. ASCII is an old standard. You can think of it as the first 128 characters of Unicode. By "ASCII colon" I meant ":", i.e. the colon character that your keyboard generates when you write in modern languages, i.e., nothing fancy that may look like a colon but reside somewhere in the hieroglyph block.

UniversalDependencies / docs

Annotation of Classifiers in the Egyptian-UJaen Treebank #1039