BenTalagan / glaemscribe

Glaemscribe, the Tolkienian languages/writings transcription engine.
https://glaemscrafu.jrrvf.com/english/glaemscribe.html

[Discussion/Specification] Relayout/remap legacy fonts? #15

Closed laicasaane closed 5 years ago

laicasaane commented 6 years ago

I've just found out that the non-breaking space is missing from the Annatar, Eldamar and Sindarin fonts. If I use that character, the text falls back to the default font and the spacing is incorrect.

Also, the Glaemscribe processor seems to ignore the non-breaking space. I've added a definition for it in the charset and used it in the mode, but the output text contains only normal spaces.

BenTalagan commented 6 years ago

Hi Laicasaane! Nice to see you here again. I've just checked these fonts and you're right. This is an interesting 'fine-tuning' matter that I've never thought about. I'll fix this and provide you with a solution today or tomorrow. I also have quite a few bug fixes/small changes here and there that I need to commit - they were waiting to be released in March, but I'll probably take the opportunity to do a release during the weekend then :)

BenTalagan commented 6 years ago

NB: After looking more closely, the non-breaking space does not seem to be missing from the Sindarin font.

laicasaane commented 6 years ago

You are right about Sindarin.

Just to clarify a little, here is an example of non-breaking space usage in my mode:

{PUNCTUATIONS} --> NBSPACE {_PUNCTUATIONS_} SPACE

The expected result is that p ! won't be separated at the end of a line.

BenTalagan commented 6 years ago

Thanks for the explanation; this is very clear, and your solution is very clever. I love it, because I always put a space before punctuation signs in tengwar, and I see no case where it should not be done that way. (Now I'm seriously considering adopting this approach for the Glaemscrafu website transcriptions :) )

For your specific need, adding the NBSPACE character to the charsets will indeed be sufficient. I am also looking for a way not to lose non-breaking spaces if they are present in the input.

BenTalagan commented 6 years ago

Okay. I've released version 1.1.8, and it should address your issue. It was a bit tricky, because under the hood the engine uses functions like "trim" in a couple of places, and these do unpredictable things to the nbsp. To avoid this, I pre-substitute every nbsp with another character (in the Dingbats range) and allow the use of a predefined variable {NBSP} which matches this input character.
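The trim/whitespace pitfall is easy to reproduce outside the engine. Here is a minimal Python sketch (illustrative only - Glaemscribe itself is not Python) of both the problem and the sentinel-substitution workaround; the specific sentinel character below is an arbitrary stand-in, not the one the engine actually uses:

```python
import re

s = "\u00a0hello\u00a0"

# Generic trim/strip treats U+00A0 as ordinary whitespace and removes it:
assert s.strip() == "hello"

# Unicode-aware \s matches NBSP too, so naive whitespace handling loses it:
assert re.sub(r"\s+", " ", s) == " hello "

# Workaround in the spirit of the engine's fix: swap NBSP for a sentinel
# character before processing, then swap back afterwards.
SENTINEL = "\u2740"  # arbitrary dingbat, assumed absent from real input
protected = s.replace("\u00a0", SENTINEL)
assert protected.strip() == SENTINEL + "hello" + SENTINEL
restored = protected.replace(SENTINEL, "\u00a0")
assert restored == s
```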

I had already added this technique for another mode that I will publish in March (a Japanese tengwar mode). I wanted to be able to use the '_は' combination in the input to disambiguate the 'は' kana from the subject particle 'は' (pronounced wa), and realized that '_' could not be used in rule input because it is already used for marking word beginnings/ends. So I have also added a predefined variable {UNDERSCORE} for that purpose.

Fonts have been updated accordingly.

Charsets have been updated accordingly, with the NBSP char definition.

I have also added the following rule to all tengwar modes: {NBSP} --> NBSP

This makes it possible to keep non-breaking spaces from the input.

In your use case, I believe a nice addition would be to treat punctuation signs the same way whether or not they are preceded by an nbsp. This would add the nbsp only if it is missing. A simple example:

({NULL},{NBSP}) . --> NBSP PUNCT_DOT

laicasaane commented 6 years ago

I have tested the new version, but the problem still persists. Ah, I'm using the mode editor for testing; I wonder if that's the cause?

screenshot

BenTalagan commented 6 years ago

Nay, it's because you've also tried to make it work with a regular space. I think it will be difficult to make it work that way, because the space is used at a low level in the engine for tokenizing words, so we're not on solid ground here (that's why there is no predefined {SPACE} variable that you could use here --- and the rule you wrote would transcribe "space!" as " !" :D )

However, if you try to input p! you will see that it outputs TINCO NBSP PUNCT_EXCLAM, which is what you were trying to achieve in the first place.

Still, there's a way to treat regular spaces the way you want: in the preprocessor, with a regexp. You can replace every \s*! with " !" (where the leading space is a non-breaking space), and generalize to all punctuation characters.

BenTalagan commented 6 years ago

Hum, maybe I didn't read one of your first posts well enough - sorry ^^

The expected result is that p ! won't be separated at the end of a line.

You were already explaining that you'd like to cover the regular space + punctuation case, but I missed that point because I was focusing on the rule itself. Still, I think it falls into the 'disambiguation/normalisation' category rather than into the transcription logic category, so it would be better done in the preprocessor (in the spirit of 'correcting' the user input).

However I've just found that there's a bug with my latest patch if you want to do it with the preprocessor. I'm investigating.

BenTalagan commented 6 years ago

Found the problem: it was due to my patching of special chars (underscore/nbsp) before applying the preprocessor, instead of between the preprocessor and the processor. I've released a patch. Now you should be able to write things like:

\rxsubstitute "\\s*([!.:;])" " \\1"

in the preprocessor (meaning: replace a series of spaces, or nothing, followed by a punctuation sign with nbsp + the punctuation sign).

Of course this simplifies your logic in the punctuation section a lot; you would only need to keep: {NBSP} --> NBSP.
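For anyone following along, the effect of that preprocessor rule can be mimicked in plain Python (illustrative only; the real rule runs inside glaemscribe's preprocessor, not Python):

```python
import re

NBSP = "\u00a0"

def normalize_punctuation(text):
    """Replace any run of whitespace (or nothing) before !.:; with a single
    NBSP, mirroring the thread's \\rxsubstitute "\\s*([!.:;])" rule."""
    return re.sub(r"\s*([!.:;])", NBSP + r"\1", text)

assert normalize_punctuation("p!") == "p" + NBSP + "!"
assert normalize_punctuation("p   !") == "p" + NBSP + "!"
```

Note that because `\s` also matches an existing nbsp, the rule is idempotent: input that already carries the nbsp comes out unchanged.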

You can also extend the rule to add a space AFTER the punctuation sign if it's missing (I think that was your intention).

laicasaane commented 6 years ago

I really want to avoid preprocessing as much as I can, because some things are simply not appropriate to preprocess.

BenTalagan commented 6 years ago

Spaces are inherent to the way words are separated. Processor rule groups define classes of characters that determine how "words" are built (a series of characters from the same class). Spaces are the only characters that are special when processing the input: they are not part of words (except for the nbsp, which could be). This can't be changed; it would mean complex and useless changes in the engine.

There's a preprocessor in glaemscribe made especially for these kinds of hacks, and it offers a solution to your problem: you just have to add one rule in the preprocessor, \rxsubstitute "\\s*([!.:;])" " \\1". So why complain?

laicasaane commented 6 years ago

I've made a test; it seems my issue is caused by the editor: it cannot preserve the non-breaking space character and always converts it into a normal space.

My initial choice was to put normal spaces around the punctuation. But then I realized that a normal space might cause the punctuation to be moved to the next line alone. Expected:

...(very long text)... p !

Real result:

...(very long text)... p 
!

So instead I want to put a non-breaking space before the punctuation, so that the whole block p ! is moved to the next line rather than just the !. Expected after using a non-breaking space:

...(very long text)...
p !
BenTalagan commented 6 years ago

I tested with the editor yesterday and it was working. But I had to be very careful, because copy-pasting the nbsp sometimes replaces it with a normal space.

Make sure there is a real nbsp in the rule \rxsubstitute "\\s*([!.:;])" " \\1", just before \\1.

One way to do it is to copy-paste it from the Chrome/Firefox console:

"\u00a0"

then press Enter, then copy-paste the resulting space.

My own nbsp did not survive the copy-pastes up to GitHub, it seems...

laicasaane commented 6 years ago

Right, the problem was copy-paste. I've printed the result's codepoints to the Firefox console and there is indeed a \u00a0.

BenTalagan commented 6 years ago

Cool, thanks for your confirmation!

I wish I could simplify the parsing of args so that we don't need the double escaping \\. It would also be nice if we could enter \u00a0 directly instead of a real nbsp in the preprocessor rules, but for now that does not seem to work. This is one of the parts that is really old and has not been refactored.

laicasaane commented 6 years ago

Nah, I think the problem is not copy-paste. After investigating the glaemscribe_editor.js file, I think this is the place where the editor loses all the NBSPs in the output:

transcribed_selector.html(ret)

I suggest replacing \u00a0 with &nbsp; before sending it to the output panel. You should also replace \n with <br> to retain line breaks.
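The suggested fix amounts to two string substitutions before the text is handed to the output panel. A Python sketch of the idea (the editor itself is JavaScript; escaping HTML metacharacters first is my addition, to keep the markup safe):

```python
import html

def to_display_html(text):
    # Escape HTML metacharacters first, then make the invisible characters
    # explicit so the browser cannot silently normalize them away.
    escaped = html.escape(text)
    return escaped.replace("\u00a0", "&nbsp;").replace("\n", "<br>")

print(to_display_html("p\u00a0!\nnext line"))  # p&nbsp;!<br>next line
```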

BenTalagan commented 6 years ago

😲 Damn, you're right! This looks like a Firefox 'bug'; I'm on Vivaldi (Chromium) and it doesn't behave the same. My test from the console:

$(".transcribed").html("1111\u00a01111")

Firefox effectively does not translate the nbsp. I guess there's no patch other than yours. I will investigate tonight; thanks for noticing!

laicasaane commented 6 years ago

Well, I've looked around and made a little test, but it didn't solve the problem. So I think the best approach would be to export the output to an actual file.

BenTalagan commented 6 years ago

Hmm, after investigating: when I input this command:

$(".transcribed").html("1111\u00a01111")

Firefox DOES put an nbsp there. It just does not show it in the element inspector, but if you do "Edit as HTML" it will show the nbsp.

What is your web browser?

laicasaane commented 6 years ago

Ah, yes, indeed I forgot to check "Edit as HTML". But either way it's not what I want. I just want to copy the output text and paste it into Word more easily, so I was looking for a way to retain the NBSP in the editor. But it turns out there is no solution, since no browser actually shows &nbsp; inside HTML text. So the only possible solution is a button to export the output text as a file. I have changed the Glaemscribe editor on my local machine to serve my current needs. The problem is now solved. 😃

BenTalagan commented 6 years ago

Ahhhh ok 😃

It's very clear to me now. Funnily enough, I've just checked: this bug is present in both Chrome and Firefox; when you select + copy, nbsps are lost (at least under macOS)!!

Worse than with the output, this is quite dangerous when copy-pasting the code of the mode itself (which I do between TextMate and the editor), so I should probably rework the preprocessor rule syntax to allow '\u00a0', for safety.

Also, I should probably later add a 'copy html' button to both the editor and the official Glaemscribe UI for that purpose.

Sorry for the misunderstandings in this thread; there was a lot of confusion - this is what you get when struggling with ghostly invisible chars! 😃 Anyway, it's cool that it's working now. And IMHO the solution using a preprocessor rule is simpler to write and understand, and more logical (you clean the input before processing).

laicasaane commented 6 years ago

You don't have to say sorry. It's me who wasn't clear enough from the beginning. Thanks a lot for your support!

BenTalagan commented 6 years ago

Ok, so I have rewritten the glaeml args parser to handle unicode-escaped characters (such as \u00a0), and also a few others: \n \t \\. This should not break any existing mode, but will allow you to cleanly write your preprocessor rules for non-breaking spaces. This is available in 1.1.12.
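As a rough illustration of the new unescaping, here is a hypothetical re-implementation in Python (this is not Glaemscribe's actual parser code, just the intended semantics):

```python
import re

# Hypothetical re-implementation of the 1.1.12 argument unescaping, supporting
# \uXXXX, \n, \t and \\ (illustrative; not Glaemscribe's actual source code).
def decode_glaeml_arg(raw):
    def repl(m):
        esc = m.group(0)
        if esc.startswith("\\u"):
            return chr(int(esc[2:], 16))
        return {"\\n": "\n", "\\t": "\t", "\\\\": "\\"}[esc]
    # Single left-to-right pass, so a decoded '\' is never re-interpreted.
    return re.sub(r"\\u[0-9a-fA-F]{4}|\\n|\\t|\\\\", repl, raw)

assert decode_glaeml_arg("a\\u00a0b") == "a\u00a0b"  # yields a real NBSP
```

Decoding in a single regex pass avoids the classic bug where a backslash produced by one substitution gets re-interpreted by the next one.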

I have also added a small toolbar to the editor. It allows copying the current transcription to the clipboard. It relies on a trick (create a hidden textarea / put the transcription inside / launch the copy). It works well under at least Chrome & Safari - textareas seem to behave better than regular divs, which lose nbsps when copying.

Unfortunately, Firefox still loses the non-breaking spaces, which is lame ^^;. I guess you're right: only a "Save to file" feature would be safe.

For now, I'm happy with the current toolbar because I'm working under Vivaldi, but if you think it'd be useful to have a "save to file" button, I could easily add it as well.

laicasaane commented 6 years ago

Thanks. Direct copying is great. But if browsers don't behave the same, then you should include a "save to file" button.

It's so wonderful that the parser can handle unicode escapes in rxsubstitute! Wah, does that mean we can now write {COMMA} === \u002c?

BenTalagan commented 6 years ago

Thanks. Direct copying is great. But if browsers don't behave the same, then you should include a "save to file" button.

I see no reason not to do it in the editor... so it's done and pushed 😃 (and it finally works well with Firefox!)

I have quite a lot of changes to the official UI, but I'm keeping their publication for March. I may add these export features to it too, but there are already quite a lot of options, so I'm always afraid of overloading the UX with too many features.

It's so wonderful that the parser can handle unicode escape in rxsubstitute! Wah, does that mean we can now write {COMMA} === \u002c?

Haha. Good question, but nope, sorry - because it is only implemented for glaeml arguments, and rules are written in glaeml text nodes. You're pushing the limits of the engine 😄

However, your question is interesting. I should probably add more predefined variables for characters that are used in the glaemscribe rules syntax. For the moment, we only have:

{NULL}, {NBSP}, {UNDERSCORE}

But others like {COMMA}, {ASTERISK}, {LPAREN}, {RPAREN}, {LBRACKET} and {RBRACKET} could also be added. It's hard for me to see the use cases, but why not. It might just cost a little more processing power.

laicasaane commented 6 years ago

I should probably add more predefined variables for characters that are used in the glaemscribe rules syntax. But others like {COMMA}, {ASTERISK}, {LPAREN}, {RPAREN}, {LBRACKET} and {RBRACKET} could also be added. It's hard for me to see some use cases, but why not. It may just cost a little more processing resources.

You don't have to do this kind of work, because a bunch of predefined variables for these symbols isn't what I need. I just think that being able to use unicode escapes on the right-hand side of === would be good sometimes, since my mode contains various intermediate symbols (produced by the preprocessor), and some symbols aren't properly shown in text editors.

For example, these rules convert numbers to their intermediate form with a decimal mark, digit group separators and negative brackets:

1.2     > 1٫2
-1.2    > 【1٫2】
1,000.4 > 1٬000٫4
\** Place negative numbers in brackets **\
\rxsubstitute "\\-(((\\d+(\\,\\d+)*)+((\\.\\d+)|(\\,\\d+)+)*)|((\\.\\d+)|(\\,\\d+)+)+)" "【\\1】"

\** Convert decimal mark **\
\rxsubstitute "[.](\\d)"      "\u066b\\1"
\rxsubstitute "(\\d)[.](\\d)" "\\1\u066b\\2"

\** Convert digit group separator **\
\rxsubstitute "^[,](\\d)"    ", \\1"
\rxsubstitute "\\s[,](\\d)"  ", \\1"

\rxsubstitute "(\\d)[,](\\d)" "\\1\u066c\\2"
\rxsubstitute "[,](\\d)"      "\u066c\\1"
{LNEGATIVE}           === 【
{RNEGATIVE}           === 】
{DECIMAL_MARK}        === ٫
{DIGIT_GROUP_MARK}    === ٬
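Condensed into Python for illustration (this folds the two decimal-mark rules into one and omits the leading-comma rules, so it is a simplification of the glaeml above, not a drop-in replacement):

```python
import re

def preprocess_number(text):
    """Illustrative Python rendering of the glaeml number rules above."""
    # Place negative numbers in brackets.
    text = re.sub(
        r"-(((\d+(,\d+)*)+((\.\d+)|(,\d+)+)*)|((\.\d+)|(,\d+)+)+)",
        "【\\1】",
        text,
    )
    # Decimal mark: '.' before a digit becomes U+066B.
    text = re.sub(r"\.(\d)", "\u066b\\1", text)
    # Digit group separator: ',' between digits becomes U+066C.
    text = re.sub(r"(\d),(\d)", "\\1\u066c\\2", text)
    return text

print(preprocess_number("1,000.4"))  # 1٬000٫4
print(preprocess_number("-1.2"))     # 【1٫2】
```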
laicasaane commented 6 years ago

However, I think you should add these symbols in addition to NBSP: word joiner, word divider, zero-width space, zero-width non-joiner.

BenTalagan commented 6 years ago

However, I think you should add these symbols beside of NBSP: word-joiner, word-divider, zero-width space, zero-width non-joiner.

Sure, because there is no other way to make them work right now (even if \uXXXX were working). The parsing of rules is done with regexps full of \s and calls to strip/trim, which are unreliable with the nbsp and other special spaces. So currently they would probably need the 'fake char' trick to work, because I don't see myself rewriting the whole rule parsing yet :)

You don't have to do this kind of work because a bunch of predefined variables for these symbols aren't what I need.

Yes, but they might prove useful at some point - imagine a mode where you'd like to use '' at the end of some words to mark special terminations: currently it's not possible, because it would probably break the rule. The same problem existed for the underscore, hence my patch.

I just think that if we can make unicode escape on the right-hand side of === that would be good sometime. Since my mode contains various intermediate symbols (as a result of the preprocessor), and some symbols aren't properly shown in text editors.

It could be cool to add the feature lower down, in the glaeml specification, and allow some escaped chars in text nodes. It would be very close to what I've done for glaeml args, but in a simpler version: only \\ and \uXXXX, maybe \t and \n, but even that is not certain. Then you would have it everywhere; you could even write everything as just a series of \uXXXX characters (which would be odd).

Since it is non-blocking (except maybe for special spaces), I'll keep it in mind but will take my time implementing these few points.

laicasaane commented 6 years ago

Since it is non blocking (except maybe for special spaces), I keep it in mind but will take my time to implement these few points.

Sure, that should be just a minor enhancement. 😃

laicasaane commented 6 years ago

Since we're still using the non-unicode layout, I think the word joiner should be implemented in the next release. I'm facing a situation like this when the paragraph alignment is set to justified and the value of the word in red is !.…: image

Currently I must use 2 NBSPs to make the word appear normal. But this is just a hack, really. image

There again, I (we) must fight the text processor and include a word joiner between every character.

BenTalagan commented 6 years ago

My diagnosis is different: the sequence !.… is an ugly mapping for something that should be a word. Web browsers therefore treat this sequence as something not really defined, between words and punctuation; Chrome, for example, will split the sequence if it is at the end of a line. Firefox does not seem to. I don't think playing with word joiners is the right way to go. My opinion is that we should continue remapping characters to cleverer places so that the canonical word wrap is not broken; e.g., all tengwar and tehtar should be mapped to non-punctuation, Latin letter characters, to avoid misinterpretation of what a word is.

It is probably a huge amount of work, however, since we should first identify which characters are badly mapped.

laicasaane commented 6 years ago

Do you have an automated tool to help move the glyphs around, or do you do it manually? By the way, which software are you using to modify the fonts?

BenTalagan commented 6 years ago

I use FontForge (the FontForge versions are now committed as sfds in the git repository) and a great dose of carefulness before making any change, since I have to keep all the fonts synchronized (there's an HTML table in the fonts/doc directory that summarizes all the changes that have been made, but it is becoming very complex to track).

laicasaane commented 6 years ago

Well, as things grow, I think it's time to make a proper text-processor-friendly layout. I've abandoned the idea of a unicode layout, since it would have many limitations (for example, we can't make page numbering work with Tengwar). So when you mentioned "remapping characters to cleverer places", I thought of mapping numbers to numerical slots, and vowels and consonants to their respective places in suitable Latin blocks. This is indeed a huge task, so I imagine I should start by sorting everything out in a spreadsheet, then somehow build an automated tool to do the remapping.

I remember you said you were afraid of losing glyph data (kerning, spacing, ...) when moving glyphs around. I'm not familiar with FontForge, so could you confirm that this is the case? If we use an automated tool to move the glyphs, could that issue still happen?

BenTalagan commented 6 years ago

I've abandoned the idea of a unicode layout since it would have many limitations (for example, we can't have the page numbering function to work with Tengwar).

Now that you mention it, I really wonder how one should describe punctuation, letters, and so on, so that word-processing software can interpret custom unicode characters for (for example) word-wrapping purposes. Because this is font-independent, there's probably no way to get these behaviours working within the private use area of unicode. (?)

I remember you've said that you were afraid of losing glyph's data (kerning, spacing,...) when moving the glyph around, I'm not familiar with FontForge so, could you confirm that is a case? If we use an automated tool to move the glyphs, could that issue happen?

I don't have a definite answer to this yet, but yes, that's exactly what I'm afraid of. We should take the time to look at FontForge's file format and at how glyphs point to each other through these features.

However, since the FontForge format is an ASCII format, one nice thing is that it lends itself to scripting (I've already made use of that convenience, e.g. for extracting data, which is harmless).

I would also rather avoid rushing into this without, perhaps, getting the opinion of Tolkien font designers. Glaemscribe is making its lonely way, but it'd be nicer if we converged on, or stuck to, universal choices and principles.

laicasaane commented 6 years ago

how, for example, one should describe punctuation, letters, and so on so that a word-processing software is able to interprete custom unicode characters for (for example) word wrapping matters

That's the limitation of custom unicode blocks in the Private Use Area. In my mind, we can remap the Tengwar letters and punctuation to the most equivalent Latin slots, for example: tehtar to Latin vowels, tinco to T, parma to P, pusta to the word-separator middle dot (⸱), double pusta to the colon (:), Tengwar numbers to Latin 1, 2, 3, 4, ..., and so on. Then a word without the Tengwar font applied might appear like pármâ ('á' and 'â' being the positional variants of the 'a' tehta on parma and malta respectively). And with 1, 2, 3, 4 carrying the exact values of the numbers, we can easily have page numbering in Tengwar in any word processor. (It might be tricky for numbers 10 & 11 if someone wants to use the duodecimal system.)
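Sketched as a lookup table, the proposal might look like the following; every slot assignment below is invented for the example and is not an agreed layout:

```python
# Hypothetical Latin-slot layout: the glyph names on the left are invented
# identifiers, and the codepoints on the right are illustrative choices only.
TENGWAR_TO_LATIN = {
    "TINCO": "t",
    "PARMA": "p",
    "A_TEHTA_ON_PARMA": "\u00e1",  # 'á': positional 'a' tehta variant
    "A_TEHTA_ON_MALTA": "\u00e2",  # 'â': another positional variant
    "PUSTA": "\u2e31",             # word-separator middle dot
    "DOUBLE_PUSTA": ":",
    "NUMBER_1": "1",
}

def render(glyph_names):
    """Turn a sequence of glyph names into their Latin-slot codepoints."""
    return "".join(TENGWAR_TO_LATIN[g] for g in glyph_names)

# Without the Tengwar font applied, parma + a-tehta degrades to readable Latin:
print(render(["PARMA", "A_TEHTA_ON_PARMA"]))  # pá
```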

it'd be nicer if we were emerging or sticking to universal choices and principles.

If those "choices" mean the old layout, then I don't think it is an appropriate path to follow nowadays, when word processors and browsers are capable of advanced language-aware behaviours which, unfortunately, badly affect Tengwar texts because of the old-school layout. Asking the font author's permission is indeed a must, but we apparently can't avoid this path. The sooner we work on this, the fewer headaches we will have in the (near) future.

Concerning the above problem, I think any font designer would approve the new layout without much resistance. (And, again, Mr Johan Winge has already given me permission to change the layout of Tengwar Annatar. He had some thoughts about the new layout too, but didn't have time to work on it.)

laicasaane commented 6 years ago

I have a feeling that there is some misunderstanding between us?

BenTalagan commented 6 years ago

I have a feeling that there is some misunderstanding between us?

I don't think so, but I may sound very hasty since I'm under a heavy professional workload (finishing a big project) atm :) So my answers may seem inaccurate or incomplete; sorry for that.

On the contrary, I believe we're on the same level of expectation and comprehension. However, I'd like to take time to think carefully about the implications of what we want to do before rushing in and working under pressure. I'd like to avoid proposing something too exotic, not universal, etc. So I feel there's a need to discuss these matters with as many people as possible who are experienced in both font design and tengwar, to avoid designing something flawed (through ignorance).

For example, one could be tempted to use the uppercase character range to map some bearer tengwar, but I'm not sure it's a good idea, because at the FreeTengwar font project, Mach has already considered the possibility of tengwar fonts with both uppercase and lowercase characters. So ideally, I think each bearer tengwa should be mapped to a lowercase char that has an uppercase counterpart.

Another example, which gives me headaches concerning the English mode, is the handling of sa-rinci and ar-rinci. There are a lot (and I mean a lot) of versions of these infernal hooks in the latest PE, and fonts are (without exception) not designed well enough to handle them. Laying them out will be complicated.

BenTalagan commented 6 years ago

I remember you said you were afraid of losing glyph data (kerning, spacing, ...) when moving glyphs around. I'm not familiar with FontForge, so could you confirm that this is the case? If we use an automated tool to move the glyphs, could that issue still happen?

I don't have a definite answer to this yet, but yes, that's exactly what I'm afraid of. We should take the time to look at FontForge's file format and at how glyphs point to each other through these features.

Ok, just to confirm: remapping a char by hand in FontForge breaks its kerning (the kerning info is removed completely). So it's probably better done with scripting, but one should be extra careful about how glyphs point to each other via indexes.
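An external remapper could be sketched as below. The 'Encoding: &lt;slot&gt; &lt;unicode&gt; &lt;glyph-index&gt;' line layout is an assumption about the sfd format that must be verified against FontForge's documentation, and the glyph index is deliberately left untouched so that index-based references between glyphs (kerning pairs, etc.) stay valid:

```python
import re

def remap_sfd(sfd_text, mapping):
    """Rewrite codepoints on 'Encoding:' lines of an .sfd file.

    Assumes lines of the form 'Encoding: <slot> <unicode> <glyph-index>'
    (an assumption - check your FontForge version). Glyph indexes are kept
    as-is so inter-glyph references are not broken.
    """
    def repl(m):
        slot, uni, gid = int(m.group(1)), int(m.group(2)), m.group(3)
        new_slot = mapping.get(slot, slot)
        new_uni = mapping.get(uni, uni) if uni >= 0 else uni
        return f"Encoding: {new_slot} {new_uni} {gid}"
    return re.sub(r"Encoding: (\d+) (-?\d+) (\d+)", repl, sfd_text)

sample = "StartChar: tinco\nEncoding: 49 49 2\n"
print(remap_sfd(sample, {49: 116}))  # remaps slot '1' (49) to 't' (116)
```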

laicasaane commented 6 years ago

I think each bearer tengwa should be mapped to a lowercase char that has an uppercase counterpart.

Apparently. In fact, all the Latin blocks combined surely have more than enough uppercase-lowercase pairs to hold the Tengwar, so you don't need to worry about this.

Laying them out will be complicated.

Could you briefly describe the issue concerning the sa-rincer & ar-rincer in the English mode? I don't understand what is complicated about remapping these glyphs.

one should be extra careful of how glyphs are pointing to each other with indexes

This could only be verified after actually remapping the glyphs of a font, couldn't it?

BenTalagan commented 6 years ago

In fact, all Latin blocks combined surely have more than enough uppercase-lowercase pairs to hold Tengwar. So you don't need to worry about this.

I'm quite confident too on that point :)

Could you briefly describe this issue concerning sa-rincer & ar-rincer in English mode? I don't understand what is complicated in re-mapping these glyphs.

There are a large number of versions of them, which are not really normalized: long, short, inclined or not, oriented left, oriented right, attached at the top, attached at the bottom. It's sometimes hard to know whether a right-oriented sa-rince has a left-oriented counterpart or not (and what it should look like).

It's unclear to me whether multiple versions of sa-rincer can be combined or not; I haven't found any example of this in Tolkien's works. But there are cases where we are stuck: for example, a word like axes in my English mode would tend to accumulate two sa-rincer on the x (k + short sa-rince + tehta + long sa-rince). I don't have a solution for handling this yet.

The flourished sa-rince is considered a graphical variant of the sa-rince by the FreeTengwar Font project and by Everson's proposal, but it's called an ar-rince in some versions of the quenya writing system in the latest PE, so in my opinion it would be better considered an independent tengwa.

Long sa-rincer may bear tehtar at the ends of words (see the Old English modes in Sauron Defeated). How do we handle complex combinations of multiple sa-rincer + tehtar?

Nothing else comes to mind yet, but I'm pretty sure there are other problems I can't recall. So, to answer your question quickly: remapping these chars is no more complicated than remapping any other, BUT there is (in my opinion) a design problem concerning them which would be nice to solve if we decide to publish a new layout.

This could only be verified after actually remapping the glyphs of a font, couldn't it?

I'm not sure I completely understand your question. I think that because FontForge loses kerning info when moving a glyph, we should not do it within FontForge, but with an external script. However, this means being really careful about how things are described in the sfd format, because glyphs point to each other for various reasons (kerning is one). It seems they use their internal index to point to each other, so maybe the remapping can be done without breaking the internal indexes. I need to investigate more atm.

laicasaane commented 6 years ago

As I understand it, the new layout really has nothing to do with this sa-rincer problem, because I am only suggesting that we remap all the existing glyphs from the Dan Smith layout to a new layout, which would first ease our work on Tengwar documents. Solving the composition of the sa-rincer (and other marks) would be another project and would come afterwards, as it mainly involves designing new glyphs or new positions for existing marks.

BenTalagan commented 6 years ago

As I understand it, the new layout really has nothing to do with this sa-rincer problem, because I am only suggesting that we remap all the existing glyphs from the Dan Smith layout to a new layout, which would first ease our work on Tengwar documents. Solving the composition of the sa-rincer (and other marks) would be another project and would come afterwards, as it mainly involves designing new glyphs or new positions for existing marks.

I do not totally agree with that way of thinking. If we design a new layout, it should be designed to solve the largest possible number of issues, so all these issues should be identified beforehand, so that we're not stuck afterwards. See my remark about uppercase tengwar above: remapping should take into account that one day we may have uppercase chars; the new layout is thus affected by that non-existing feature, in the sense that it's a good idea to map bearer tengwar onto characters that have uppercase counterparts. As well, even if we do not implement better versions of the sa-rincer, they should still be thought of during the specification stage. E.g. where should we keep room for them? In the Latin diacritic range? Or somewhere else?

One of the first steps before developing anything is to identify all the issues that the design of a new layout should resolve, and all the features it should offer.

By the way, I've started writing a remapper for sfd fonts. That should greatly help us advance later on.

BenTalagan commented 6 years ago

Useful links regarding line-breaking matters:

Unicode line breaking algorithm
Unicode line breaking classes

Thus, concerning digits, we could easily extend our mapping to other ranges, such as the Arabic numerals, as long as they belong to the 'NU' line-breaking class.
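A quick illustration of why NU-class digit ranges stay usable as numbers (Python, just to demonstrate the point):

```python
import unicodedata

# Arabic-Indic digits (line-breaking class NU) carry real decimal values,
# so generic software can still treat them as numbers.
for ch in "١٢٣":  # ARABIC-INDIC DIGIT ONE..THREE
    print(ch, unicodedata.decimal(ch))

# Python's int() understands any Unicode decimal digits:
assert int("١٢٣") == 123
```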

laicasaane commented 6 years ago

Thanks for the links, I'll read them later. But here are some quick thoughts:

As well, even if we do not implement better versions of sa-rincer, it should still be thought of during the specification stage. E.G. where should we keep room for them? In the diacritic range of latin ? Or somewhere else?

To me this is really not a complicated problem. It's true that we should consider every problem beforehand and leave room for non-existing glyphs, but there is plenty of room for this; we don't need to be overly careful. For example, regarding uppercase tengwar: if we intend to leave room for them, the number of existing marks that can be attached to a tengwa would be doubled; there would be two sets of marks, one for lowercase tengwar and another for uppercase. And even if some unforeseen glyphs appear in the future (exotic glyphs for some languages?), the total would hardly exceed all the Latin ranges combined.

Thus, concerning digits, we could easily extend our mapping to other ranges, like arabic numerical, as long as they belong to the 'NU' class.

I really don't encourage the use of any range that isn't Latin: the powerful linguistic functions of modern software can misinterpret a specific block of text and treat it differently because, to them, that block is in another language.

BenTalagan commented 6 years ago

To me this is really not a complicated problem. It' true that we should consider every problem before hand and leave place for non-existing glyphs. But there are plenty of rooms for this, we don't need to be very careful. For example, about the uppercase tengwar: if we intend to leave rooms for them, the number of existing marks that can be attached to a tengwar would be doubled; there would be 2 set of marks: one for lowercase tengwar, another for uppercase; and even if there are some unforseen glyphs might appear in the future (exotic glyphs for some languages?), in total it'd hardly exceed all the Latin ranges combined.

Yes, but ideally I'd have liked to keep a matching between lowercase and uppercase tengwar: if a tengwa (let's say parma) is mapped on p, it would be nice to have uppercase parma on P. I've not made an exact count of the bearer tengwar, but I believe there are between 50 and 100, so the slots get exhausted quite fast if we want to keep this regular.
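As a rough sanity check on the slot count (a minimal sketch; the scanned range, covering roughly Basic Latin through Latin Extended-B, is an arbitrary cut-off of my own, not anything decided in this thread), one can enumerate the Latin-range code points that come in lowercase/uppercase pairs:

```python
# Sketch: count Latin-range code points usable as lowercase/uppercase
# slot pairs (Basic Latin through Latin Extended-B, arbitrary cut-off).
candidates = [
    chr(cp)
    for cp in range(0x0061, 0x0250)
    if chr(cp).islower()
    and chr(cp).upper() != chr(cp)
    and len(chr(cp).upper()) == 1  # exclude e.g. ß, whose uppercase is 'SS'
]

print(len(candidates))  # comfortably above the ~50-100 bearer tengwar
```

So there seem to be enough cased pairs in the Latin blocks for the bearer tengwar alone, though regularity constraints would of course narrow the usable subset.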

As well, there are a lot of tehtar, and ideally I'd have liked some kind of regularity (e.g. all a-tehtar would use variants of a: â, ä, etc.), but it's really easy to exhaust the possibilities, especially because the Latin ranges do not always provide every variant of a diacritic for a, e, i, o and u. For example, the double-acute letters ő and ű exist, but not their a, e, i counterparts, which prevents us from using the double acute if we want to keep the regularity.
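The double-acute gap is easy to verify (a minimal sketch using Python's `unicodedata`): NFC normalization finds precomposed double-acute forms only for o and u.

```python
import unicodedata

DOUBLE_ACUTE = "\u030b"  # COMBINING DOUBLE ACUTE ACCENT

for base in "aeiou":
    nfc = unicodedata.normalize("NFC", base + DOUBLE_ACUTE)
    # Only ő (U+0151) and ű (U+0171) exist as precomposed letters, so
    # a, e and i stay as a two-character base + combining mark sequence.
    status = "precomposed" if len(nfc) == 1 else "decomposed"
    print(base, status)
```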

Also, it could be nice to take into account the fact that some fonts propose variants for ligature purposes.

I really don't encourage the use of any range that's not Latin. Since the modern-and-powerful linguistic-capable functions of nowaday softwares can misinterpret and treat a specific block of text differently because to them, that block of texts is in another langage.

There's simply no other choice for numbers. The only characters that belong to the numeric class in the Latin range are the digits, so, according to the line breaking spec, there's no other way than using numeric chars from another range, and respecting the numeric class is in my opinion the cleanest thing to do. (By the way, I've tested the string 0123456789١٢٣٤٥٦٧٨٩ in Firefox, Chrome, Safari and OpenOffice; they all treat it as one entity and the wrapping works well.)
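For illustration, here is a minimal sketch of how the relevant UAX #14 classes line up. The class table is hand-copied from the Unicode data for just the ranges discussed here, not computed from `LineBreak.txt`:

```python
# Hand-copied UAX #14 line-breaking classes for the ranges under discussion.
LINE_BREAK_RANGES = [
    (0x0030, 0x0039, "NU"),  # ASCII digits 0-9
    (0x0660, 0x0669, "NU"),  # Arabic-Indic digits
    (0xE000, 0xF8FF, "XX"),  # Private Use Area (resolved like class AL)
]

def lb_class(ch, default="AL"):
    cp = ord(ch)
    for lo, hi, cls in LINE_BREAK_RANGES:
        if lo <= cp <= hi:
            return cls
    return default

# Every character of the mixed string is class NU, so UAX #14 keeps the
# whole run together as a single number when wrapping lines.
mixed = "0123456789\u0661\u0662\u0663"
print(all(lb_class(c) == "NU" for c in mixed))
```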

BenTalagan commented 6 years ago

Hmm, I think there may be an alternative. If we make full use of the Private Use Area of Unicode, it looks like line breaking will work flawlessly (to be verified more thoroughly): if a sequence of characters wholly belongs to the private use space, the sequence will be treated as a word. The documentation I've given above states:

[...] Unassigned code positions, private-use characters, and characters for which reliable line breaking information is not available are assigned this line breaking property. The default behavior for this class is identical to class AL (NB: alphabetic characters). Users can manually insert ZWSP or WORD JOINER around characters of class XX to allow or prevent breaks as needed. [...]

If this is indeed true, my suggestion is to stick as much as possible to the Free Tengwar Font mapping for the basic tengwar and extend that mapping for the multiple tehtar variants. It seems that this is already what Enrique Mombello has done in elfica, if we look at the private use area carefully.

That would mean copying the tengwar from their initial place to the private use area, taking care not to lose the kerning information. This seems to be a cool solution, since the font would not lose its original mapping. But unfortunately that's not quite true: in the original DS mapping, punctuation signs are used to map some bearer tengwar, and in my opinion we should squash these slots to hold the real elvish punctuation signs there (the Everson/FTF norm puts them in the private use area, but maybe we should have them in place of the real punctuation signs for better browser handling? This remark also stands for brackets, parentheses, quotes, etc.).

So, to conclude, such versions of the fonts would not be easily usable with a standard keyboard. It's a major drawback and I don't know whether it's a good way to go - but still, I like this idea better than what we've discussed before, since we're not reinventing anything, only extending what already exists.
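The ZWSP / WORD JOINER escape hatch mentioned in the UAX #14 quote above can be sketched like this (the `glue` helper is mine, purely illustrative):

```python
WORD_JOINER = "\u2060"  # forbids a line break at its position (UAX #14 class WJ)
ZWSP = "\u200b"         # allows a break where none would otherwise occur

def glue(before, after):
    """Join two tokens with WORD JOINER so that line-breaking
    implementations never separate them - e.g. a tengwa and the
    punctuation sign that follows it, as in the NBSPACE use case above."""
    return before + WORD_JOINER + after

print(repr(glue("p", "!")))
```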

laicasaane commented 6 years ago

Using the PUA range is indeed the option most compatible with the nature of Unicode, since it doesn't break any Unicode rules, and initially I wanted to work on that option. But after considering usability with modern software, I've noticed that the PUA would drop us into a desert land where we can't easily (or at all) take advantage of any tool we currently have: the language functions are built into the OS, and there's no way for us to tell the OS and/or software what is what in the PUA range. And since my work mostly concerns actual documents in Tengwar, not just short texts, this time I highly prefer the usefulness of a new Latin-based layout.

BenTalagan commented 6 years ago

I have renamed the current topic, as it has drifted from its original subject. I am creating a new issue for the other subject that was still on hold, #18.

BenTalagan commented 6 years ago

But after considering usability with modern software, I've noticed that the PUA would drop us into a desert land where we can't easily (or at all) take advantage of any tool we currently have: the language functions are built into the OS, and there's no way for us to tell the OS and/or software what is what in the PUA range.

Sorry for being reluctant :) , but I'm not convinced. Do you have any concrete examples where a full PUA solution would have limitations compared to a Latin solution? I'm getting more and more drawn to a solution that sticks to the FTF project mapping (or a mixed solution, with tengwar in the PUA but punctuation moved to the Latin blocks, for example - though I'm not even sure that's needed: the FTF mapping is already a mixed solution, with some of the punctuation chars outside of the PUA).

Having some examples would help me a lot in understanding your position.

By the way, I have finished writing the remapping tool for sfd/FontForge files (with copy/move/delete directives and kerning preservation), so technically we're good, and we can really focus on the layout debate.