codemirror / codemirror5

In-browser code editor (version 5, legacy)
http://codemirror.net/5/
MIT License
26.76k stars, 4.96k forks

Bad letter boundary detection for complex scripts #2115

Open santhoshtr opened 10 years ago

santhoshtr commented 10 years ago

Paste the following text into Brackets, and see where the cursor is placed:

സന്തോഷ്

The cursor is supposed to be placed at the end of the word, but in Brackets it appears after 4 or 5 character widths.

Happens with all non-latin complex scripts

Works fine in Firefox, but the issue exists in Chrome.

(duplicated from https://github.com/adobe/brackets/issues/6301)

marijnh commented 10 years ago

This is a case of CodeMirror's simplistic grapheme cluster algorithm not handling the language. Unfortunately, JavaScript does not provide the primitives needed to do sane cluster-boundary detection (finding character properties, etc).

Happens with all non-latin complex scripts

Not all. Some, like Arabic, should work.

santhoshtr commented 10 years ago

This is a case of CodeMirror's simplistic grapheme cluster algorithm not handling the language. Unfortunately, JavaScript does not provide the primitives needed to do sane cluster-boundary detection (finding character properties, etc).

I would like to understand it a bit more. What exact algorithm do you need to place the cursor at a logically correct position? If we want to support a lot of languages, we should leave this kind of primitive functionality to browsers. Trying to imitate such behavior will reach no where.

Also, how does Chrome give different output than Firefox in this case?

marijnh commented 10 years ago

To know how to move the cursor through a text, and which ranges of codepoints to use when measuring character positions, CodeMirror needs to know where clusters start and end.

The browser knows this, but doesn't expose this information to JavaScript. Telling me that what I'm doing "will reach no where" without actually understanding the problem isn't really the right tone to take here.

santhoshtr commented 10 years ago

I have faced cursor movement and logical cluster issues in the development of Visual Editor for Wikimedia. I wanted to understand the problem in detail so that I might be able to help. I will check later; I don't have time to find out the details now. Thanks.

marijnh commented 10 years ago

Attached patch fixes some known problems with handling of extending code points, and appears to help with #2125 (Hindi), but does not fix your example.

I will need some input from someone who is familiar with this language's Unicode encoding, because the behavior of this string baffles me. Characters "ന്തോ" act as a single unit, as far as cursor movement is concerned, but only the second code point in that string is an extending character. If I read the document at http://www.unicode.org/reports/tr29/ correctly, this should count as three grapheme clusters, not one. What is going on?
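To make the three-cluster reading concrete, here is a minimal sketch. It is not CodeMirror's actual code: it assumes a simplified rule in which a code point extends the previous cluster only when it is a nonspacing or enclosing mark (`\p{Mn}` or `\p{Me}`). The name `splitSimpleClusters` is hypothetical, and the `\p{…}` property escapes require a newer JavaScript engine than the ones available when this thread started.

```javascript
// Sketch only: approximates a "simplistic" cluster rule where a code point
// extends the previous cluster only if it is a nonspacing or enclosing mark.
// `splitSimpleClusters` is a hypothetical name, not a CodeMirror function.
const EXTENDING = /^[\p{Mn}\p{Me}]$/u;

function splitSimpleClusters(text) {
  const clusters = [];
  // Iterating a string with for...of yields whole code points,
  // not individual UTF-16 code units.
  for (const codePoint of text) {
    if (clusters.length > 0 && EXTENDING.test(codePoint)) {
      // Extending character: glue it onto the cluster before it.
      clusters[clusters.length - 1] += codePoint;
    } else {
      // Anything else starts a new cluster.
      clusters.push(codePoint);
    }
  }
  return clusters;
}

// U+0D28 NA, U+0D4D VIRAMA (Mn), U+0D24 TA, U+0D4B VOWEL SIGN OO (Mc)
const sample = "\u0D28\u0D4D\u0D24\u0D4B";
console.log(splitSimpleClusters(sample).length); // 3
```

Under this rule the virama attaches to NA, while TA and the spacing vowel sign each start a new cluster. That gives the three clusters described above, even though the browser moves the cursor over all four code points as one unit.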

peterflynn commented 10 years ago

CC'ing @pauldhunt and @miguelsousa, who have worked on some of Adobe's open-source typography efforts -- just in case they have any quick insights to share :-)

Jaygiri commented 10 years ago

I have removed my previous comment.

This language is Malayalam. The fix for #2125 does not fix positioning for this language.

santhoshtr commented 10 years ago

Characters "ന്തോ" act as a single unit, as far as cursor movement is concerned, but only the second code point in that string is an extending character. If I read the document at http://www.unicode.org/reports/tr29/ correctly, this should count as three grapheme clusters, not one. What is going on?

You cannot rely on TR29 to get grapheme clusters for counting or cursor movement; TR29 itself explains this. You have to use tailored logic for your purpose. Even that is not enough, since in Indic scripts, depending on the font, multiple consonants joined by a character like VIRAMA can form a single ligature. Sometimes characters are stacked. Chrome and FF do not agree on how cursor movement works in Indic scripts. Chrome lets you move the cursor only along logical boundaries. FF follows the same rule, but FF allows placing the cursor elsewhere if you do it programmatically. You have to ask the browser whether a cursor can be placed at a given position. Iterating that question over a range of text gives you a reliable set of cursor positions. This can be used to build a stack of edits for undo, redo, and so on.
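The "ask the browser at every offset" idea can be sketched roughly as follows. Both helper names, `collectCursorPositions` and `textareaProbe`, are hypothetical. The probe assumes that a browser moves the caret away from an offset it considers invalid, which is exactly the behavior browsers disagree on, so treat this as an illustration of the loop rather than a working fix.

```javascript
// Sketch only. `collectCursorPositions` and `textareaProbe` are hypothetical
// helpers, not part of CodeMirror or of any browser API.

// Ask `canPlaceCursorAt(text, offset)` about every UTF-16 offset in the text
// and keep the offsets it accepts.
function collectCursorPositions(text, canPlaceCursorAt) {
  const positions = [];
  for (let offset = 0; offset <= text.length; offset++) {
    if (canPlaceCursorAt(text, offset)) positions.push(offset);
  }
  return positions;
}

// Browser-only probe: set the caret in a detached textarea and check whether
// it is still at the requested offset afterwards.
// ASSUMPTION: the browser relocates the caret when the offset is not a valid
// cursor position. A browser that accepts any offset makes this return true
// for every offset.
function textareaProbe(text, offset) {
  const area = document.createElement("textarea");
  area.value = text;
  area.setSelectionRange(offset, offset);
  return area.selectionEnd === offset;
}
```

In a browser this would be called as `collectCursorPositions(text, textareaProbe)`. Passing the probe in as a parameter keeps the loop independent of how the question is actually put to the browser.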

marijnh commented 10 years ago

By 'ask the browser' you mean create a textarea and try to set the cursor in the textarea there? Or is there a more efficient/convenient way to do it on (non-editable) DOM nodes?

Is there an easy/cheap way to determine whether a string might have stacking?

santhoshtr commented 10 years ago

By 'ask the browser' you mean create a textarea and try to set the cursor in the textarea there? Or is there a more efficient/convenient way to do it on (non-editable) DOM nodes?

Yes, create an editable node and keep on trying to place cursor. Of course it is inefficient and hacky.

Is there an easy/cheap way to determine whether a string might have stacking?

No, that is not possible. It not only depends on the data but also the font used.

marijnh commented 10 years ago

Is there an easy/cheap way to determine whether a string might have stacking?

No, that is not possible. It not only depends on the data but also the font used.

Well, I meant a way to weed out strings that obviously don't need the expensive treatment, and simply have a cursor position between every code point. Testing against /[^\x00-\x7f]/ would work to spot ASCII strings, but maybe we can do better and enumerate the ranges of the languages in which this occurs (using broad ranges to keep the string size under control; false positives aren't bad).
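A pre-filter along these lines might look like the sketch below. The name `needsClusterProbe` is hypothetical, and the ranges are illustrative assumptions: they are deliberately broad, they are not an exhaustive list of affected scripts, and they are not taken from CodeMirror.

```javascript
// Sketch only. The ranges below are illustrative assumptions, deliberately
// broad and not exhaustive; `needsClusterProbe` is a hypothetical name.

// Matches any code unit outside ASCII.
const NON_ASCII = /[^\x00-\x7f]/;

// U+0300-U+036F: combining diacritical marks.
// U+0590-U+109F: Hebrew through Myanmar, a span that includes Arabic, the
// Indic blocks (Malayalam among them), Thai, Lao and Tibetan.
const MAYBE_COMPLEX = /[\u0300-\u036f\u0590-\u109f]/;

function needsClusterProbe(text) {
  // Cheap exit: pure ASCII has a cursor position between every code point.
  if (!NON_ASCII.test(text)) return false;
  // Otherwise flag the string only if it touches one of the broad ranges.
  return MAYBE_COMPLEX.test(text);
}

console.log(needsClusterProbe("hello"));  // false
console.log(needsClusterProbe("സന്തോഷ്")); // true
```

False positives only cost an unnecessary expensive check. A script missing from the ranges would be skipped silently, so the list would need review before any real use.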

marijnh commented 10 years ago

@santhoshtr

Yes, create an editable node and keep on trying to place cursor. Of course it is inefficient and hacky.

On Firefox, it seems that selectionEnd can be set to any value, even one that's not a valid cursor position. Do you have any example of this technique actually being applied?

marijnh commented 10 years ago

(That is, I'm using a textarea now, because there I can play with selectionEnd without actually breaking the existing selection in the document. Using getSelection().addRange() is just too horribly disruptive: it will cause tons of side effects on mobile, and also cause spurious deselects/reselects on desktop.)

ghost commented 10 years ago

@marijnh Arabic doesn't work correctly either, same as Thai.

peterkroon commented 10 years ago

@marijnh https://github.com/marijnh/CodeMirror/issues/2115#issuecomment-31731752

The browser knows this, but doesn't expose this information to JavaScript.

Have you considered filing a bug for this at https://bugzilla.mozilla.org/ or https://code.google.com/p/chromium/?

alicoding commented 10 years ago

Wondering if there is any update or workaround to this bug yet?

marijnh commented 10 years ago

Nope, I still haven't found a hack that works halfway acceptably.

niftylettuce commented 8 years ago

I still have the same issue: if you set a custom font, like Inconsolata, the line height or cursor positioning is way off (until you start interacting, typing, or clicking in the textarea rendered into the .CodeMirror class).

niftylettuce commented 8 years ago

[Two screenshots: 2016-01-14 at 1:24:09 AM and 2016-01-14 at 1:24:03 AM]

sadig41 commented 6 years ago

Is it not possible to use RTL for Arabic?

adrianheine commented 6 years ago

This is an issue that's difficult, if not impossible, to solve with the fundamental approach currently taken by CodeMirror.

We are working on a rewrite (CodeMirror 6) that might address this issue, and we are currently raising money for this work: See the announcement for more information about the rewrite and a demo.

Note that CodeMirror 6 is by no means stable or usable in production, yet. It's highly unlikely that we pick up this issue for CodeMirror 5, though.

HTGAzureX1212 commented 3 years ago

Same issue here, the cursor seems to be completely mispositioned... I have used codeMirror.getDoc().setValue() though. [image]

Windows 10 1909 Chrome 86.0.4240.111