deltachat / message-parser

Parsing of Links, Email adresses, simple text formatting (markdown subset), user mentions, hashtags and more in DeltaChat messages.
https://deltachat.github.io/message-parser/

markdown - spec - current implementation of `_` is problematic #23

Open Simon-Laux opened 1 year ago

Simon-Laux commented 1 year ago

underscore _ is used often in text. especially for programmers, having things like ANDROID_NDK_ROOT in a message is not a rare occurrence.

currently ANDROID_NDK_ROOT parses into

[
    {
        "t": "Text",
        "c": "ANDROID"
    },
    {
        "t": "Italics",
        "c": [
            {
                "t": "Text",
                "c": "NDK"
            }
        ]
    },
    {
        "t": "Text",
        "c": "ROOT"
    }
]

which is not desired in most cases.
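The split above is what you would get from a rule that treats any pair of underscores as italics delimiters, even inside a word. A minimal Python sketch under that assumption: `naive_parse` and its regex are illustrative and are not the parser's actual code; only the `t`/`c` node shape is taken from the output above.

```python
import re

# Hypothetical naive rule: any "_..._" pair is italics, even inside a word.
NAIVE_ITALICS = re.compile(r"_(.+?)_")


def naive_parse(text):
    """Split text into Text/Italics nodes shaped like the JSON above."""
    nodes = []
    pos = 0
    for match in NAIVE_ITALICS.finditer(text):
        # Plain text between the previous match and this one.
        if match.start() > pos:
            nodes.append({"t": "Text", "c": text[pos:match.start()]})
        nodes.append({"t": "Italics", "c": [{"t": "Text", "c": match.group(1)}]})
        pos = match.end()
    # Trailing plain text after the last match.
    if pos < len(text):
        nodes.append({"t": "Text", "c": text[pos:]})
    return nodes


print(naive_parse("ANDROID_NDK_ROOT"))
```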

Proposed solution

remove underscore variants from bold and italics, so it's just *italics* and **bold**
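A rough Python sketch of what an asterisk-only rule could look like. The function name `parse_asterisk_only`, the regex, and the `Bold` node name are assumptions made for illustration (only `Text` and `Italics` appear in the output above); this is not the parser's actual code.

```python
import re

# Assumed rule: "**...**" is bold, "*...*" is italics, "_" is never special.
# The bold alternative comes first so "**b**" is not read as italics.
ASTERISK_STYLE = re.compile(r"\*\*(?P<bold>.+?)\*\*|\*(?P<italics>.+?)\*")


def parse_asterisk_only(text):
    """Return a flat list of Text/Italics/Bold nodes for the given text."""
    nodes = []
    pos = 0
    for match in ASTERISK_STYLE.finditer(text):
        if match.start() > pos:
            nodes.append({"t": "Text", "c": text[pos:match.start()]})
        if match.group("bold") is not None:
            kind, inner = "Bold", match.group("bold")
        else:
            kind, inner = "Italics", match.group("italics")
        nodes.append({"t": kind, "c": [{"t": "Text", "c": inner}]})
        pos = match.end()
    if pos < len(text):
        nodes.append({"t": "Text", "c": text[pos:]})
    return nodes


print(parse_asterisk_only("ANDROID_NDK_ROOT"))
print(parse_asterisk_only("*a* and **b**"))
```

With this sketch `ANDROID_NDK_ROOT` stays a single Text node, but asterisks inside a word are still treated as delimiters, so the rule only moves the problem from `_` to `*`.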

r10s commented 1 year ago

as we are currently not handling markdown styles at all, it may be okay to support a subset only - but it may still be regarded as a bug and ppl may be wondering.

are there other notable markdown implementations that skip underscore processing? is this maybe a usual thing meanwhile?

i am not too much for/in markdown at all, however, removing such a basic thing as _italic_ seems questionable: underscores are esp. nice as italics/underlining are typographically interchangeable - and with _italic_ one has kind of wysiwyg in plaintext. i see that use of underscore in plaintext/markdown very often, it feels natural. many markdown toolbars also use that variant for italics.

and: doesn't sth. like width*height*deep (not to speak of C dereferencing :) have the same issue? usually programmers know they have to backtick their stuff, cmp. eg. our changelogs :)

maybe the parser can also do better, eg. allowing _ on word boundaries only (which seems to be a general difference to the asterisk - also here on github, an asterisk inside words changes the style, whereas an underscore does not)

r10s commented 1 year ago

in general, if we do not want to rely on parsing rules as there are too many corner cases (markdown is only apparently simple, reminds me very much of yaml :) we could also leave the formatting characters in the resulting text.

i have seen this here and there, and think this is not the worst compromise. it also makes adaption easier, and a bad rule does not break things badly:

imagine passwords sent over delta chat with markdown support: this may be close to impossible, at least unreliable, if we introduce markdown

Simon-Laux commented 1 year ago

> but it may still be regarded as a bug and ppl may be wondering.

we define what we want to support in https://github.com/deltachat/message-parser/blob/master/spec.md

> maybe the parser can also do better, eg. allowing _ on word boundaries only (which seems to be a general difference to the asterisk - also here on github, an asterisk inside words changes the style, whereas an underscore does not)

sure that's an option, but also more complicated than just removing underscore or saying that it is only allowed on word boundaries. a side benefit of allowing it only on word boundaries is that it could make the parsing slightly faster.

> in general, if we do not want to rely on parsing rules as there are too many corner cases (markdown is only apparently simple, reminds me very much of yaml :) we could also leave the formatting characters in the resulting text.

also a valid angle, but does not look as good. Maybe rich formatting really needs a special editor that auto-escapes stuff you paste into it, but that's complicated and even Element doesn't get it right (I had problems with it).

> imagine passwords sent over delta chat with markdown support: this may be close to impossible, at least unreliable, if we introduce markdown

I think Element auto-escapes stuff you paste into it, another option would be to allow users to toggle markdown rendering in their messages... but right, there are more open questions around the whole rich message topic, also whether we want to send messages as html or not.

Simon-Laux commented 1 year ago

another example I came across: #deltachat_desktop becomes #deltachatdesktop, with desktop being displayed in italics font.

Simon-Laux commented 10 months ago

> another example I came across: #deltachat_desktop becomes #deltachatdesktop, with desktop being displayed in italics font.

the text emote -_- can also falsely trigger underline. again I think we should just remove it from the spec.

Simon-Laux commented 8 months ago

just tested WhatsApp, it has _ for italics and * for bold, but is more clever about it:

_hi_ - triggers italics
_hi_hi - does not trigger italics
_hi_hi_ - triggers italics
_hi _hi_ - triggers italics
_hi _hi _ - does not trigger italics

so italics only works when started and ended outside of a word. Which could also be a solution for us here.
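A word-boundary rule of this kind can be approximated with a single regex. The Python sketch below is a guess that reproduces the five WhatsApp observations above; it is not WhatsApp's actual rule and not this parser's code, and `triggers_italics` is a hypothetical helper name.

```python
import re

# Assumed rule: the opening "_" must not follow a word character and must be
# followed by non-whitespace; the closing "_" must follow non-whitespace and
# must not be followed by a word character. Note that \w also matches "_".
BOUNDARY_ITALICS = re.compile(r"(?<!\w)_(?=\S)(.+?)(?<=\S)_(?!\w)")


def triggers_italics(text):
    """Return True if the text contains an italics span under the assumed rule."""
    return BOUNDARY_ITALICS.search(text) is not None


samples = [
    "_hi_", "_hi_hi", "_hi_hi_", "_hi _hi_", "_hi _hi _",
    "ANDROID_NDK_ROOT", "#deltachat_desktop", "-_-",
]
for sample in samples:
    print(repr(sample), triggers_italics(sample))
```

Under the same assumed rule, the earlier problem cases `ANDROID_NDK_ROOT`, `#deltachat_desktop` and `-_-` are left as plain text.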

adbenitez commented 2 months ago

another bug in the implementation of _ is:

_[test](https://example)_

which should be displayed as:

test (in italics, linking to https://example)

is actually displayed as:

[test](https://example)_

(the [ is italic while the rest of the link is not parsed)

link2xt commented 2 months ago

This is also how pandoc handles it:

$ pandoc -f markdown -t html
_foo_bar_baz_
<p><em>foo_bar_baz</em></p>

$ pandoc -f markdown -t html
foo_bar_baz
<p>foo_bar_baz</p>

CommonMark has a very precise definition: https://spec.commonmark.org/0.31.2/#emphasis-and-strong-emphasis