Alexey-T / CudaText

Cross-platform text editor, written in Free Pascal
Mozilla Public License 2.0

Include punctuation chars as exceptions in default value for `links_regex` #5456

Closed pintassilgo closed 1 month ago

pintassilgo commented 3 months ago

I guess there's no official standard; any char can be part of a URL. But some chars, when they appear at the end, are usually treated as a boundary and not included in the link.

Above you see markdown rules, used by GitHub and many popular webpages and webapps. I just typed the URLs as plain text and they were automatically parsed to generate links when I submitted this comment.

You can choose which set of chars to exclude, but some important ones are universally ignored as part of the URL, and Cuda must remove them from the URL:

From what I see, currently only the last three are excluded in Cuda.

If it were up to me, I'd follow Sublime's rules, so these would be exceptions too:

But I understand if you disagree.

My main request here is to add ), ,, ., `, and ; as exceptions by default just like ' and " already are.

user.json line updated with the main request chars added:

"links_regex": "\\b(mailto:)?\\w[\\w\\-\\+\\.]*@\\w[\\w\\-\\.]*\\.\\w{2,}\\b|\\b(https?://|ftp://)\\w[\\w\\-\\.@]*(:\\d+)?(/([~\\w\\.\\-\\+\\/%@!%]|\\(.*?\\))*)?(\\?[^<>'\"),.`;\\s]+)?(\\#[\\w\\-\\./%:!]*)?",

the added part was ),.`;

Alexey-T commented 3 months ago

So you fully considered needed chars by "links_regex"? No manual work needed?

pintassilgo commented 3 months ago

Sorry, I didn't understand your comment.

I'm proposing to update the default value of links_regex by adding at least ),.`; as chars that must not be included in the URL when they appear at its right edge.

Maybe some other chars too, like “”’‘?!. I would include these too, but I understand if you disagree.

pintassilgo commented 3 months ago

Actually, I thought a little more about it and I believe ? and ! are also a must.

So my proposed updated version for default.json:

"links_regex": "\\b(mailto:)?\\w[\\w\\-\\+\\.]*@\\w[\\w\\-\\.]*\\.\\w{2,}\\b|\\b(https?://|ftp://)\\w[\\w\\-\\.@]*(:\\d+)?(/([~\\w\\.\\-\\+\\/%@!%]|\\(.*?\\))*)?(\\?[^<>'\"),.`;!?\\s]+)?(\\#[\\w\\-\\./%:!]*)?",

The added part compared to current release is ),.`;!?.

If you agree to also add “”’‘, great, but I'm fine if you reject. These ones aren't important.
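A quick way to try the proposed default outside the editor is to load the setting and run it with Python's `re` module. This is only a sketch: Python's regex engine stands in for whatever engine CudaText uses, and the two may differ in details. The `first_link` helper is a made-up name for illustration.

```python
import json
import re

# The proposed setting from the comment above, wrapped in braces so that it
# parses as JSON (the trailing comma is dropped).
SETTING = r'''{"links_regex": "\\b(mailto:)?\\w[\\w\\-\\+\\.]*@\\w[\\w\\-\\.]*\\.\\w{2,}\\b|\\b(https?://|ftp://)\\w[\\w\\-\\.@]*(:\\d+)?(/([~\\w\\.\\-\\+\\/%@!%]|\\(.*?\\))*)?(\\?[^<>'\"),.`;!?\\s]+)?(\\#[\\w\\-\\./%:!]*)?"}'''

# Assumption: Python's `re` is close enough to the editor's engine for this test.
LINKS = re.compile(json.loads(SETTING)["links_regex"])

def first_link(text):
    """Return the first link found in text, or None (hypothetical helper)."""
    match = LINKS.search(text)
    return match.group(0) if match else None

print(first_link("See (https://example.com/?q=1)."))
print(first_link("Is it https://example.com/?a=b?"))
print(first_link("https://example.com/?180.200.208.36"))
```

In this sketch the closing parenthesis, the dot and the final question mark stay outside the link. Note that the character class excludes these chars anywhere in the query string, not only at its end, so the third sample stops at the first dot.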

Alexey-T commented 3 months ago

Applied your fix, thanks. About “”’‘: they are not ASCII, so they are hard to include in ASCII Pascal code. Or maybe not. For now I have left them out.

pintassilgo commented 3 months ago

Thanks. You can close this whenever you want, but I believe you also intend to update default.json to reflect the change you made.

Alexey-T commented 3 months ago

Updated default.json too. Closing.

pintassilgo commented 2 months ago

I just noticed a new issue on this topic (I suggest reopening).

Sometimes a URL contains numbers separated by dots, usually representing an IP address. Cuda is breaking those links at the dot. The same applies to the comma and other chars, but . and , are the most affected.

The code should be improved to only stop the link when these chars are followed by \s.
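One way to express that rule is a lazy match with a lookahead. The sketch below is a deliberately simplified, hypothetical pattern, not CudaText's actual links_regex; the set of trailing chars is an assumption collected from this thread, and Python's `re` stands in for the editor's engine.

```python
import re

# Hypothetical, simplified pattern: NOT CudaText's actual links_regex.
# Chars that end a link only when they trail it (assumed set, from this thread).
TRAILING = r"""[)\]>'"`,.;:!?]"""

# Lazily extend the URL until what remains is a run of trailing chars
# followed by whitespace or the end of the text.
URL = re.compile(r"\bhttps?://\S+?(?=" + TRAILING + r"*(?:\s|$))")

def find_links(text):
    """Return every link in text, without trailing punctuation."""
    return URL.findall(text)

for sample in (
    "https://example.com/?180.200.208.36",
    "Visit https://example.com/?15,50, then continue",
    "Is it https://example.com/?a;a?",
):
    print(find_links(sample))
```

With this approach a dot, comma or semicolon inside the query string stays part of the link, and the same char is dropped only when whitespace or the end of the text follows it.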

Try pasting this in Cuda:

https://example.com/?180.200.208.36
https://example.com/?15,50
https://example.com/?a)a
https://example.com/?a]a
https://example.com/?a>a
https://example.com/?a'a
https://example.com/?a"a
https://example.com/?a`a
https://example.com/?a;a

Then you can think about where the link should break for each one.

Current results: image

The cases with ., , and ; surely need to be fixed. I'm not sure about the others.

Let's also see how Markdown behaves:

https://example.com/?180.200.208.36
https://example.com/?15,50
https://example.com/?a)a
https://example.com/?a]a
https://example.com/?a>a
https://example.com/?a'a
https://example.com/?a"a
https://example.com/?a`a
https://example.com/?a;a

So markdown follows all links to the end...

Edit: back to the initial report, I believe : should also be removed from the URL when it appears at the end.

Alexey-T commented 2 months ago

Made the fix. Is it better now? http://uvviewsoft.com/c/

pintassilgo commented 2 months ago

Yes, that fixed the cases from my previous comment and also excluded almost all the remaining chars from the initial report, making the behavior more similar to Sublime's. Thanks.

- and + should be included in the URL when they are the last char, what do you think? They are commonly used in some encodings. With that fixed, I guess we're done.

Should link the entire line: https://example.com/?ok+ Also: https://example.com/?ok-

Alexey-T commented 2 months ago

Fixed more, for plus/minus chars.

pintassilgo commented 1 month ago

More: http://example.com/A&E

This is a complete link in Markdown, VSCode, Sublime... but not in Cuda, where the link currently ends before &.

Edit:

More:

Full link in all three (Markdown, VSCode and Sublime):

http://example.com/A*e
http://example.com/A=e
http://example.com/A{e
http://example.com/A[e
http://example.com/A$e
http://example.com/A(e
http://example.com/A|e
http://example.com/A;e
http://example.com/A,e

Full link in Markdown and Sublime, but not in VSCode. At least }, ] and ) should be fixed, because {, [ and ( are parsed:

http://example.com/A}e
http://example.com/A]e
http://example.com/A)e
http://example.com/A'e
http://example.com/A"e
http://example.com/A`e
http://example.com/A"e

Alexey-T commented 1 month ago

Fixed, thanks.

pintassilgo commented 1 month ago

The last fix broke parsing of links in Markdown format. Example:

[kdotool](https://github.com/jinliu/kdotool/releases/latest/).

The link should end at the last /, but Cuda is also including the trailing ). in it.

Closing chars such as ]})"'" and also ` should not be included in the link when the following char is a word delimiter such as a space, dot, comma or line break (.,;:\n).

Edit: other example of this issue:

httpChannel.setRequestHeader('Referer', 'https://www.google.com.br/', false);

In Cuda, the link includes ', instead of ending at /.
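The two examples above can also be handled in two steps instead of one big regex: match a candidate greedily, then trim closing chars and delimiters from its right edge. This is a minimal sketch under that assumption, not the fix that went into CudaText; `CANDIDATE`, `TRIM` and `links` are made-up names, and the char sets come from this comment.

```python
import re

# Step 1: take everything up to the next whitespace as a candidate link.
CANDIDATE = re.compile(r"\bhttps?://\S+")

# Step 2: closing chars and word delimiters to drop from the right edge
# (assumed sets, taken from the comment above).
TRIM = "]})\"'`" + ".,;:"

def links(text):
    """Return links with trailing closers and delimiters removed."""
    return [m.group(0).rstrip(TRIM) for m in CANDIDATE.finditer(text)]

print(links("[kdotool](https://github.com/jinliu/kdotool/releases/latest/)."))
print(links("httpChannel.setRequestHeader('Referer', 'https://www.google.com.br/', false);"))
print(links("http://example.com/A)e"))
```

Because only the right edge is trimmed, a closing char in the middle of the URL, as in the last sample, stays part of the link. A URL that legitimately ends with a closing parenthesis would lose it, so a real fix would need extra care there.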

Alexey-T commented 1 month ago

Thanks for the notice, I will see how to fix the regex.

Alexey-T commented 1 month ago

Fixed. Will change it in default.json soon.