damus-io / damus

iOS nostr client
GNU General Public License v3.0
1.95k stars 291 forks source link

Hashtag contains linked punctuation #1518

Closed dmnyc closed 8 months ago

dmnyc commented 9 months ago

At the end of the hashtag I included a … character which was not intended to be linked to the tag but Damus included it anyway.

note ID: note1a8y42nfmeyz4gnvmjxta8k2l8dzl4sa9y6m726aa976ludlzq56sjde5dg

image

alltheseas commented 9 months ago

Did you add a space between your intended hashtag and the subsequent characters?

dmnyc commented 9 months ago

No, but hashtags should never include punctuation. I put other punctuation at the end like periods and exclamation marks and they don’t cause this issue.

On Mon, Aug 28, 2023 at 12:43 AM alltheseas @.***> wrote:

Did you add a space between your intended hashtag and the subsequent characters?

— Reply to this email directly, view it on GitHub https://github.com/damus-io/damus/issues/1518#issuecomment-1694998409, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3YJQYEJER5MPIOVOXAFZDXXQOYPANCNFSM6AAAAAA377JYEM . You are receiving this because you authored the thread.Message ID: @.***>

alltheseas commented 9 months ago

I put other punctuation at the end like periods and exclamation marks and they don’t cause this issue.

On Damus? On Twitter?

dmnyc commented 9 months ago

Either one, but here’s an example.

The tag #zapathon is followed by a period but it’s not linked.

note1gcv2lskpdf4umthlfrjsqdtplvqmv8mpt3lzwaku405lcz9pnuysyf8sjz

On Mon, Aug 28, 2023 at 9:22 AM alltheseas @.***> wrote:

I put other punctuation at the end like periods and exclamation marks and they don’t cause this issue.

On Damus? On Twitter?

— Reply to this email directly, view it on GitHub https://github.com/damus-io/damus/issues/1518#issuecomment-1695691968, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3YJQ6XRAKD2GZTVCGQCFTXXSLPNANCNFSM6AAAAAA377JYEM . You are receiving this because you authored the thread.Message ID: @.***>

alltheseas commented 9 months ago

Thanks, that helps.

what happens

Hash tag with one period post hashtag is not added to hashtag.

Hash tag with two periods post hashtag is not added to hashtag.

Hash tag with three periods post hashtag is added to hashtag.

Hash tag with four periods post hashtag: three are added to hashtag, one is not.

image

image

Cannot replicate with commas

image

dmnyc commented 9 months ago

Three periods as an ellipsis character … is what I believe caused the issue.

dmnyc commented 9 months ago

note12u7dx9gcm3tapau7ljqk7ymkc2lz7zkh0u3w85ndsed7s3kpl8ys943vz7

jonmarrs commented 9 months ago

Here are some examples of how hashtags are displayed with the patch I am proposing (https://github.com/jonmarrs/damus/commit/336215a78f51cfbb985d1cd86d4c5b52e3da1619).

image image image image

Let me know if you would like me to run any other tests.

jonmarrs commented 9 months ago

note12u7dx9gcm3tapau7ljqk7ymkc2lz7zkh0u3w85ndsed7s3kpl8ys943vz7

Here is how this note looks with my proposed code change:

image
dmnyc commented 9 months ago

Yes, that’s the expected behavior.

alltheseas commented 9 months ago

Here are some examples of how hashtags are displayed with the patch I am proposing (jonmarrs@336215a).

image image image image Let me know if you would like me to run any other tests.

Thanks @jonmarrs

Have you reviewed the contributing guidelines?

dmnyc commented 9 months ago

Looks perfect.

jonmarrs commented 9 months ago

Here are some examples of how hashtags are displayed with the patch I am proposing (jonmarrs@336215a). image image image image Let me know if you would like me to run any other tests.

Thanks @jonmarrs

Have you reviewed the contributing guidelines?

Thanks @alltheseas

I read the contributing guidelines, and submitted the patch to patches@damus.io using git send-email.

Now I'm working on running some test cases.

alltheseas commented 9 months ago

@jonmarrs if you join the dev chat, it can be helpful sometimes to ping questions with the other devs

https://simplex.chat/contact#/?v=1-4&smp=smp%3A%2F%2FenEkec4hlR3UtKx2NMpOUK_K4ZuDxjWBO1d9Y4YXVaA%3D%40smp14.simplex.im%2Fi_G-OJedYDl0fRlQbYjG0SUAn0cLmvzq%23%2F%3Fv%3D1-2%26dh%3DMCowBQYDK2VuAyEAxVsOa6uphThfVq0u0uYRqNFZkv2W2UhoPR3lHnu0gAo%253D%26srv%3Daspkyu2sopsnizbyfabtsicikr2s4r3ti35jogbcekhm3fsoeyjvgrid.onion&data=%7B%22type%22%3A%22group%22%2C%22groupLinkId%22%3A%223imMonp-dw_d0-QIyXbRnQ%3D%3D%22%7D

jonmarrs commented 9 months ago

Three periods as an ellipsis character … is what I believe caused the issue.

Just wanted to give a quick update. I've been looking at how "..." is encoded in Damus, and it is indeed being encoded as an ellipsis rather than three separate periods. Apparently the ellipsis is encoded in UTF-8 as 0xE2 0x80 0xA6 (https://www.compart.com/en/unicode/U+2026). So those three hex values (0xE2 0x80 0xA6) that encode the ellipsis are not being detected as punctuation.

image

Perhaps we should check for an ellipsis when parsing a hashtag.

jb55 commented 9 months ago

On Thu, Aug 31, 2023 at 04:20:30PM -0700, Jon Marrs wrote:

Three periods as an ellipsis character … is what I believe caused the issue.

Just wanted to give a quick update. I've been looking at how "..." is encoded in Damus, and it is indeed being encoded as an ellipsis rather than three separate periods. Apparently the ellipsis is encoded in UTF-8 as 0xE2 0x80 0xA6 (https://www.compart.com/en/unicode/U+2026). So those three hex values (0xE2 0x80 0xA6) that encode the ellipsis are not being detected as punctuation.

image

Perhaps we should check for an ellipsis when parsing a hashtag.

yeah we might need to be smarter about checking punctuation, since utf-8 chars are allowed in hashtags and we definitely want to keep that.

jonmarrs commented 9 months ago

We could try to detect this block of unicode punctuation (https://www.compart.com/en/unicode/block/U+2000), which contains the ellipsis and other punctuation marks. All these unicode punctuation marks start with 0xE2, and the middle hex value is either 0x80 or 0x81.

jonmarrs commented 9 months ago

Here is my proposed solution for selectively filtering UTF-8 punctuation: https://github.com/damus-io/damus/compare/master...jonmarrs:damus:2023-08-hashtag-linked-punctuation

jonmarrs commented 9 months ago

My new code passes the hashtag test cases. UTF-8 characters are included in hashtags, but UTF-8 punctuation is removed.

image
dmnyc commented 9 months ago

Out of curiosity, is there any reason to include non-alphanumeric characters in hashtags?

jonmarrs commented 9 months ago

We probably want to enable the broadest use of hashtags as possible, in as many languages as possible. Filtering out the few things we don't want in hashtags will be a more efficient strategy than checking for the many things we do want in hashtags. We want to support hashtags with alphanumeric characters in both ASCII and UTF-8. So I think we should only try to detect ASCII punctuation, UTF-8 punctuation, and whitespace. Basically everything else will be included. Although there is a slight inconsistency in handling currency symbols.

image
jonmarrs commented 9 months ago

Looks like Twitter / X also leaves out currency symbols.

image
jonmarrs commented 9 months ago

Should we filter out currency symbols from hashtags as well?

jonmarrs commented 9 months ago

It seems like the overall strategy should be to decide which UTF-8 categories/blocks we want to filter out from hashtags. Here is a list of unicode blocks: https://www.compart.com/en/unicode/block

Right now I am filtering out General Punctuation. I could also filter out Currency Symbols. Anything else I should filter out?

image
jonmarrs commented 9 months ago

Here are some more test results.

Twitter: https://twitter.com/Martian_BTC/status/1697676619530506353?s=20 image

Damus (iPhone app): note1l6x97pcynhvrfzpfg90pnkxa9tl3dhtwek3dgdray2pqmhr6x3kqadea4z image

My patch as it is now:

image

Clearly there are some inconsistencies. As far as I know, there is no ISO standard for hashtags. Should the goal be for Damus to replicate how Twitter/X handles hashtags as closely as possible?

jb55 commented 9 months ago

Clearly there are some inconsistencies. As far as I know, there is no ISO standard for hashtags. Should the goal be for Damus to replicate how Twitter/X handles hashtags as closely as possible?

probably, theirs works quite well.

jonmarrs commented 8 months ago

Please review my pull request (https://github.com/damus-io/damus/pull/1546), which I believe closes this issue.