Closed dmnyc closed 8 months ago
Did you add a space between your intended hashtag and the subsequent characters?
No, but hashtags should never include punctuation. I put other punctuation at the end like periods and exclamation marks and they don’t cause this issue.
On Mon, Aug 28, 2023 at 12:43 AM alltheseas @.***> wrote:
Did you add a space between your intended hashtag and the subsequent characters?
— Reply to this email directly, view it on GitHub https://github.com/damus-io/damus/issues/1518#issuecomment-1694998409, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3YJQYEJER5MPIOVOXAFZDXXQOYPANCNFSM6AAAAAA377JYEM . You are receiving this because you authored the thread.Message ID: @.***>
I put other punctuation at the end like periods and exclamation marks and they don’t cause this issue.
On Damus? On Twitter?
Either one, but here’s an example.
The tag #zapathon is followed by a period but it’s not linked.
note1gcv2lskpdf4umthlfrjsqdtplvqmv8mpt3lzwaku405lcz9pnuysyf8sjz
On Mon, Aug 28, 2023 at 9:22 AM alltheseas @.***> wrote:
I put other punctuation at the end like periods and exclamation marks and they don’t cause this issue.
On Damus? On Twitter?
— Reply to this email directly, view it on GitHub https://github.com/damus-io/damus/issues/1518#issuecomment-1695691968, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3YJQ6XRAKD2GZTVCGQCFTXXSLPNANCNFSM6AAAAAA377JYEM . You are receiving this because you authored the thread.Message ID: @.***>
Thanks, that helps.
what happens
Hash tag with one period post hashtag is not added to hashtag.
Hash tag with two periods post hashtag is not added to hashtag.
Hash tag with three periods post hashtag is added to hashtag.
Hash tag with four periods post hashtag: three are added to hashtag, one is not.
Cannot replicate with commas
Three periods as an ellipsis character … is what I believe caused the issue.
note12u7dx9gcm3tapau7ljqk7ymkc2lz7zkh0u3w85ndsed7s3kpl8ys943vz7
Here are some examples of how hashtags are displayed with the patch I am proposing (https://github.com/jonmarrs/damus/commit/336215a78f51cfbb985d1cd86d4c5b52e3da1619).
Let me know if you would like me to run any other tests.
note12u7dx9gcm3tapau7ljqk7ymkc2lz7zkh0u3w85ndsed7s3kpl8ys943vz7
Here is how this note looks with my proposed code change:
Yes, that’s the expected behavior.
Here are some examples of how hashtags are displayed with the patch I am proposing (jonmarrs@336215a).
Let me know if you would like me to run any other tests.
Thanks @jonmarrs
Have you reviewed the contributing guidelines?
Looks perfect.
Here are some examples of how hashtags are displayed with the patch I am proposing (jonmarrs@336215a). Let me know if you would like me to run any other tests.
Thanks @jonmarrs
Have you reviewed the contributing guidelines?
Thanks @alltheseas
I read the contributing guidelines, and submitted the patch to patches@damus.io using git send-email.
Now I'm working on running some test cases.
@jonmarrs if you join the dev chat, it can be helpful sometimes to ping questions with the other devs
Three periods as an ellipsis character … is what I believe caused the issue.
Just wanted to give a quick update. I've been looking at how "..." is encoded in Damus, and it is indeed being encoded as an ellipsis rather than three separate periods. Apparently the ellipsis is encoded in UTF-8 as 0xE2 0x80 0xA6 (https://www.compart.com/en/unicode/U+2026). So those three hex values (0xE2 0x80 0xA6) that encode the ellipsis are not being detected as punctuation.
Perhaps we should check for an ellipsis when parsing a hashtag.
On Thu, Aug 31, 2023 at 04:20:30PM -0700, Jon Marrs wrote:
Three periods as an ellipsis character … is what I believe caused the issue.
Just wanted to give a quick update. I've been looking at how "..." is encoded in Damus, and it is indeed being encoded as an ellipsis rather than three separate periods. Apparently the ellipsis is encoded in UTF-8 as 0xE2 0x80 0xA6 (https://www.compart.com/en/unicode/U+2026). So those three hex values (0xE2 0x80 0xA6) that encode the ellipsis are not being detected as punctuation.
Perhaps we should check for an ellipsis when parsing a hashtag.
yeah we might need to be smarter about checking punctuation, since utf-8 chars are allowed in hashtags and we definitely want to keep that.
We could try to detect this block of unicode punctuation (https://www.compart.com/en/unicode/block/U+2000), which contains the ellipsis and other punctuation marks. All these unicode punctuation marks start with 0xE2, and the middle hex value is either 0x80 or 0x81.
Here is my proposed solution for selectively filtering UTF-8 punctuation: https://github.com/damus-io/damus/compare/master...jonmarrs:damus:2023-08-hashtag-linked-punctuation
My new code passes the hashtag test cases. UTF-8 characters are included in hashtags, but UTF-8 punctuation is removed.
Out of curiosity, is there any reason to include non-alphanumeric characters in hashtags?
We probably want to enable the broadest use of hashtags as possible, in as many languages as possible. Filtering out the few things we don't want in hashtags will be a more efficient strategy than checking for the many things we do want in hashtags. We want to support hashtags with alphanumeric characters in both ASCII and UTF-8. So I think we should only try to detect ASCII punctuation, UTF-8 punctuation, and whitespace. Basically everything else will be included. Although there is a slight inconsistency in handling currency symbols.
Looks like Twitter / X also leaves out currency symbols.
Should we filter out currency symbols from hashtags as well?
It seems like the overall strategy should be to decide which UTF-8 categories/blocks we want to filter out from hashtags. Here is a list of unicode blocks: https://www.compart.com/en/unicode/block
Right now I am filtering out General Punctuation. I could also filter out Currency Symbols. Anything else I should filter out?
Here are some more test results.
Twitter: https://twitter.com/Martian_BTC/status/1697676619530506353?s=20
Damus (iPhone app): note1l6x97pcynhvrfzpfg90pnkxa9tl3dhtwek3dgdray2pqmhr6x3kqadea4z
My patch as it is now:
Clearly there are some inconsistencies. As far as I know, there is no ISO standard for hashtags. Should the goal be for Damus to replicate how Twitter/X handles hashtags as closely as possible?
Clearly there are some inconsistencies. As far as I know, there is no ISO standard for hashtags. Should the goal be for Damus to replicate how Twitter/X handles hashtags as closely as possible?
probably, theirs works quite well.
Please review my pull request (https://github.com/damus-io/damus/pull/1546), which I believe closes this issue.
At the end of the hashtag I included a … character which was not intended to be linked to the tag but Damus included it anyway.
note ID: note1a8y42nfmeyz4gnvmjxta8k2l8dzl4sa9y6m726aa976ludlzq56sjde5dg