[Bug]: Maximum characters count mismatch between Tuba and the Mastodon web UI

nekohayo commented 1 year ago

Describe the bug

Tuba seems to calculate the maximum allowable toot length differently from what mastodon.social's web interface actually accepts.

Steps To Reproduce

Try writing these in the mastodon.social web interface and in Tuba:

Case 1:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Vitae turpis massa sed elementum tempus egestas. Pharetra massa massa ultricies mi. Turpis massa sed elementum tempus egestas. Et pharetra pharetra massa massa ultricies. Consectetur purus ut faucibus pulvinar elementum integer enim neque volutpat. Orci porta non pulvinar neque laoreet suspendisse interdum. https://example.com Quisque sagittis purus sit amet volutpat co

Result: mastodon.social says 0 characters left, Tuba says 4 characters left. The URL is counted differently.

Case 2: this toot. Mastodon's web interface for that instance happily allowed me to edit/post it at that length, but if you paste it into Tuba, Tuba thinks it breaches the limit (it shows -52 characters) and prevents you from posting it.

The result of this discrepancy is that I've been unknowingly forced to abbreviate my toots more than I need to (and sometimes that meant losing important content/context), or that I still need to use the web interface to post things.

Logs and/or Screenshots

No response

Instance Backend

Mastodon

Operating System

Fedora 38

Package

Flatpak

Troubleshooting information

os: GNOME 44 (Flatpak runtime)
prefix: /app
flatpak: true
version: 0.4.1 (production)
gtk: 4.10.4 (4.10.4)
libadwaita: 1.3.3 (1.3.3)
libsoup: 3.4.2 (3.4.2)
libgtksourceview: 5.8.0 (5.8.0)

Additional Context

No response

GeopJr commented 1 year ago

It's the links, as you mentioned: on Mastodon every link counts as a fixed length (23 characters on mastodon.social). Not sure what the best way to handle it is. Running a regex against the content on every keypress?

edit: I'll have to see whether it's worth it, performance-wise, to check that the content contains / and/or . before running the regex
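A minimal sketch of the idea above, written in Python purely for illustration (it is not Tuba's code). The pattern `URL_RE` is a deliberately loose assumption and not Mastodon's TLD-aware regex; `counted_length` and `URL_LENGTH` are hypothetical names. The cheap pre-check skips the regex when the text has no `/` at all.

```python
import re

# Assumption: a loose stand-in pattern, NOT Mastodon's TLD-aware URL regex.
URL_RE = re.compile(r"https?://\S+")

# Fixed per-link length reported for mastodon.social in this thread.
URL_LENGTH = 23


def counted_length(text: str, url_length: int = URL_LENGTH) -> int:
    """Return the length of text with each matched URL counted as url_length."""
    # Cheap pre-check: the pattern needs "://", so without any "/" there is
    # nothing to match and the regex can be skipped.
    if "/" not in text:
        return len(text)
    # Swap each URL for a placeholder of the fixed length, then count.
    # Note: len() counts code points here, not graphemes.
    return len(URL_RE.sub("x" * url_length, text))
```

Under these assumptions, `https://example.com` is 19 characters long but counts as 23, which would account for the 4-character gap reported in Case 1.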

GeopJr commented 11 months ago

Hmm, this is messier than expected... Mastodon apparently also ignores the domain part of mentions: @mastodon@mastodon.social counts as @mastodon. This needs more investigation on other backends.

MENTION REGEX: https://github.com/mastodon/mastodon/blob/774e1189d26fffd914107a4236f6287043c988f8/app/services/account_search_service.rb#L6

URL REGEX: https://github.com/mastodon/mastodon/blob/774e1189d26fffd914107a4236f6287043c988f8/app/services/fetch_link_card_service.rb#L7-L15
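A rough sketch of the mention behavior described above, in Python for illustration only. `MENTION_RE` is a simplified assumption and not the Mastodon regex linked above, and `strip_mention_domains` is a hypothetical helper; other backends may count mentions differently.

```python
import re

# Assumption: simplified pattern for @user or @user@domain. The lookbehind
# avoids matching the "@" inside things like email addresses.
MENTION_RE = re.compile(r"(?<![\w@/])@(\w+)(?:@[\w.-]+\w)?")


def strip_mention_domains(text: str) -> str:
    """Drop the @domain part of each mention so only @user is counted."""
    return MENTION_RE.sub(r"@\1", text)
```

With this sketch, the length of `strip_mention_domains(text)` would be used for counting in place of the length of `text`.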

GeopJr commented 10 months ago

on further research, every client just does its own thing at this point...

Mastodon uses a very specialized regex provided by an npm package that was maintained by Twitter. That regex includes every valid TLD: https://google.com will match but https://google.foobar won't. Mastodon counts a matched URL as the configured number of characters even when the URL is shorter (https://a.com/ is counted as 23).

Elk mistook (?) another regex for the URL one, so it counts any string that starts with https or xmpp (Mastodon doesn't shorten xmpp URIs). It also only uses the configured value (23) if the URL is longer than that (https://a.com counts as 13).

I don't think we can have a regex as specialized as Mastodon's, but we can at least have one close to it.
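To make the difference between the two policies concrete, here is a small Python sketch based only on the behavior described in this comment, not on either codebase. The function names are hypothetical.

```python
URL_LENGTH = 23  # configured per-link length mentioned in this thread


def mastodon_style(url: str, fixed: int = URL_LENGTH) -> int:
    """Every matched URL counts as the fixed length, even a shorter one."""
    return fixed


def elk_style(url: str, fixed: int = URL_LENGTH) -> int:
    """Only URLs longer than the fixed length are capped to it."""
    return min(len(url), fixed)
```

For `https://a.com`, the first policy gives 23 and the second gives 13, so the same toot can show different remaining counts in different clients.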

GeopJr commented 6 months ago

(I forgot to do the mentions)

GeopJr commented 6 months ago

I'll mention it here for future reference:

There are 2 stages (3 when I do mentions):

1. Replace the URLs in the content with the instance-set number of characters they should count as (test https://gnome.org/ => test XXXXXXXX)
2. Count the total characters using ICU

URL parsing is done with GLib.Uri and NOT the Mastodon regex. The regex is huge and needs constant maintenance because, as noted above, it contains every TLD in existence. That means Tuba may not match Mastodon's behavior 1:1, but it will match the spec.

ICU is needed for counting graphemes. "🏳️‍⚧️".length == 6 but using ICU it will count as 1 character
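The two stages can be sketched roughly as follows. This is Python for illustration, not Tuba's implementation: `urllib.parse.urlsplit` stands in for GLib.Uri, the names `is_url`, `replace_urls`, and `utf16_units` are hypothetical, and 23 is the mastodon.social value mentioned earlier. Python's standard library has no grapheme segmentation, so the second half only shows why a naive length is not enough.

```python
import re
from urllib.parse import urlsplit

URL_LENGTH = 23  # assumed instance setting


def is_url(token: str) -> bool:
    """Treat a token as a URL if a generic parser finds an http(s) scheme and a host."""
    try:
        parts = urlsplit(token)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def replace_urls(text: str, url_length: int = URL_LENGTH) -> str:
    """Stage 1: swap each URL token for a placeholder of the fixed length."""
    # The capturing group keeps whitespace runs in the output unchanged.
    tokens = re.split(r"(\s+)", text)
    return "".join("X" * url_length if is_url(t) else t for t in tokens)


def utf16_units(text: str) -> int:
    """Number of UTF-16 code units, which is what a JavaScript-style .length reports."""
    return len(text.encode("utf-16-le")) // 2


# Transgender flag emoji: one visible character built from five code points.
flag = "\U0001F3F3\uFE0F\u200D\u26A7\uFE0F"
print(utf16_units(flag), len(flag))  # code units vs. code points; neither is 1
```

Stage 2 would then count graphemes in the output of `replace_urls` with a segmentation library such as ICU, since neither count above matches what the user sees as a single character.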

TODO:

mentions
use ICU in other fields like alt text

(this is so cursed)

nekohayo commented 6 months ago

Related, but apparently not the exact same cause: #817

nekohayo commented 2 months ago

Does this need to be targeted to 0.8.x?

GeopJr / Tuba