baedert / corebird

Native Gtk+ Twitter Client
https://corebird.baedert.org
GNU General Public License v3.0

New tweet parsing doesn't handle ccTLDs #774

Closed IBBoard closed 6 years ago

IBBoard commented 6 years ago

I'm running Git master from Saturday with the new tweet parsing, and I've noticed that it only handles a small subset of TLDs. Unless the URL uses a generic TLD (com/net/org), one of the old vanity ones (aero/name/travel), or one of a couple of country-code TLDs abused for vanity purposes (io/ly), it isn't recognised as a link.

Example: Compose a new tweet and type:

http://www.bbc.co.uk/weather/2643743

Corebird says you have 104 characters left. Change co.uk to com:

http://www.bbc.com/weather/2643743

Corebird says you have 117 characters left (because URLs take 23).
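The two numbers above can be reproduced with a rough sketch of the counting. This is not Corebird's actual code: the 140-character limit is inferred from the figures in this report (104 + 36 and 117 + 23), and `KNOWN_TLDS`, `URL_RE` and `remaining` are hypothetical names with a deliberately tiny TLD list and a simplified URL pattern.

```python
import re

TWEET_LIMIT = 140   # assumption: inferred from the numbers in this report
URL_WEIGHT = 23     # fixed cost of a recognised link, as stated above

# Hypothetical stand-in for the short TLD list described in this report.
KNOWN_TLDS = {"com", "net", "org", "aero", "name", "travel", "io", "ly"}

# Simplified pattern: protocol, host (captured), then the rest of the URL.
URL_RE = re.compile(r"https?://([^/\s]+)\S*")


def remaining(text, known_tlds=KNOWN_TLDS):
    """Return the characters left, charging URL_WEIGHT per recognised link."""
    used = 0
    pos = 0
    for match in URL_RE.finditer(text):
        host = match.group(1).split(":")[0]      # drop any port
        tld = host.rsplit(".", 1)[-1].lower()    # last label of the host
        used += match.start() - pos              # plain text before the URL
        if tld in known_tlds:
            used += URL_WEIGHT                   # recognised: fixed cost
        else:
            used += len(match.group(0))          # not recognised: full length
        pos = match.end()
    used += len(text) - pos                      # plain text after the last URL
    return TWEET_LIMIT - used


print(remaining("http://www.bbc.co.uk/weather/2643743"))  # 104
print(remaining("http://www.bbc.com/weather/2643743"))    # 117
```

The .co.uk URL is 36 characters long and is charged in full because "uk" is not in the list, while the .com URL is charged the flat 23.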

The problem appears to be the arbitrarily short TLD list combined with mandatory TLD checking, which is applied even after we've already determined that the text is almost certainly a link because it starts with https?://.

If you're going to check TLDs then there's a really long list that needs adding (and maintaining), but I'd really advise against it!

IBBoard commented 6 years ago

For reference, it seems that Twitter does do exhaustive checking, because https://example.foo is valid (Google own it) but https://example.fibble isn't recognised. However, that seems like something we should either leave to an existing, well-maintained and widely used library, or just be lax about and let Twitter reject the link if it disagrees.

baedert commented 6 years ago

That's the exact same list I've been using in corebird-internal code before, it's not a new problem.

IBBoard commented 6 years ago

Interesting. Something changed, because it's definitely not counting UK domains as links. That's not a problem in my examples, but is a problem when you've typed 100 characters and your link (with path) is another 100 characters on a .co.uk domain!

Is it that the list was only used for domains without a protocol before, and now it's getting checked when there is a protocol as well?

(Aside: Twitter treats a plain bbc.co.uk as a link, and linkifies bbc.co but not the .u when you're part way through typing it)

baedert commented 6 years ago

Twitter uses different TLD lists for links with and without protocol. "foo.de" won't be a link while "http://foo.de" will. The old corebird code simply ignored the TLD when there was a protocol if I remember correctly.

Edit: The difference is apparently just that Twitter checks whether a domain is registered, and foo.de isn't.
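For illustration, the old Corebird behaviour as recalled in this comment (ignore the TLD when a protocol is present, consult the list only for bare domains) could be sketched roughly like this. It is not taken from the Corebird source, it does not model Twitter's registration check, and `KNOWN_TLDS`, the two patterns and `is_link` are hypothetical, simplified names.

```python
import re

# Hypothetical stand-in for the short TLD list discussed in this thread.
KNOWN_TLDS = {"com", "net", "org", "aero", "name", "travel", "io", "ly"}

# With a protocol: accept any dotted host, whatever the TLD is.
WITH_PROTOCOL = re.compile(
    r"^https?://[^\s/.]+(\.[^\s/.]+)+(/\S*)?$", re.IGNORECASE
)

# Without a protocol: capture the last label so it can be checked.
BARE_DOMAIN = re.compile(
    r"^[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,})(/\S*)?$", re.IGNORECASE
)


def is_link(token, known_tlds=KNOWN_TLDS):
    """Decide whether a single whitespace-free token counts as a link."""
    if WITH_PROTOCOL.match(token):
        return True                      # protocol present: skip the TLD check
    match = BARE_DOMAIN.match(token)
    if match is None:
        return False
    return match.group(2).lower() in known_tlds   # bare domain: TLD must be listed


print(is_link("http://foo.de"))   # True
print(is_link("foo.de"))          # False
print(is_link("http://www.bbc.co.uk/weather/2643743"))  # True
```

Under this rule the .co.uk URL from the original report would count as a link, because the protocol alone is enough.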

baedert commented 6 years ago

Okay, I don't really get the rules here anymore. This tweet:

foobar.co
foobar.uk
foobar.de
foobar.org

makes only the first and the last URL a link. So obviously you don't need a protocol for a ccTLD (since .co is one). BUT "web.de" does not become a link either, even though it's using a ccTLD and is clearly registered.

IBBoard commented 6 years ago

My guess is that they've got a list like Corebird has, but it's based on domain popularity. Plain ".uk" domains are recent and barely taken up. Foobar.co.uk is treated as a link. ".co" is going to be linkified because of t.co.

IBBoard commented 6 years ago

Moved to the repo for the library.