Hey, I'd like to send a PR. Do you think linkify-react would be a great solution for this?
Hi @weii41392 thanks for taking a look at the issue!
The currently implemented linkify also shortens URLs in a way that still lets the user copy-paste the full URL by selecting the text. Introducing a whole library and trying to integrate it while retaining the original behavior may be quite challenging.
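(As an aside, here is a minimal illustration of how display-only shortening can keep the full URL selectable. This is a sketch assuming CSS truncation, not necessarily how Cofacts' `shortenUrl` actually works:)

```jsx
// Illustration only -- not the actual Cofacts implementation.
// The full URL stays in the DOM (and thus in any text selection),
// while CSS merely truncates what is displayed.
const ShortenedLink = ({ href }) => (
  <a
    href={href}
    style={{
      display: 'inline-block',
      maxWidth: '16em',
      overflow: 'hidden',
      textOverflow: 'ellipsis',
      whiteSpace: 'nowrap',
      verticalAlign: 'bottom',
    }}
  >
    {href}
  </a>
);
```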
On the other hand, simply fixing the regexp used in the current implementation may be a more straightforward solution -- we can focus on how to correctly split the URLs rather than worry about integrating anything.
I understand that introducing a new library for a single function would be a pain, but when I was searching for URL regex patterns, I found that each of them makes sacrifices in some way.
I find linkify a good choice, as it was built exactly for linkifying plain text and is implemented with a parser, which I believe can take context into account better than a regex.
P.S. At the time I was writing this comment, I found that linkify also fails to handle fullwidth parentheses, so I guess we can either choose some regex from the above list or implement some parser for this.
Thanks for the investigation! The URL regex patterns look cool.
From the original request, we should not include fullwidth `）` characters in the URL: https://www.facebook.com/groups/cofacts/posts/3648747782023691/
When fact checkers copy-paste URLs from the browser URL bar, the URLs should always be URL-encoded (with `%XX`). However, if the user copy-pastes text from another reply on Cofacts, it may contain the decoded URL, since we display all URLs in decoded form (it makes URLs more human-readable). Dropping all fullwidth characters may break URLs with decoded fullwidth characters inside.
Personally, I think that when users provide URLs, they may do things like "請參考http://google.com。" or 「(請參考http://google.com)」 ("please refer to http://google.com", wrapped in CJK punctuation). Therefore, maybe we can cut URLs short when encountering punctuation (both halfwidth and fullwidth), as in the sketch below.
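A minimal sketch of that idea (the punctuation list below is an illustrative assumption, not exhaustive):

```js
// Sketch: stop the URL match before common punctuation,
// both halfwidth and fullwidth. The punctuation list is illustrative.
const URL_PATTERN = /https?:\/\/[^\s()（）。，、！？「」]+/g;

console.log('請參考http://google.com。'.match(URL_PATTERN));
// -> [ 'http://google.com' ]
console.log('(請參考http://google.com)'.match(URL_PATTERN));
// -> [ 'http://google.com' ]
```

The obvious trade-off: this also truncates legitimate URLs that contain halfwidth parentheses, like the Wikipedia-style URL tested below.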
Suddenly I recall that we are extracting hyperlinks in rumors-api as well, because we scrape the URL content when new messages and replies are created. It seems that url-regex is used. Would you test whether it satisfies the need?
url-regex seems to be one of the regex patterns in the above list.
Below I test both url-regex and linkify in the presence of halfwidth and fullwidth parentheses. In a nutshell, url-regex handles neither halfwidth nor fullwidth parentheses well, while linkify handles halfwidth parentheses but not fullwidth ones.
I guess this is because the halfwidth parenthesis is defined as a non-accepting symbol here, while the fullwidth parenthesis isn't. Note that the same issue may also happen with braces (`{}`), brackets (`[]`), and angle brackets (`<>`).
I would lean towards using linkify and sanitizing the results (say, removing trailing closing symbols; see the sketch after the demo below).
```js
import urlRegex from 'url-regex';
import { tokenize } from 'linkifyjs';

const links = [
  'http://foo.com/blah_blah',
  'http://foo.com/blah_blah_(wikipedia)_(again)',
];

const texts = [
  `${links[0]} ${links[1]}`,                   // bare URLs
  `網址1(${links[0]}) 網址2(${links[1]})`,      // halfwidth parentheses
  `網址1（${links[0]}） 網址2（${links[1]}）`,   // fullwidth parentheses
];

for (const text of texts) {
  const urls = text.match(urlRegex()) || [];
  urls.forEach((url) => console.log(`"${url}"`));
}
// url-regex
// texts[0]: succeeds when URLs are not wrapped in parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"
// texts[1]: fails to handle halfwidth parentheses
// "http://foo.com/blah_blah)"
// "http://foo.com/blah_blah_(wikipedia)_(again))"
// texts[2]: fails to handle fullwidth parentheses
// "http://foo.com/blah_blah）"
// "http://foo.com/blah_blah_(wikipedia)_(again)）"

for (const text of texts) {
  const tokens = tokenize(text);
  tokens
    .filter((token) => token.isLink)
    .forEach((token) => console.log(`"${token.v}"`));
}
// linkifyjs
// texts[0]: succeeds when URLs are not wrapped in parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"
// texts[1]: succeeds with halfwidth parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"
// texts[2]: fails to handle fullwidth parentheses
// "http://foo.com/blah_blah）"
// "http://foo.com/blah_blah_(wikipedia)_(again)）"
```
P.S. I added a reference to this issue in linkify: Hypercontext/linkifyjs#460
> Introducing a whole library and trying to integrate it while retaining the original behavior may be quite challenging.

Also, code changes to `linkify()` would not be an issue. (Sadly, this implementation doesn't work with fullwidth symbols...)
```jsx
// Inside linkify(): replace the regex-based splitting with linkifyjs'
// tokenize(). Link tokens become <a> elements; everything else is
// passed through as plain text.
import { tokenize } from 'linkifyjs';

const tokenized = tokenize(str).map((token, i) =>
  token.isLink ? (
    <a key={`link${i}`} href={token.v} {...props}>
      {shortenUrl(token.v, maxLength)}
    </a>
  ) : (
    token.toString()
  )
);
```
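For reference (based on linkifyjs' documented API; worth double-checking against the version we use): `tokenize()` returns an array of token objects, `token.isLink` marks link tokens, and calling `toString()` on a non-link token returns its original source text, so surrounding CJK text and punctuation pass through unchanged.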
Thank you for your thorough investigation and demonstration! My previous comment was about linkify-react. `tokenize()` from `linkifyjs` looks very promising! Let's proceed with `linkifyjs` :+1:
I tried typing the link from the Facebook post into LINE and Telegram. Both messengers included the fullwidth parenthesis in the URL:
So I suppose that is the standard way to handle fullwidth parentheses in URLs. Since neither the regexps nor linkifyjs supports them, there should be no need for us to handle them.
Though halfwidth brackets aren't handled correctly on the website:
The issue in linkifyjs is still open, but it's about fullwidth brackets inside the URL, which doesn't seem to be standard.
So what I suggest is to use linkifyjs to handle halfwidth brackets correctly, and to forget about fullwidth brackets, since nobody supports them. @MrOrz What do you think?
Thanks for the analysis. I think we can proceed with linkifyjs :+1:
OK. I will create a PR by the end of this week.
In https://www.facebook.com/groups/cofacts/posts/3648747782023691/?__cft__[0]=AZWNmv5K_H7F-skP4SOkIgZkb_Zv2i6ot3SXeHigYKawA2MnWSlDmycGq3hfNilD_slvYWqz1M-TriCfusgM7iiguSYqfbf8hBuuDN7Jx98GrMObD8796wLbjw5EJpsyuCzpU12KXm3U_jICgBIgX7KrpyAMBof29c6JJxdT4fDSyBQGDqA4okhj5v4I9uaQN3U&__tn__=%2CO%2CP-R , Cofacts collaborators have pointed out that LINE can actually separate URLs from Mandarin characters as expected.
On the contrary, our current URL matching mechanism just matches all non-space characters following `http://` and `https://`. We should improve this so that URLs followed by Mandarin characters don't break on the website.
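For illustration, a minimal sketch of what that naive behavior amounts to (the actual regex in our codebase may differ; the pattern below is an assumption):

```js
// Sketch of the naive matching: \S+ swallows trailing Mandarin
// characters into the URL, which is exactly the reported problem.
const NAIVE_PATTERN = /https?:\/\/\S+/g;

console.log('請參考http://google.com。謝謝'.match(NAIVE_PATTERN));
// -> [ 'http://google.com。謝謝' ]
```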