Small bug in url detection regex

chartbeat-labs / textacy

NLP, before and after spaCy


Hey @BernierCR, you've found an unusual edge case in the code. The replace_urls() function actually applies two URL regexes: first one for shortened URLs (like you get when sharing a link on, say, Twitter), then one for full URLs. The thing that's snagged you here is that shortened URLs on Twitter follow a pattern like "t-urlstuff", which your example coincidentally has embedded within it. So the shortened-URL pattern matches and replaces that fragment first, and the full-URL pattern then no longer matches what's left.

I've gotten the expected behavior by swapping the order of the regexes: the full URL is matched first, then the shortened one. I think this is fine; at the very least, it works for your case, and all my tests pass. So, I'm going to merge it in and wish it luck.
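
For anyone who wants to see the ordering effect in isolation, here is a minimal sketch. Everything in it is an assumption made for illustration: the two patterns are deliberately simplified stand-ins (not textacy's actual regexes), `apply_in_order` is a hypothetical helper (not replace_urls() itself), and the sample text is invented rather than taken from the report.

```python
import re

# Simplified, hypothetical patterns; NOT the regexes textacy actually uses.
# The short-URL pattern is unanchored, so it can match inside a longer URL.
SHORT_URL = re.compile(r"t\.co/\w+")
# The full-URL pattern needs a scheme and a dotted hostname.
FULL_URL = re.compile(r"https?://[^\s/]+\.[a-z]{2,}(?:/\S*)?")


def apply_in_order(text, patterns, repl="_URL_"):
    """Hypothetical helper: apply each pattern's substitution in sequence."""
    for pattern in patterns:
        text = pattern.sub(repl, text)
    return text


# Invented example: "about.co/team" happens to contain "t.co/team".
text = "read https://about.co/team today"

# Short pattern first: it clobbers part of the hostname, so the full
# pattern no longer finds a dotted hostname to match.
short_first = apply_in_order(text, [SHORT_URL, FULL_URL])
print(short_first)  # read https://abou_URL_ today

# Full pattern first: the whole URL is replaced before the short
# pattern gets a chance to match a fragment of it.
full_first = apply_in_order(text, [FULL_URL, SHORT_URL])
print(full_first)  # read _URL_ today
```

The sketch only shows why order matters when one pattern can match inside the other's territory. Tightening the short-URL pattern (for example, with a word boundary before the domain) would be another way to avoid the collision, but that is not what was done here.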

Thanks for catching this bug!

Small bug in url detection regex #267

steps to reproduce

expected vs. actual behavior

possible solution?