Closed BernierCR closed 5 years ago
Hey @BernierCR , you've found an unusual edge case in the code. The replace_urls()
function actually applies two url regexes, first one for shortened URLs (like you get when sharing a link on, say, Twitter) then one for full URLs. The thing that's snagged you here is that shortened urls on Twitter follow a pattern like "t-urlstuff", which your example coincidentally has embedded within it, so the shortened URL pattern gets replaced and the full URL pattern doesn't match.
I've gotten expected behavior by swapping the order of the regexes: first the full URL is matched, then the shortened one. I think this is fine — at the very least, it works for your case, and all my tests pass. So, I'm going to merge it in and wish it luck.
Thanks for catching this bug!
steps to reproduce
sentence ="https://support.t-mobile.com/docs/DOC-2593" sentence = textacy.preprocessing.replace.replace_urls(sentence) sentence = textacy.preprocessing.replace.replace_numbers(sentence)
expected vs. actual behavior
I expect: sentence == 'url'
I get: sentence == 'https://support.t-_url_-_number_'
possible solution?
It should detect this entire thing as a url, but it's having trouble with -. It thinks only this part is the url.
mobile.com/docs/DOC
My regex skills are intermediate and this is an advanced regex,, but whoever wrote it should be able to update it to handle hyphens in urls.
Thank you.