chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io
Other
2.21k stars 249 forks source link

Small bug in url detection regex #267

Closed BernierCR closed 5 years ago

BernierCR commented 5 years ago

steps to reproduce

sentence ="https://support.t-mobile.com/docs/DOC-2593" sentence = textacy.preprocessing.replace.replace_urls(sentence) sentence = textacy.preprocessing.replace.replace_numbers(sentence)

expected vs. actual behavior

I expect: sentence == 'url'

I get: sentence == 'https://support.t-_url_-_number_'

possible solution?

It should detect this entire thing as a url, but it's having trouble with -. It thinks only this part is the url.

mobile.com/docs/DOC

My regex skills are intermediate and this is an advanced regex,, but whoever wrote it should be able to update it to handle hyphens in urls.

Thank you.

bdewilde commented 5 years ago

Hey @BernierCR , you've found an unusual edge case in the code. The replace_urls() function actually applies two url regexes, first one for shortened URLs (like you get when sharing a link on, say, Twitter) then one for full URLs. The thing that's snagged you here is that shortened urls on Twitter follow a pattern like "t-urlstuff", which your example coincidentally has embedded within it, so the shortened URL pattern gets replaced and the full URL pattern doesn't match.

I've gotten expected behavior by swapping the order of the regexes: first the full URL is matched, then the shortened one. I think this is fine — at the very least, it works for your case, and all my tests pass. So, I'm going to merge it in and wish it luck.

Thanks for catching this bug!