firecat53 / urlscan

Mutt and terminal url selector (similar to urlview)

GNU General Public License v2.0

strip whitespaces #124

Closed by balejk 2 years ago

balejk commented 2 years ago

I am using urlscan with neomutt and I received an HTML email in which the href attribute of an <a> tag contains leading and trailing spaces. urlscan extracts the link along with the spaces, which causes xdg-open to treat it as a filename rather than a URL (specifically, the leading spaces cause the problem) and fail with file '...' does not exist.

Arguably, urlscan should always strip surrounding whitespace from links, as it is rarely of interest (although apparently it is allowed in hrefs and can be taken advantage of, for example for styling, see here), or at least offer a command line option to enable this behaviour.
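The requested behaviour can be illustrated with a minimal sketch. It is not urlscan's actual code: the class HrefCollector and the function extract_hrefs are hypothetical names, and the sketch assumes the links are pulled out of the HTML with Python's standard html.parser module.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            url = (value or "").strip()  # drop leading/trailing whitespace
            if name == "href" and url:
                self.urls.append(url)


def extract_hrefs(html):
    """Return the whitespace-stripped href values found in html."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


sample = '<a href="      http://example.net/      ">example</a>'
print(extract_hrefs(sample))  # ['http://example.net/']
```

With the strip() call removed, the extracted value keeps its leading spaces, which is the form that xdg-open rejects in the report above.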

firecat53 commented 2 years ago

Stripping spaces sounds reasonable. Do you have a real example URL that I can use to test with?

balejk commented 2 years ago

This is the most stripped down email message where it occurs:

Date: Mon, 1 Jan 1970 00:00:00 +0000 (UTC)
Content-Type: text/html; charset=UTF-8

<a href="                          http://example.net/                        ">example</a>

firecat53 commented 2 years ago

Interestingly, urlscan and xdg-open seem to handle that case just fine on my machine (Sway/Alacritty/Qutebrowser). This is what it looks like (you can see just one space before the URL; I substituted google.com for the URL to have something to open):

However, both xdg-open and the Python built-in link opener handle it correctly.

Edit: On further testing (inserted pdb.set_trace() into urlchoose.py), it looks like spaces are already being stripped from the URL:

$ urlscan -n test_emails/test_spaces
> /home/firecat53/docs/family/scott/src/projects/urlscan/urlscan/urlchoose.py(858)process_urls()
-> return items, urls
(Pdb) urls
['https://google.com/']

Yeah, it looks like there's something else going on for you, because urlscan is definitely extracting the url correctly from that email without any trailing/leading spaces.

balejk commented 2 years ago

What does your input look like in full? Based on the context shown, I believe it contains only the <a> tag. Please note that urlscan does not recognize that as HTML but treats it as plain text, extracting the URL without parsing it as HTML (presumably using some URL regular expression), which is why the spaces are not included.

This is why the Content-Type header in my example above is important. The Date header matters as well: without it, urlscan does not seem to recognize that the input is in a header-content format, so it does not parse the Content-Type header and does not treat the content as HTML. Specifying the From header instead seems to work as well, so presumably there just needs to be some other (arbitrary) header besides Content-Type.

firecat53 commented 2 years ago

Ah, shoot. I had a leading space in front of the Content-Type header so it didn't get recognized. I see what you're seeing now!
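Why a single leading space hides the header can be sketched with Python's standard email module. This is an assumption about the mechanism, not a statement about urlscan's internals: in RFC 5322 style parsing, a line that starts with whitespace is a continuation of the previous header, so the Content-Type line is folded into Date and the message falls back to the default text/plain type. The helper content_type and the shortened message body are illustrative only.

```python
import email

BODY = '<a href="   http://example.net/   ">example</a>'

# Well-formed message, as in the minimal example from this thread.
good = (
    "Date: Mon, 1 Jan 1970 00:00:00 +0000 (UTC)\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "\n" + BODY
)

# Same message with a stray leading space before Content-Type.
bad = good.replace("\nContent-Type", "\n Content-Type")


def content_type(raw):
    """Return the content type the stdlib parser detects for raw."""
    return email.message_from_string(raw).get_content_type()


print(content_type(good))  # text/html
print(content_type(bad))   # text/plain
```

In the second case the body would be scanned as plain text, which matches the earlier observation that the URL was extracted without any spaces.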

firecat53 commented 2 years ago

I pushed a fix to the develop branch. See if that works for you.

balejk commented 2 years ago

Yes, this seems to work, thank you!