balejk closed this issue 2 years ago
Stripping spaces sounds reasonable. Do you have a real example URL that I can use to test with?
This is the most stripped down email message where it occurs:
Date: Mon, 1 Jan 1970 00:00:00 +0000 (UTC)
Content-Type: text/html; charset=UTF-8

<a href=" http://example.net/ ">example</a>
Interestingly, urlscan and xdg-open seem to handle that case just fine on my machine (Sway/Alacritty/Qutebrowser). This is what it looks like (you can see just one space before the URL; I substituted google.com for the URL to have something to open):
However, both xdg-open and the Python built-in link opener handle it correctly.
Edit: On further testing (inserted pdb.set_trace() into urlchoose.py), it looks like spaces are already being stripped from the URL:
$ urlscan -n test_emails/test_spaces
> /home/firecat53/docs/family/scott/src/projects/urlscan/urlscan/urlchoose.py(858)process_urls()
-> return items, urls
(Pdb) urls
['https://google.com/']
Yeah, it looks like there's something else going on for you, because urlscan is definitely extracting the url correctly from that email without any trailing/leading spaces.
What does your input look like in full? Based on the context shown I believe it only contains the a HTML tag - please note that urlscan does not recognize that as HTML but simply as plain text and extracts the URL without parsing it as HTML (supposedly using some URL regular expression), which leads to spaces not being included.
This is why the Content-Type header in my example above is important (and the Date header is important as well - without it, urlscan does not seem to recognize that the input is in a header-content format and hence does not parse the Content-Type header and does not treat the content as HTML - specifying the From header instead seems to work as well, so supposedly there just needs to be some other (arbitrary) header besides Content-Type).
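To illustrate why a plain-text scan would not pick up the spaces, here is a minimal sketch in Python. The pattern URL_RE and the helper find_urls_plain are made up for this example and are not urlscan's actual regular expression or code; the only point is that a typical URL pattern stops at whitespace, so the spaces inside the href never become part of the match.

```python
import re

# Hypothetical pattern for illustration only; urlscan's real regex differs.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def find_urls_plain(text):
    """Scan text as plain text, ignoring any HTML structure."""
    return URL_RE.findall(text)


body = '<a href=" http://example.net/ ">example</a>'
print(find_urls_plain(body))
```

Because the match starts at the scheme and ends at the first whitespace character, the surrounding spaces are dropped as a side effect rather than by deliberate stripping.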
Ah, shoot. I had a leading space in front of the Content-Type header so it didn't get recognized. I see what you're seeing now!
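The effect of that leading space can be reproduced with Python's standard email module. This is only a sketch, under the assumption that the input is parsed roughly the way the standard library does it; the names GOOD, BAD and content_type are invented for the example. A header line that starts with whitespace is treated as a continuation of the previous header, so Content-Type is never seen and the message falls back to the default type.

```python
from email import message_from_string

# The stripped-down message from this thread.
GOOD = (
    "Date: Mon, 1 Jan 1970 00:00:00 +0000 (UTC)\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "\n"
    '<a href=" http://example.net/ ">example</a>\n'
)

# Same message, but with a stray space in front of the Content-Type header.
BAD = GOOD.replace("Content-Type", " Content-Type", 1)


def content_type(raw):
    """Return the content type the stdlib parser detects for a raw message."""
    return message_from_string(raw).get_content_type()


print(content_type(GOOD))
print(content_type(BAD))
```

With the stray space, the Content-Type line is folded into the Date header, so the body is handled as plain text and the HTML code path is never exercised.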
I pushed a fix to the develop branch. See if that works for you.
Yes, this seems to work, thank you!
I am using urlscan with neomutt and I received an HTML email, where the href attribute of the a HTML tag contains trailing and leading spaces. urlscan extracts the link along with the spaces, which causes xdg-open to consider it a filename rather than a URL (specifically, the leading spaces cause the problem) and fail with file '...' does not exist.
Arguably, urlscan should always strip additional whitespace from links as it is mostly not of interest (although apparently it is allowed in hrefs and can be taken advantage of, for example for styling, see here) or at least offer a command line option to enable this behaviour.
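As a rough sketch of the requested behaviour, the following uses Python's standard html.parser to collect href values and strip surrounding whitespace. The class HrefCollector and the function extract_urls are hypothetical names for this example and do not reflect how urlscan implements its extraction or its fix.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags, stripped of surrounding whitespace."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes without a value, e.g. <a href>
            if name == "href" and value:
                self.urls.append(value.strip())


def extract_urls(html):
    """Return all stripped href values found in the given HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


print(extract_urls('<a href=" http://example.net/ ">example</a>'))
```

Stripping at extraction time means whatever opener receives the URL (xdg-open or anything else) never sees the leading space that made it look like a filename.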