Closed alter2000 closed 3 years ago
All the regex code is in urlscan/urlscan.py lines 261-286. These are the current horrific regexes:
URLINTERNALPATTERN = r'[{}()@\w/\\\-%?!&.=:;+,#~]'
URLTRAILINGPATTERN = r'[{}(@\w/\-%&=+#]'
HTTPURLPATTERN = (r'(?:(https?|file|ftps?)://' + URLINTERNALPATTERN +
                  r'*' + URLTRAILINGPATTERN + r')')
# Used to guess that blah.blah.blah.TLD is a URL.
....
TLDS = load_tlds()
GUESSEDURLPATTERN = (r'(?:[\w\-%]+(?:\.[\w\-%]+)*\.(?:' +
                     '|'.join(TLDS) + ')$)')
URLRE = re.compile(r'(?:<(?:URL:)?)?(' + HTTPURLPATTERN + '|' +
                   GUESSEDURLPATTERN +
                   r'|(?P<email>(mailto:)?[\w\-.]+@[\w\-.]*[\w\-]))>?',
                   flags=re.U)
Haven't touched these in quite a while :D I've avoided adding the complexity of a config file up to now, but I'm not sure a command line option would be particularly friendly for a regex. Looking at the regexes above, do you have a sense of what the regex might be to detect the URLs you would be filtering?
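For readers trying to answer that question, here is a minimal sketch of how the quoted pieces combine. The patterns are copied from the snippet above; the three-entry `TLDS` list is an assumption standing in for `load_tlds()`, whose body is elided in the quote, so the guessed-URL results depend on that stand-in.

```python
import re

# Patterns copied from the snippet quoted above.
URLINTERNALPATTERN = r'[{}()@\w/\\\-%?!&.=:;+,#~]'
URLTRAILINGPATTERN = r'[{}(@\w/\-%&=+#]'
HTTPURLPATTERN = (r'(?:(https?|file|ftps?)://' + URLINTERNALPATTERN +
                  r'*' + URLTRAILINGPATTERN + r')')

# Assumption: a tiny stand-in for load_tlds(), which is elided in the quote.
TLDS = ['com', 'org', 'net']
GUESSEDURLPATTERN = (r'(?:[\w\-%]+(?:\.[\w\-%]+)*\.(?:' +
                     '|'.join(TLDS) + ')$)')
URLRE = re.compile(r'(?:<(?:URL:)?)?(' + HTTPURLPATTERN + '|' +
                   GUESSEDURLPATTERN +
                   r'|(?P<email>(mailto:)?[\w\-.]+@[\w\-.]*[\w\-]))>?',
                   flags=re.U)

text = "See https://example.com/page?x=1. Mail bob@example.org or visit example.net"

# Pair each match with a flag telling whether the named "email" group fired.
# The guessed pattern ends in "$", so it only matches at the end of the input.
matches = [(m.group(1), m.group('email') is not None)
           for m in URLRE.finditer(text)]
print(matches)
```

The named `email` group is what makes it possible to tell address matches apart from URL matches, which is relevant to any filtering built on top of `URLRE`.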
Thanks, will check out tomorrow. I guess only the HTTPURLPATTERN and the compile call will have to be modified in the beginning for some of my use cases, but I can make a PR for XDG-compliant config some time next week.
Oof, well I'm kinda dumb...forgot that I added in a config file for people that use different palettes :roll_eyes: . I don't use it so it slipped my mind! So you could probably add something in there if that makes sense. It's just a json file.
I've looked at urlchooser.py and as far as I can see, I will have to add the logic to add a regex array to the config file. I was thinking about giving the user access to some of the prebuilt regexes somehow. What would you suggest, do I use an array in the JSON file to chop the regex to make it easier to read and understand, or cut it short somewhere else (maybe a separate file with just the regex), since JSON and regular expression storage don't go very well together without a load of backslashes?
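The backslash concern can be illustrated with a short sketch. The pattern and the `"regex"` key below are hypothetical and simplified; they are not taken from urlscan's config format.

```python
import json
import re

# Hypothetical, simplified pattern; not one of urlscan's own.
pattern = r'(?:https?://)[\w\-.]+'

# Serialising to JSON doubles every backslash in the stored text.
encoded = json.dumps({"regex": pattern})
print(encoded)

# Loading it back restores the original pattern unchanged.
decoded = json.loads(encoded)["regex"]
found = re.compile(decoded).findall("see https://example.com now")
print(found)
```

The round trip is lossless, so the cost is only in readability: anyone editing the JSON by hand has to write `\\w` where the regex means `\w`.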
Even though I don't think it's worth converting to YAML for just this, we could get by with treating the JSON as YAML somewhat easily.
Hmm...it'll be a bit before I can sit down and dig into this, but I have to ask...is this worth the effort for such a restricted use case? Would a slightly modified local version of urlscan installed as urlscan-https be an easier solution?
I just have the free time and enough knowhow to be able to do this. Since I'm going to either use urlscan or urlview anyway, I thought about making it a public fork and eventually merging it.
Simply changing ~8 lines would be much much easier, but I'd rather make it more general (albeit with a chance of new bugs) for all than just changing 2 paragraphs. If you're okay with it, I can work on another config rule for the regex.
What about just using a separate config file 'customregex.py' that just contains those variables? Then if it exists, you can just reg = importlib.import_module('customregex') and set the variables from the file instead. Seems like it's easier doing that than trying to figure out escaping for either a JSON or ConfigParser config file. We can just put a note in the manpage for advanced usage and add a command line switch to generate the customregex.py file.
Thoughts?
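A minimal sketch of that idea, assuming the directory holding customregex.py is already on `sys.path`. The `DEFAULTS` dict, its simplified pattern, and the `load_patterns` helper are illustrative names, not part of urlscan.

```python
import importlib

# Assumption: a simplified stand-in for the module-level patterns in urlscan.
DEFAULTS = {
    "HTTPURLPATTERN": r'(?:(https?|file|ftps?)://\S+)',
}


def load_patterns(module_name="customregex"):
    """Return DEFAULTS, overridden by same-named variables in module_name."""
    patterns = dict(DEFAULTS)
    try:
        custom = importlib.import_module(module_name)
    except ModuleNotFoundError:
        # No custom file found on sys.path: keep the built-in patterns.
        return patterns
    for name in patterns:
        # Only variables the custom module actually defines are replaced.
        patterns[name] = getattr(custom, name, patterns[name])
    return patterns


patterns = load_patterns()
print(patterns["HTTPURLPATTERN"])
```

Because the custom file is plain Python, the patterns can be written as raw strings with no extra escaping, which is the advantage described above. The trade-off is that importing the file executes arbitrary code.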
That seems like the best idea. Will get to it this weekend.
Hold off until you see some commits either on develop or master adding keybindings to the config file. I did some significant refactoring yesterday that hasn't been pushed to Github yet and that might affect what you're working on!
Just to add to this:
I personally don't have much of a use for scanning mailto: links in emails and I often find this clutters things up. I'd love to ignore mailto, either via modifying the regex or via something along the lines of a simple "--nomail" argument.
Thank you again for all your hard work on this!
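One way such a switch could work is to drop matches where the named `email` group fired. This is only a sketch: `--nomail` is the hypothetical flag suggested above, and `URLRE` here is a simplified stand-in that keeps only the email branch of the pattern quoted earlier in the thread.

```python
import argparse
import re

# Simplified stand-in for URLRE; the email branch is copied from the thread.
URLRE = re.compile(
    r'(https?://[^\s<>]+|(?P<email>(mailto:)?[\w\-.]+@[\w\-.]*[\w\-]))')


def extract(text, nomail=False):
    """Return matched URLs, skipping email/mailto matches when nomail is set."""
    return [m.group(1) for m in URLRE.finditer(text)
            if not (nomail and m.group("email"))]


parser = argparse.ArgumentParser()
parser.add_argument("--nomail", action="store_true",
                    help="ignore mailto: links and bare email addresses")
args = parser.parse_args(["--nomail"])  # explicit argv for the demo

text = "Write to mailto:bob@example.org or see https://example.com/docs"
urls = extract(text, nomail=args.nomail)
print(urls)
```

Filtering after matching keeps the regex itself untouched, so the same compiled pattern serves both modes.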
+1 for an argument to ignore mailto.... Has anyone managed to do it?
I've finally got some free time now, so I'm fleshing out ideas to work on this week.
We could have configuration options in $XDG_CONFIG_DIR/urlscan:
- config.json: PITA to write, edit and manage, would not recommend
- [...] regex) that overrides the default config
And/or as a flag:
- path to a file
- Python regex string as argument
I don't know which one is the best fit, since for my use case I'm just hardcoding my regex into the file, although it would be useful to others.
I think my vote is still for doing as I described above:
What about just using a separate config file 'customregex.py' that just contains those variables? Then if it exists, you can just reg = importlib.import_module('customregex') and set the variables from the file instead. Seems like it's easier doing that than trying to figure out escaping for either a JSON or ConfigParser config file. We can just put a note in the manpage for advanced usage and add a command line switch to generate the customregex.py file.
I think this issue might be pretty common. I integrate urlscan with mutt, and it gets its most usage with HTML email with embedded links. Picking some recent HTML emails and sending them through urlscan, I wind up with multiple pages of links, many of which are mailto: or href links in HTML tags, sometimes surrounded by multiple pages of CSS code and other similar things that a regex could help with.
And/or as a flag:
- path to a file
- Python regex string as argument
I'd prefer this approach as it can be generalized/repurposed for the most different use cases.
E.g. I have a keybinding for piping the terminal buffer into urlscan, and it interprets stuff like some_archive.zip as a URL, which I obviously don't want.
It would help if I could just pass --regex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'.
I'll see if I can come up with a functional PR.
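A rough sketch of what such a switch might look like with argparse. The `--regex` option is the proposal from this comment, not an existing urlscan flag as far as this thread shows, and `compiled_regex` is a hypothetical helper.

```python
import argparse
import re


def compiled_regex(value):
    """argparse type: compile value, or report a readable usage error."""
    try:
        return re.compile(value)
    except re.error as err:
        raise argparse.ArgumentTypeError(f"invalid regex: {err}")


parser = argparse.ArgumentParser()
parser.add_argument("--regex", type=compiled_regex, default=None,
                    help="custom Python regex used to find URLs")

# Explicit argv for the demo, using the pattern from the comment above.
args = parser.parse_args([
    r"--regex=http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]"
    r"|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
])

found = args.regex.findall(
    "unpack some_archive.zip then open https://example.com/a now")
print(found)
```

Compiling inside the argparse `type` callback means a malformed pattern is rejected at startup with a usage message rather than failing later during scanning.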
@rpolve - Perhaps the regex option could be available in two places: 1. in the existing config file, for global regex changes, and 2. as a command line switch that overrides the config file (for special use cases, like processing the terminal buffer vs. general email links).
What do you think? Thanks for your interest!!
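The precedence described here could be sketched as follows. The `"regex"` config key, the `choose_regex` helper, and the simplified default pattern are all assumptions for illustration.

```python
import json

# Assumption: a simplified stand-in for the built-in pattern.
DEFAULT_REGEX = r'https?://\S+'


def choose_regex(cli_regex, config_text):
    """Pick the pattern: command line first, then config file, then default."""
    if cli_regex is not None:
        return cli_regex
    config = json.loads(config_text) if config_text else {}
    # "regex" is a hypothetical key name for the global override.
    return config.get("regex", DEFAULT_REGEX)


config_text = '{"regex": "ftp://\\\\S+"}'
print(choose_regex(None, ""))                 # falls back to the default
print(choose_regex(None, config_text))        # config file value
print(choose_regex(r'file://\S+', config_text))  # command line wins
```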
In the existing config file for global regex changes
Sorry, do you mean the --genconf one? Or something else?
Hi. Did you have any chance to take a look at PR #102?
Studying for a promotional exam. It'll be a month or so before I can sit down to any of my projects. Sorry!
I have a couple of use cases here: one that wants only https?:// URLs, one for everything except mailto: links, and any custom protocol prefix. The easiest way to fit all this in is with a custom regex rather than putting them as options. I haven't looked into the code to see how feasible this is and how it works right now, so it might even be easier to create something else entirely to handle custom regexes.
Or is it already implemented? I didn't see anything in the issues or on Google.
Anyway, thanks for the work. It's really nice.
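The three use cases above can all be expressed as one custom pattern built from a list of schemes. This is a sketch under assumptions: `scheme_regex` is a hypothetical helper, `myapp` is a made-up custom protocol, and the URL body is deliberately simplified compared with urlscan's own patterns.

```python
import re


def scheme_regex(schemes):
    """Build a pattern matching URLs whose scheme is one of schemes."""
    alternatives = "|".join(re.escape(s) for s in schemes)
    # Simplified URL body: anything up to whitespace, angle bracket or quote.
    return re.compile(r'\b(?:' + alternatives + r')://[^\s<>"]+')


text = 'mailto:a@b.org http://a.example https://b.example/x myapp://open/42'

https_only = scheme_regex(["https"]).findall(text)
web_only = scheme_regex(["http", "https"]).findall(text)  # no mailto: links
custom = scheme_regex(["myapp"]).findall(text)
print(https_only, web_only, custom)
```

Since mailto: links carry no `://`, any pattern of this shape leaves them out, which covers the "everything except mailto:" case without a dedicated option.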