firecat53 / urlscan

Mutt and terminal url selector (similar to urlview)
GNU General Public License v2.0

Customize regular expression to match URLs #79

Closed alter2000 closed 3 years ago

alter2000 commented 5 years ago

I have a couple of use cases here: one that wants only https?:// URLs, one for everything except mailto: links, and any custom protocol prefix. The easiest way to fit all this in is with a custom regex rather than putting them as options. I haven't looked into the code to see how feasible this is and how it works right now, so it might even be easier to create something else entirely to handle custom regexes.
Or is it already implemented? I didn't see anything in the issues and google.

Anyway, thanks for the work. It's really nice.
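For illustration, the three use cases in this comment could each be written as a small Python regex. This is only a sketch: `BODY`, `HTTP_ONLY`, `NO_MAILTO` and `with_schemes` are hypothetical names, and the patterns are much simpler than the ones urlscan actually uses.

```python
import re

# Hypothetical, deliberately simple character class for the rest of a URL.
BODY = r"[^\s<>\"']+"

# 1. Only http:// and https:// URLs.
HTTP_ONLY = re.compile(r"https?://" + BODY)

# 2. Any scheme except mailto: (the negative lookahead rejects it up front).
NO_MAILTO = re.compile(r"\b(?!mailto:)[a-z][a-z0-9+.\-]*:(?://)?" + BODY, re.I)


# 3. A custom list of protocol prefixes chosen by the user.
def with_schemes(*schemes):
    alternation = "|".join(re.escape(s) for s in schemes)
    return re.compile(r"\b(?:" + alternation + r")://" + BODY)


text = "See https://example.com/a mailto:me@example.com gopher://example.org/1"

print(HTTP_ONLY.findall(text))               # ['https://example.com/a']
print(NO_MAILTO.findall(text))               # ['https://example.com/a', 'gopher://example.org/1']
print(with_schemes("gopher").findall(text))  # ['gopher://example.org/1']
```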

firecat53 commented 5 years ago

All the regex code is in urlscan/urlscan.py lines 261-286. These are the current horrific regexes:

URLINTERNALPATTERN = r'[{}()@\w/\\\-%?!&.=:;+,#~]'
URLTRAILINGPATTERN = r'[{}(@\w/\-%&=+#]'
HTTPURLPATTERN = (r'(?:(https?|file|ftps?)://' + URLINTERNALPATTERN +
                  r'*' + URLTRAILINGPATTERN + r')')
# Used to guess that blah.blah.blah.TLD is a URL.
....
TLDS = load_tlds()
GUESSEDURLPATTERN = (r'(?:[\w\-%]+(?:\.[\w\-%]+)*\.(?:' +
                     '|'.join(TLDS) + ')$)')
URLRE = re.compile(r'(?:<(?:URL:)?)?(' + HTTPURLPATTERN + '|' +
                   GUESSEDURLPATTERN +
                   r'|(?P<email>(mailto:)?[\w\-.]+@[\w\-.]*[\w\-]))>?',
                   flags=re.U)

Haven't touched these in quite a while :D I've avoided adding the complexity of a config file up to now, but I'm not sure a command line option would be particularly friendly for a regex. Looking at the regexes above, do you have a sense of what the regex might be to detect the URLs you would be filtering?
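To see what these patterns do in practice, here is a minimal sketch that rebuilds `URLRE` from the lines quoted above and separates email matches from other URLs via the named `email` group. Two things are assumptions: the `TLDS` list is a hypothetical stand-in for `load_tlds()` (whose output is elided above), and `split_matches` is an illustrative helper, not part of urlscan.

```python
import re

URLINTERNALPATTERN = r'[{}()@\w/\\\-%?!&.=:;+,#~]'
URLTRAILINGPATTERN = r'[{}(@\w/\-%&=+#]'
HTTPURLPATTERN = (r'(?:(https?|file|ftps?)://' + URLINTERNALPATTERN +
                  r'*' + URLTRAILINGPATTERN + r')')
TLDS = ["com", "org", "net"]  # assumption: stand-in for load_tlds()
GUESSEDURLPATTERN = (r'(?:[\w\-%]+(?:\.[\w\-%]+)*\.(?:' +
                     '|'.join(TLDS) + ')$)')
URLRE = re.compile(r'(?:<(?:URL:)?)?(' + HTTPURLPATTERN + '|' +
                   GUESSEDURLPATTERN +
                   r'|(?P<email>(mailto:)?[\w\-.]+@[\w\-.]*[\w\-]))>?',
                   flags=re.U)


def split_matches(text):
    """Return (urls, emails); a match counts as email if the 'email' group is set."""
    urls, emails = [], []
    for m in URLRE.finditer(text):
        (emails if m.group("email") else urls).append(m.group(1))
    return urls, emails


urls, emails = split_matches(
    "Docs at https://example.com/docs. Mail mailto:me@example.com")
print(urls)    # ['https://example.com/docs']  (trailing period is excluded)
print(emails)  # ['mailto:me@example.com']
```

The trailing period is dropped because `.` is allowed by `URLINTERNALPATTERN` but not by `URLTRAILINGPATTERN`, which must match the last character.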

alter2000 commented 5 years ago

Thanks, will check out tomorrow. I guess only the HTTPURLPATTERN and the compile call will have to be modified in the beginning for some of my use cases, but I can make a PR for XDG-compliant config some time next week.

firecat53 commented 5 years ago

Oof, well I'm kinda dumb...forgot that I added in a config file for people that use different palettes :roll_eyes: . I don't use it so it slipped my mind! So you could probably add something in there if that makes sense. It's just a json file.

alter2000 commented 5 years ago

I've looked at urlchooser.py and as far as I can see, I will have to add the logic to add a regex array to the config file. I was thinking about giving the user access to some of the prebuilt regexes somehow. What would you suggest, do I use an array in the JSON file to chop the regex to make it easier to read and understand, or cut it short somewhere else (maybe a separate file with just the regex), since JSON and regular expression storage don't go very well together without a load of backslashes?

Even though I don't think it's worth converting to YAML for just this, we could get by with treating the JSON as YAML somewhat easily.
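The backslash problem and the "array of fragments" idea can be sketched with the standard `json` module. The `"regex"` key and the example pattern are assumptions made for illustration; they are not urlscan's actual config schema.

```python
import json
import re

pattern = r"https?://[\w\-./%?=&#]+"

# Stored as one JSON string, every backslash in the regex is doubled.
encoded = json.dumps({"regex": pattern})
print(encoded)  # {"regex": "https?://[\\w\\-./%?=&#]+"}

# Alternative: store the regex as a list of fragments and join them on load.
# The backslashes are still doubled, but each piece is easier to read.
config_text = json.dumps({"regex": [r"https?://", r"[\w\-./%?=&#]+"]}, indent=2)
loaded = json.loads(config_text)
compiled = re.compile("".join(loaded["regex"]))

print(compiled.findall("go to https://example.com/a?b=1 now"))
# ['https://example.com/a?b=1']
```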

firecat53 commented 5 years ago

Hmm...it'll be a bit before I can sit down and dig into this, but I have to ask...is this worth the effort for such a restricted use case? Would a slightly modified local version of urlscan installed as urlscan-https be an easier solution?

alter2000 commented 5 years ago

I just have the free time and enough knowhow to be able to do this. Since I'm going to either use urlscan or urlview anyway, I thought about making it a public fork and eventually merging it.

Simply changing ~8 lines would be much much easier, but I'd rather make it more general (albeit with a chance of new bugs) for all than just changing 2 paragraphs. If you're okay with it, I can work on another config rule for the regex.

firecat53 commented 5 years ago

What about just using a separate config file 'customregex.py' that just contains those variables? Then, if it exists, you can just reg = importlib.import_module('customregex') and set the variables from the file instead. Seems like it's easier doing that than trying to figure out escaping for either a JSON or ConfigParser config file. We can just put a note in the manpage for advanced usage and add a command line switch to generate the customregex.py file.

Thoughts?
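A minimal sketch of that idea, assuming the override file lives in a config directory and defines `HTTPURLPATTERN`. The helper `load_http_pattern`, the simplified default pattern, and the temporary directory are all hypothetical and only there to make the example runnable.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Simplified default, not urlscan's real pattern.
DEFAULT_HTTPURLPATTERN = r"(?:https?|file|ftps?)://\S+"


def load_http_pattern(config_dir):
    """Return HTTPURLPATTERN from config_dir/customregex.py, else the default."""
    if not (Path(config_dir) / "customregex.py").is_file():
        return DEFAULT_HTTPURLPATTERN
    sys.path.insert(0, str(config_dir))
    try:
        importlib.invalidate_caches()
        reg = importlib.import_module("customregex")
    finally:
        sys.path.remove(str(config_dir))
    # Fall back to the default if the file does not define the variable.
    return getattr(reg, "HTTPURLPATTERN", DEFAULT_HTTPURLPATTERN)


with tempfile.TemporaryDirectory() as tmp:
    no_override = load_http_pattern(tmp)  # no file yet: default is used
    Path(tmp, "customregex.py").write_text('HTTPURLPATTERN = r"https://\\S+"\n')
    overridden = load_http_pattern(tmp)   # file exists: its value wins

print(no_override)  # (?:https?|file|ftps?)://\S+
print(overridden)   # https://\S+
```

One trade-off worth keeping in mind: importing the file executes whatever Python it contains, so this only suits a file the user controls.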

alter2000 commented 5 years ago

That seems like the best idea. Will get to it this weekend.

firecat53 commented 5 years ago

Hold off until you see some commits either on develop or master adding keybindings to the config file. I did some significant refactoring yesterday that hasn't been pushed to Github yet and that might affect what you're working on!

rslindee commented 5 years ago

Just to add to this:

I personally don't have much of a use for scanning mailto: links in emails and I often find this clutters things up. I'd love to ignore mailto, either via modifying the regex or via something along the lines of a simple "--nomail" argument.

Thank you again for all your hard work on this!
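A "--nomail" switch along those lines could be sketched as below. The flag is the one proposed in this comment; `URLRE` here is a simplified stand-in that only keeps the named `email` group from the original pattern, and `extract` is a hypothetical helper.

```python
import argparse
import re

# Simplified stand-in for URLRE; only the named "email" group mirrors the original.
URLRE = re.compile(
    r"(?P<email>(?:mailto:)?[\w\-.]+@[\w\-.]*[\w\-])|(?:https?://[^\s<>]+)")


def extract(text, nomail=False):
    """Return all matches, dropping email/mailto matches when nomail is set."""
    return [m.group(0) for m in URLRE.finditer(text)
            if not (nomail and m.group("email"))]


parser = argparse.ArgumentParser()
parser.add_argument("--nomail", action="store_true",
                    help="ignore mailto: links and bare email addresses")
args = parser.parse_args(["--nomail"])

text = "Contact mailto:bob@example.com or visit https://example.com/x"
print(extract(text))               # ['mailto:bob@example.com', 'https://example.com/x']
print(extract(text, args.nomail))  # ['https://example.com/x']
```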

rafaeluriarte commented 5 years ago

+1 for an argument to ignore mailto.... Has anyone managed to do it?

alter2000 commented 5 years ago

I've finally got some free time now, so I'm fleshing out ideas to work on this week.

We could have configuration options in $XDG_CONFIG_HOME/urlscan:

And/or as a flag:

  • path to a file
  • Python regex string as argument

I don't know which one is the best fit, since for my use case I'm just hardcoding my regex into the file, although it would be useful to others.

firecat53 commented 5 years ago

I think my vote is still for doing as I described above:

What about just using a separate config file 'customregex.py' that just contains those variables? Then, if it exists, you can just reg = importlib.import_module('customregex') and set the variables from the file instead. Seems like it's easier doing that than trying to figure out escaping for either a JSON or ConfigParser config file. We can just put a note in the manpage for advanced usage and add a command line switch to generate the customregex.py file.

kylebarbour commented 4 years ago

I think this issue might be pretty common. I integrate urlscan with mutt, and it gets its most usage with HTML email with embedded links. Picking some recent HTML emails and sending them through urlscan, I wind up with multiple pages of links, many of which are mailto: or href links in HTML tags, sometimes surrounded by multiple pages of CSS code and other similar things that a regex could help with.

rpolve commented 3 years ago

And/or as a flag:

  • path to a file
  • Python regex string as argument

I'd prefer this approach as it can be generalized/repurposed for the most different use cases.

E.g. I have a keybinding for piping terminal buffer into urlscan, and it interprets stuff like some_archive.zip as URL, which I don't desire obviously.

It would help if I could just --regex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'.

I'll see if I can come up with a functional PR.
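As a rough sketch of how such a flag might behave (not the code from the eventual PR), `argparse` can compile the pattern while parsing so that an invalid regex is rejected early. `regex_type` is a hypothetical helper name.

```python
import argparse
import re


def regex_type(value):
    """Compile the --regex argument, turning re.error into an argparse error."""
    try:
        return re.compile(value)
    except re.error as exc:
        raise argparse.ArgumentTypeError(f"invalid regex: {exc}")


parser = argparse.ArgumentParser()
parser.add_argument("--regex", type=regex_type, default=None,
                    help="custom Python regex used to find URLs")

args = parser.parse_args([
    r"--regex=http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]"
    r"|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
])

text = "unpack some_archive.zip then open https://example.com/path?q=1"
print(args.regex.findall(text))  # ['https://example.com/path?q=1']
```

Note that `$-_` inside the character class is a range from `$` to `_`, so it also covers characters such as `/`, `:`, `?` and `=`, which is why the path and query string are matched.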

firecat53 commented 3 years ago

@rpolve - Perhaps make the regex option available in two places: (1) in the existing config file, for global regex changes, and (2) as a command line switch that overrides the config file (for special use cases, like processing the terminal buffer vs. general email links).

What do you think? Thanks for your interest!!
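The precedence described here (command line over config file over built-in default) can be sketched as follows. The `"regex"` config key, `DEFAULT_REGEX` and `resolve_regex` are assumptions for illustration, not urlscan's actual names.

```python
import argparse
import json

DEFAULT_REGEX = r"https?://\S+"  # simplified built-in default


def resolve_regex(cli_value, config):
    """Command line wins, then the config file, then the built-in default."""
    if cli_value is not None:
        return cli_value
    return config.get("regex", DEFAULT_REGEX)


# Stand-in for the parsed JSON config file (hypothetical "regex" key).
config = json.loads(r'{"regex": "(?:https?|ftp)://\\S+"}')

parser = argparse.ArgumentParser()
parser.add_argument("--regex", default=None)

from_cli = resolve_regex(parser.parse_args([r"--regex=https://\S+"]).regex, config)
from_config = resolve_regex(parser.parse_args([]).regex, config)
from_default = resolve_regex(parser.parse_args([]).regex, {})

print(from_cli)      # https://\S+
print(from_config)   # (?:https?|ftp)://\S+
print(from_default)  # https?://\S+
```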

rpolve commented 3 years ago

In the existing config file for global regex changes

Sorry, do you mean the --genconf one? Or something else?

rpolve commented 3 years ago

Hi. Did you have any chance to take a look at PR #102?

firecat53 commented 3 years ago

Studying for a promotional exam. It'll be a month or so before I can sit down to any of my projects. Sorry!