EFForg / https-everywhere

A browser extension that encrypts your communications with many websites that offer HTTPS but still allow unencrypted connections.
https://eff.org/https-everywhere
Other
3.37k stars 1.09k forks source link

(?:$|\?) SyntaxError #18898

Closed jayvdb closed 2 years ago

jayvdb commented 4 years ago

Type: code issue

While working through https://github.com/EFForg/https-everywhere/issues/18897, experimenting with the exclusion from Vox_Media.com.xml in node v8.14.0, the result was a SyntaxError, due to (?:$|\?)

> RegExp("^http://(?:www\.)?voxmedia\.com/(?!(?!.+\.js(?:$|\?)|(?:.*/)?\d{4}/)/*(?:(?:about-vox-media|admin|brands|contact|google_news_sitemap|login|networks|pages|press|racked|reader|recode|rss|security|the-verge|trending|users|vox-creative)(?:$|[?/])|robots\.txt))")
SyntaxError: Invalid regular expression: /^http://(?:www.)?voxmedia.com/(?!(?!.+.js(?:$|?)|(?:.*/)?d{4}/)/*(?:(?:about-vox-media|admin|brands|contact|google_news_sitemap|login|networks|pages|press|racked|reader|recode|rss|security|the-verge|trending|users|vox-creative)(?:$|[?/])|robots.txt))/: Nothing to repeat
> RegExp('(?:$|\?)')
SyntaxError: Invalid regular expression: /(?:$|?)/: Nothing to repeat
> RegExp('(?:\?)')
SyntaxError: Invalid regular expression: /(?:?)/: Nothing to repeat

Node v10.16.0 has similar results for me

> RegExp('(?:$|\?)')
Thrown:
SyntaxError: Invalid regular expression: /(?:$|?)/: Nothing to repeat

When I use JS /../ syntax, the snippet works

> /(?:$|\?)/
/(?:$|\?)/

But using /../ syntax on the longer expressions in the XML fails because / is used without backquoting as \/.

It appears that \? is not literal ? when used in that manner. Replacing it with [?] works.

> RegExp('(?:$|[?])')
/(?:$|[?])/

There are a lot of rules using that (?:$|\?) construct.

> fgrep '(?:$|\?)' *.xml | wc -l
197

(A significant number of those matches are disabled rulesets or commented out from/exclusion, but the scale of the possible problem is clear.)

I assume that there is something I am missing here. I havent done a proper build to test these regex in a browser. Apologies in advance if this is a non-issue.

zoracon commented 4 years ago

Sorry I think I may missing a link, but I am not sure why you are using Node? Either way, these rulesets are compiled into a JSON file that is later called and parsed.

I ran a git log on Vox_Media.com.xml, and it looks like the last time this files was touched was 2016. Since then we have been encouraging users to stop the practice of utilizing really complex regex. So, this may be a case of breaking down and updating this file to the newer standards. https://github.com/EFForg/https-everywhere/blob/master/CONTRIBUTING.md#complicated-regex-in-rules

jayvdb commented 4 years ago

This isnt an isolated case. There are ~197 cases. They appear also in the JSON.

The regex should work in Node as there are npm packages which assume that they work in Node.

zoracon commented 4 years ago

I saw that this is on 197 files, what I am getting at is that I'm betting those files, like Vox_Media.com.xml, are from an older time in the repo where complex regex was accepted. I am not going to pretend to know the many regex nuances between languages since they get parsed slightly differently language to language. In this case I am asking is this breaking the rule itself? If so, feel free to put up a PR or I can attend to these files as well.