dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Consider augmenting regex tests with "polyglot regex corpus" #62971

Open stephentoub opened 2 years ago

stephentoub commented 2 years ago

This is a corpus of over 500k regexes in 8 different dialects:
https://github.com/SBULeeLab/LinguaFranca-FSE19/blob/master/data/production-regexes/uniq-regexes-8.json

It comes from this paper:
https://people.cs.vt.edu/~davisjam/downloads/publications/DavisMichaelCoghlanServantLee-LinguaFranca-ESECFSE19.pdf

We should consider using them to augment our testing of Regex.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions See info in area-owners.md if you want to be subscribed.

danmoseley commented 1 year ago

@stephentoub this repo has an MIT license, but it seems these regexes were harvested from all over. Do we have confidence that they all have permissive licensing?

stephentoub commented 1 year ago

I've not investigated.

danmoseley commented 1 year ago

@davisjam -- I know you and @stephentoub briefly discussed your paper a few years back. I wonder whether there's anything you can say about the licenses that might apply to your 500K harvested regexes. For us to copy them into tests here, they need to be under a permissive license.

I'm guessing they were collected from projects with all kinds of licenses (or no declared license), so there's no way to identify which are permissively licensed?

Another thought -- were there any projects you found that had hundreds or thousands of regexes, where we could go check the license explicitly? We did grab a few, such as an AT&T corpus, but perhaps there's another we haven't seen yet.
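
To illustrate what I mean by finding the big contributors, here is a minimal sketch. It assumes each regex can be mapped back to some source-project identifier, which nothing in this thread confirms the corpus records. The `large_sources` name and the threshold of 100 are made up for illustration.

```python
from collections import Counter


def large_sources(origins, threshold=100):
    """Return (origin, count) pairs with at least `threshold` regexes, largest first.

    `origins` is an iterable with one source-project identifier per regex.
    The identifier format is an assumption; any hashable value works.
    """
    counts = Counter(origins)
    # most_common() sorts by descending count, so the biggest projects come first.
    return [(origin, n) for origin, n in counts.most_common() if n >= threshold]
```

Anything that clears the threshold would be a candidate for a one-off manual license check, rather than trying to classify all 500K entries.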

davisjam commented 1 year ago

@danmoseley They were harvested without an eye to license. However, I'm in the process of updating the corpus and we can grab the licenses along the way. Do you have a specific definition of "permissive"?

danmoseley commented 1 year ago

> Do you have a specific definition of "permissive"?

Essentially, it means excluding copyleft licenses. Licenses like MIT, BSD, and Apache are acceptable.

@richlander do we have a definition or list of licenses that we consider compatible with our project?

danmoseley commented 1 year ago

> updating the corpus and we can grab the licenses along the way.

That would be super helpful -- could you perhaps also tag each regex with its origin (e.g., a link to the repo) along the way?
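
For what it's worth, here is a rough sketch of the kind of filter we could run once entries carry license and origin tags. The one-object-per-line layout, the field names (`pattern`, `license`, `origin`), and the SPDX-style identifiers are all assumptions for illustration, not the corpus's actual schema. The allow-list is only an example, not an official list for this project.

```python
import json

# Hypothetical allow-list of SPDX identifiers; illustrative only, not project policy.
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"}


def permissive_entries(lines):
    """Yield (pattern, origin) for entries whose license is in the allow-list.

    Assumes one JSON object per line with the hypothetical keys
    'pattern', 'license', and 'origin'.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        entry = json.loads(line)
        # A missing or unrecognized license is skipped rather than guessed at.
        if entry.get("license") in PERMISSIVE:
            yield entry["pattern"], entry.get("origin")
```

Keeping the origin next to each pattern means anyone can re-check the license later if the upstream project changes it.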

danmoseley commented 1 year ago

BTW, you almost surely know this since you work with @veanes, but their non-backtracking engine shipped in .NET 7: https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/#backtracking-and-regexoptions-nonbacktracking

Another regex improvement that shipped in .NET 7 was an optional "source generator" that translates patterns into code at compile time rather than generating code at runtime. That can give startup/perf improvements, but it is also a nice way to see what the "real" engine does, as (so far) the behaviors are quite closely aligned.

https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/#source-generation

The regex codebase is now quite clean and hackable, so if regex continues to be an area of research interest for you (or your students), as it sounds like it is, perhaps it might be interesting for them to hack on it as a tool for research.

davisjam commented 1 year ago

@danmoseley I'll ping when the corpus is updated. It will be "sometime this summer".

danmoseley commented 10 months ago

@davisjam wondering whether you had a chance to annotate the corpus with licenses, as you thought you might?