Open stephentoub opened 2 years ago
Author: | stephentoub |
---|---|
Assignees: | - |
Labels: | `area-System.Text.RegularExpressions` |
Milestone: | - |
@stephentoub this repo has an MIT license, but it seems these regexes were harvested from all over. Do we have confidence that they all have permissive licensing?
I've not investigated.
@davisjam -- I know you and @stephentoub briefly discussed your paper a few years back. I wonder whether there's anything you can say about the licenses that might apply to your 500K harvested regexes. For us to copy them into tests here, they need to be under a permissive license.
I'm guessing they're collected from projects with all kinds of licenses (or no declared license) so there's no way to identify which are permissively licensed?
Another thought -- were there any projects you found that had hundreds or thousands of regexes, where we could go check the license explicitly? We did grab a few, such as an AT&T corpus, but perhaps there's another we haven't seen yet.
@danmoseley They were harvested without an eye to license. However, I'm in the process of updating the corpus and we can grab the licenses along the way. Do you have a specific definition of "permissive"?
> Do you have a specific definition of "permissive"?
It's essentially excluding copy-left licenses. Licenses like MIT, BSD and Apache are acceptable.
@richlander do we have a definition or list of licenses that we consider compatible with our project?
> updating the corpus and we can grab the licenses along the way.
That would be super helpful -- could you perhaps also tag each regex with its origin (e.g., a link to the repo) along the way?
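To make the ask concrete, here is a minimal Python sketch of the kind of filtering that license and origin tags would enable. The record fields `pattern`, `license`, and `origin` are hypothetical (the current corpus carries no license data), and the allow-list is only an example built from the licenses named above, not an official list.

```python
import json

# Example allow-list only: SPDX identifiers for the licenses named above
# (MIT, BSD, Apache). This is an assumption, not an official project list.
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"}


def permissive_entries(lines):
    """Yield records whose (hypothetical) `license` field is on the allow-list.

    `lines` is an iterable of JSON strings, one record per line. Records with
    a missing or unlisted license, or with no origin, are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("license") in PERMISSIVE and record.get("origin"):
            yield record


# Made-up sample records for illustration.
sample = [
    '{"pattern": "a+b", "license": "MIT", "origin": "https://example.com/repo-a"}',
    '{"pattern": "c*d", "license": "GPL-3.0-only", "origin": "https://example.com/repo-b"}',
    '{"pattern": "e?f", "origin": "https://example.com/repo-c"}',
]
kept = [r["pattern"] for r in permissive_entries(sample)]
print(kept)
```

Skipping entries with no declared license is the conservative choice here, since an undeclared license cannot be assumed to be permissive.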
BTW, you almost surely know as you work with @veanes but their non-backtracking engine shipped in .NET 7: https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/#backtracking-and-regexoptions-nonbacktracking
Another regex improvement that shipped in .NET 7 was an optional "source generator" that translates patterns into code at compile time rather than generating code at runtime. That can give startup/perf improvements, but it is also a nice way to see what the "real" engine does, as (so far) the behaviors are quite closely aligned.
https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/#source-generation
The regex codebase is now quite clean and hackable, so if regex continues to be an area of research interest for you (or your students), as it sounds like it is, it might be interesting for them to hack on it as a tool for research.
@danmoseley I'll ping when the corpus is updated. Will be "sometime this summer".
@davisjam wondering whether you had a chance to annotate with license as you thought you might?
This is a corpus of over 500k regexes in 8 different dialects:
https://github.com/SBULeeLab/LinguaFranca-FSE19/blob/master/data/production-regexes/uniq-regexes-8.json
from the paper:
https://people.cs.vt.edu/~davisjam/downloads/publications/DavisMichaelCoghlanServantLee-LinguaFranca-ESECFSE19.pdf
We should consider using them to augment our testing of Regex.
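As a rough illustration of what a first pass over such a corpus could look like, the Python sketch below tries to compile every pattern and collects the ones that fail. It assumes one JSON object per line with a `pattern` key, which is an assumption about the file layout rather than something confirmed here, and Python's `re` module is only a stand-in for the engine under test.

```python
import json
import re


def compile_report(lines, key="pattern"):
    """Try to compile each pattern; return (ok_count, failures).

    Assumes one JSON object per line with the pattern stored under `key`
    (an assumed layout). `re` stands in for the engine being tested.
    """
    ok = 0
    failures = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pattern = json.loads(line)[key]
        try:
            re.compile(pattern)
            ok += 1
        except re.error as exc:
            # Keep the pattern and the error text for later triage.
            failures.append((pattern, str(exc)))
    return ok, failures


# Made-up sample lines for illustration.
sample = ['{"pattern": "a+b"}', '{"pattern": "(unclosed"}']
ok, failures = compile_report(sample)
print(ok, [p for p, _ in failures])
```

Patterns written for other dialects will often fail to compile for legitimate reasons, so the failure list is a starting point for triage rather than a list of bugs.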