Open kaspersorensen opened 6 years ago
I made a little tool to spit out some tentative output that would fit into the conf.xml format. Not great code, but works as a one-off:
import java.io.File;
import java.io.FileInputStream;

import org.datacleaner.util.StringUtils;
import org.datacleaner.util.xml.XmlUtils;
import org.junit.Test;
import org.springframework.util.xml.DomUtils;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RegexSwapDataExtractor {

    @Test
    public void testExtractRegexPatterns() throws Exception {
        final File regexSwapDump = new File("src/test/resources/old-regexswap-patterns.xml");
        final Document doc = XmlUtils.parseDocument(new FileInputStream(regexSwapDump));
        final NodeList regexNodes = doc.getElementsByTagName("regex");
        // Print one conf.xml <regex-pattern> element per <regex> node in the dump
        for (int i = 0; i < regexNodes.getLength(); i++) {
            final Node regexNode = regexNodes.item(i);
            final String str = toConfXmlRegexPattern((Element) regexNode);
            System.out.print(str);
        }
    }

    private String toConfXmlRegexPattern(Element regexNode) {
        // Target format:
        // <regex-pattern name="Website URL" description="Matches a HTTP or HTTPS based URL for a website. Does not
        // handle HTTP query parameters.">
        // <expression>^https?://[a-z0-9_-][\.[a-z0-9_-]]*\.(com|edu|org|net|int|info|eu|biz|mil|gov|aero|travel|pro|name|museum|coop|asia|[a-z][a-z])+(:[0-9]+)?[/[a-zA-Z0-9\._#-]]*/?$</expression>
        // </regex-pattern>
        String name = DomUtils.getChildElementByTagName(regexNode, "name").getTextContent();
        // Normalize whitespace characters to spaces, then collapse double spaces
        name = StringUtils.replaceAll(StringUtils.replaceWhitespaces(name, " "), "  ", " ");
        name = trim(name);

        String expression = DomUtils.getChildElementByTagName(regexNode, "expression").getTextContent();
        expression = trim(expression);

        String description = DomUtils.getChildElementByTagName(regexNode, "description").getTextContent();
        description = trim(description);
        // Multi-line descriptions do not fit in an attribute, so drop them
        if (description.indexOf('\n') != -1) {
            description = null;
        }

        return (description == null
                ? "\n<regex-pattern name=\"" + name + "\">"
                : "\n<regex-pattern name=\"" + name + "\" description=\"" + description + "\">")
                + "\n\t<expression>" + expression + "</expression>"
                + "\n</regex-pattern>";
    }

    private String trim(String str) {
        str = str.trim();
        // Escape angle brackets so the value is safe inside the XML output
        str = str.replace("<", "&lt;").replace(">", "&gt;");
        return str;
    }
}
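Since the name and description end up inside XML attributes, ampersands and double quotes in the dump could also break the generated conf.xml. A minimal sketch of a slightly broader escape step, assuming a hypothetical helper named `escapeXml` (not part of DataCleaner or the tool above):

```java
// Hypothetical helper, shown only as a sketch; not part of DataCleaner.
class XmlEscapeSketch {

    // Trims the value and escapes the characters that are unsafe in XML
    // text and in double-quoted attribute values. The ampersand is replaced
    // first so the entities produced by later steps are not escaped again.
    static String escapeXml(String str) {
        return str.trim()
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(escapeXml(" a < b & \"c\" "));
        // prints: a &lt; b &amp; &quot;c&quot;
    }
}
```

Whether the old RegexSwap dump actually contains such characters is not known here; for a one-off conversion it may be simpler to eyeball the output.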
I'm not sure it really matters, since no one ever really contributed patterns, but wouldn't it be really easy to just dump RegexSwap onto a GitHub Pages site? That way it's also easy to contribute a pattern: just make a PR against the RegexSwap GH Pages repo.
(Of course, dynamic things like voting would not survive, but I'm not sure that's really a big loss. Patterns could be discussed in the repo's issues list and improved through PRs, which seems better than a simple voting system.)
Good point. We could even put it up on https://datacleaner.github.io somewhere, just like the new version endpoint that it has (https://datacleaner.github.io/meta/versions.json) which I intended for something similar (update notifications).
Cool. Both separate and combined make perfect sense, so let's go with whatever you prefer :)
I've made the regexes available at https://datacleaner.github.io/content/regexes.json
Or for source code access: https://github.com/datacleaner/datacleaner.github.io/blob/master/content/regexes.json
It seems to me that they're not very well maintained though. I'm gonna do a bit of cleanup in the descriptions and such, but I'm sure more people than me can help too, so let this be an open invite to any contributor to pitch in with their good regex contributions :-)
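For anyone pitching in, one cheap sanity check before opening a PR is to confirm that an expression at least compiles with Java's regex engine. A minimal sketch, assuming the patterns are meant for `java.util.regex` (the class and method names below are hypothetical, not part of DataCleaner):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, shown only as a sketch; not part of DataCleaner.
class RegexCheckSketch {

    // Returns true if the expression is syntactically valid for java.util.regex.
    static boolean compiles(String expression) {
        try {
            Pattern.compile(expression);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("^[a-z]+$")); // prints: true
        System.out.println(compiles("[a-z"));     // prints: false (unclosed character class)
    }
}
```

This only catches syntax errors; whether a pattern matches what its description claims still needs a human look or a few sample inputs.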
Now that RegexSwap is no longer available, should we just put all those regexes into the application itself?
I've gone ahead and queried the regexes just to be able to preserve them for future use: