cloudera-labs / envelope

Build configuration-driven ETL pipelines on Apache Spark
Apache License 2.0

NPE on RegexRowRule when a row field is empty #24

Open michelemilesi opened 6 years ago

michelemilesi commented 6 years ago

The regex rule fails with a NullPointerException when a row contains a null (empty) value in a field that the rule is configured to check.

18/04/11 17:57:52 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 3, ****.****.**, executor 4): java.lang.NullPointerException
        at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
        at java.util.regex.Matcher.reset(Matcher.java:309)
        at java.util.regex.Matcher.<init>(Matcher.java:229)
        at java.util.regex.Pattern.matcher(Pattern.java:1093)
        at com.cloudera.labs.envelope.derive.dq.RegexRowRule.check(RegexRowRule.java:55)
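
The top frames of the trace point at java.util.regex itself, so the failure can be reproduced without Spark or Envelope. The sketch below is a minimal illustration only; the class name MatcherNullDemo and the sample pattern are hypothetical and not part of Envelope.

```java
import java.util.regex.Pattern;

class MatcherNullDemo {
    // Returns true if Pattern.matcher rejects a null input with a NullPointerException.
    static boolean throwsOnNull(Pattern pattern) {
        try {
            pattern.matcher((CharSequence) null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The pattern is an arbitrary example; any compiled pattern behaves the same way.
        System.out.println(throwsOnNull(Pattern.compile("[0-9]+")));
    }
}
```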

The class com.cloudera.labs.envelope.derive.dq.RegexRowRule needs a null check in its check(Row row) method, which currently passes the field value straight to pattern.matcher:

  @Override
  public boolean check(Row row) {
    boolean check = true;
    for (String field : fields) {
      String value = row.getAs(field);
      Matcher matcher = pattern.matcher(value);
      check = check && matcher.matches();
      if (!check) {
        // No point continuing if failed
        break;
      }
    }
    return check;
  }
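
One possible shape for the null check is sketched below. To keep it runnable without Spark, a Map stands in for the Row, and the class name NullSafeRegexCheck is hypothetical. It also assumes that a null value should fail the rule rather than be skipped; the maintainers may prefer the opposite behavior.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class NullSafeRegexCheck {
    // Hypothetical stand-in for RegexRowRule.check: a Map plays the role of the Spark Row.
    static boolean check(Pattern pattern, List<String> fields, Map<String, String> row) {
        for (String field : fields) {
            String value = row.get(field);
            // Assumption: a null value counts as a failed check instead of raising an NPE.
            if (value == null || !pattern.matcher(value).matches()) {
                // No point continuing once one field has failed
                return false;
            }
        }
        return true;
    }
}
```

In the real class the same guard would go between row.getAs(field) and pattern.matcher(value), so that the matcher is only built for non-null values.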