apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.49k stars 1.28k forks source link

IngestionJobSpec: includeFileNamePattern with Regex does not work as documented #10611

Open travis-cook-sfdc opened 1 year ago

travis-cook-sfdc commented 1 year ago

According to the docs, includeFileNamePattern and excludeFileNamePattern are documented like:

Only Files matching this pattern will be included from inputDirURI. Both glob and regex patterns are supported. Examples: Use 'glob:.avro'or 'regex:^..(avro)$' to include all avro files one level deep in the inputDirURI. Alternatively, use 'glob:*/.avro' to include all the avro files in inputDirURI as well as its subdirectories - bear in mind that, with this approach, the pattern needs to match the absolute path. You can use Glob tool or Regex Tool to test out your patterns.

A few issues here:

1️⃣ The example of regex:^..(avro)$ does not actually work. When running a job with this pattern, you'll get an error like this

Caused by: groovy.lang.GroovyRuntimeException: Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): startup failed:
SimpleTemplateScript1.groovy: 1: illegal string body character after dollar sign;
   solution: either escape a literal dollar sign "\$5" or bracket the value expression "${5}" @ line 1, column 10.
   out.print("""
            ^

1 error

I'm assuming this because of the templating that was introduced in #5341 (also not documented) , but job spec's appear to have special handling for both $, which needs to be escaped: \$, and backslashes which are automatically escaped to \\

2️⃣ Related to the above, it's not clear how someone would write a single backslash character in their regex. For example, I think this is an impossible regex to use .*\.parquet$ because it's not clear how to get the single backslash character. \ turns into \\ and \\ stays as \\. This issue can be worked around by using character classes and writing .*[.]parquet$, but it feels wrong.

3️⃣ What flavor of regex is actually being used here? regextester.com linked in the documentation only supports PCRE and Javascript regex. However, I suspect this really java regex, which has different syntax. Given the code uses PathMatcher, it's java regex. Pinot should link to a regex tester that will be accurate

4️⃣ Can you provide some examples of the absolute path I should be matching to? I've submitted an ingestion job spec that has includeFileNamePattern: regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$

I have an s3 file with the following name at the path: s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet

According to regex101.com, this is a match using Java8 syntax: https://regex101.com/r/9ZKOhm/1

It's unclear to me what I'm doing wrong that's causing this pattern to not match.

travis-cook-sfdc commented 1 year ago

I spent a little bit more time with this and now understand why 4️⃣ was an issue.

public class FileTest {

    public static void matches(Path path, String glob){
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher(glob);
        System.out.println(matcher.matches(path));
    }
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet");
        System.out.println(path.toString());
        matches(path, "regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$");
        matches(path, "regex:^s3:/redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$");
    }
}

FileTest.main(new String[] {})

Returns

s3:/redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet
false
true

Because Pinot regex matches on the Java Path object using getPathMatcher, and java path's convert // to /, it's critical that the regex matches that are sent for ingestion are aware of that fact.

I think it would be useful to clean up the documentation significantly.